File this one under "the devil is in the details."
While doing some Python hacking, I found this interesting article on a library called defusedxml which defends Python against some XML processing based attacks.
I was curious about how it might affect PHP (which, like most scripting languages, uses libxml), and I found this StackOverflow thread on trying to avoid these same attacks.
There are two related vulnerabilities, and both are pretty old (almost 10 years). One's called "Billion Laughs" and the other is "quadratic blowup."
The short of it is that a specially-crafted XML document can make a server consume a LOT of memory very quickly. This attack is courtesy of XML entities (like &amp; in (X)HTML). Part of XML is that you can declare a bunch of your own entities at the start of a document, and the parser will expand them wherever they're referenced.
If you make a document where &a; expands to aaaaa… 10,000 times, then it's pretty easy to see how it can eat up memory.
And if you make &a; expand to &b;, which expands back to &a;, you get a loop, and the parser just goes away and chews up CPU and memory till the cows come home. Nowadays parsers do recursion checks to avoid this, but certain flags can still make the problem happen.
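To put rough numbers on the first case, here's a small back-of-the-envelope sketch. It doesn't build or parse anything; it just does the arithmetic. The function name and the choice of 10,000 references are my own assumptions for illustration, and the input size ignores the few dozen bytes of surrounding markup.

```python
def quadratic_blowup_sizes(entity_len: int, references: int, name: str = "a"):
    """Estimate (input_bytes, expanded_bytes) for one big entity used many times.

    Hypothetical helper: it counts characters only and ignores the small
    amount of DOCTYPE and root-element markup around them.
    """
    reference = f"&{name};"  # e.g. "&a;" costs 3 bytes per use
    input_bytes = entity_len + references * len(reference)
    expanded_bytes = entity_len * references  # each reference becomes the full entity
    return input_bytes, expanded_bytes


# Assumed scenario: a 10,000-character entity referenced 10,000 times.
sent, expanded = quadratic_blowup_sizes(10_000, 10_000)
print(f"attacker sends ~{sent:,} bytes, parser produces ~{expanded:,} bytes")
```

So roughly 40 KB on the wire turns into about 100 MB in memory, which is why a plain request-size limit on its own doesn't help much here.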
How to fix it
Does your webapp receive and process XML? You'd better know how your language/framework handles XML and whether or not you can get hit by this.
It looks like the most common approach is to use the parser to remove the DOCTYPE directive at the start of the document. Pretty easy, as it shows up as a separate node in the DOM. Most parsers will protect against the "billion laughs" attack, but it's probably worth writing a test for that too.
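Here's a minimal sketch of that idea in Python using only the standard library. Note that it's a stricter variant than stripping the node: it refuses any document that carries a DOCTYPE at all. The names `DoctypeForbidden` and `parse_untrusted_xml` are made up for this example, and I'm assuming your app never needs a DTD in the XML it accepts.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document carries a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # expat calls this handler when it reaches the start of a DOCTYPE
    # declaration, so we bail out before any document content is parsed.
    raise DoctypeForbidden(f"DOCTYPE not allowed (declared root: {name!r})")


def parse_untrusted_xml(data: bytes) -> ET.Element:
    # First pass: a bare expat parser whose only job is to spot a DOCTYPE.
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(data, True)
    # Second pass: no DOCTYPE means no custom entity declarations to expand.
    return ET.fromstring(data)
```

This parses the input twice, which is wasteful but keeps the sketch short. And it doubles as the test suggested above: feed it a tiny document with an entity declaration and check that it raises.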
The root of the problem: XML isn't strictly a data format
XML is also a document structure specification language (that's what the DOCTYPE block is for). It's also a way to mix different types of documents together (namespaces and DTDs). Depending on what you build, you might need some of these features to do your job.
That said… if you don't need those features, turn them off, or use a library that does so automatically (like defusedxml).
Your Rails friends might seem a little grouchy lately: they found out the hard way that YAML is not strictly a data format. It's also a serialization format. Without safeguards, the YAML parser will happily create new objects, any object you like! How about creating some objects that run shell commands on the remote server? Too easy. It's as if someone printed in big letters, "what do you want to hack today?" (See the Kalzumeus article I link to at the end of this if you don't think this is a big deal.)
Et tu, JSON?
Actually, JSON is pretty safe. The main danger is receiving a huge and deeply nested document, but most JSON implementations are efficient enough to handle those. It might make sense to put a size limit on the JSON document accepted (are ya really gonna need to accept JSON docs bigger than a few hundred kilobytes?)
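A size limit like that is only a few lines. This is a sketch, not a recommendation for a specific number: the 256 KB cap and the name `load_untrusted_json` are assumptions of mine. The `RecursionError` handling covers the deep-nesting case, which is how CPython's `json` module tends to fail on it.

```python
import json

MAX_JSON_BYTES = 256 * 1024  # assumed limit; tune it for your own app


def load_untrusted_json(raw: bytes, max_bytes: int = MAX_JSON_BYTES):
    # Check the size before handing anything to the parser.
    if len(raw) > max_bytes:
        raise ValueError(f"JSON body is {len(raw)} bytes, limit is {max_bytes}")
    try:
        return json.loads(raw)
    except RecursionError as exc:
        # In CPython, extremely deep nesting can surface as RecursionError;
        # turn it into the same error type as the size check.
        raise ValueError("JSON document is nested too deeply") from exc
```

Malformed input still raises `json.JSONDecodeError`, which is itself a subclass of `ValueError`, so callers can catch one exception type for all three failure modes.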
JSON is strictly a data format. It doesn't support self-references, and there's only one type of document: a JSON object containing strings, arrays, and other objects. JSON makes no effort to link its objects to actual program objects. If you want to do that, you're on your own, but at least you'll have fine-grained control over where you send the user input.
There are more attacks on XML: XML Denial of Service Attacks and Defenses
Even if you're not a Rails shop, understanding how the YAML-parsing vulnerability works should be a part of your security efforts.
Update (21 Feb 2013):
My colleague reminds me that JSON is not entirely safe, at least where there's a browser involved. Forgot about this one!