This post describes something seemingly trivial, the lack of proper escaping of the single quote (U+0027) in many commonly used string escaping libraries, that has very serious side effects (the potential for cross-site scripting).
If you are a developer or have some familiarity with web security, all you need to know is that the Apache Commons StringEscapeUtils escapeHtml method doesn’t escape the single quote, and that many other libraries using this method or replicating its functionality have the same defect. As an aside, the ECMAScript escape functions don’t follow the standard practice of escaping the left and right angle brackets as \x3c and \x3e, but that is a less serious issue because the forward slash is escaped, making it difficult (but not impossible) to break out of a quoted script tag. Nevertheless, this escaping is also insufficient.
I think the history of this issue is interesting. From the very beginning until the present, single and double quotes have been interchangeable in the HTML standards, which defined multiple (!) ways of escaping a character, or of specifying a character reference: one can reference the character with a decimal encoding (&#60;) or a hex encoding (&#x3c;), and to make life easier for developers, a subset of commonly used characters was given an “entity” encoding that consists of an English language mnemonic (e.g. &lt; for “less than”). Obviously most developers immediately began to use the entity encoding (it was designed to be easy to use), and pretty much forgot about the other encodings. However, the single quote, even though it is an html control character and can be used to delimit html attribute values, was not assigned an entity name. To encode the single quote, one needed to use the decimal or hex ASCII code in a numeric character reference. Named entities were never intended to fulfill all encoding needs, but to “html encode a character” quickly became identified with “replace a character by its named html entity”.
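To make the two numeric forms concrete, here is a small sketch in Java. The class and method names are made up for this post and are not part of any library; it only shows how a decimal or hex reference is built from a character's code.

```java
// Hypothetical helper for illustration only; not part of any library.
class NumericCharRef {
    // Decimal numeric character reference, built from the character's code.
    static String decimal(char c) {
        return "&#" + (int) c + ";";
    }

    // Hexadecimal numeric character reference for the same character.
    static String hex(char c) {
        return "&#x" + Integer.toHexString(c) + ";";
    }

    public static void main(String[] args) {
        // The single quote has no named entity in older HTML,
        // so these two forms are the only portable options.
        System.out.println(decimal('\''));
        System.out.println(hex('\''));
    }
}
```

For the single quote (U+0027) this yields the decimal reference with code 39 and the hex reference with code 27.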
The result was two types of confusion. First, many developers came to believe that the single quote was not an html control character at all.
Second, and more seriously, the original author(s) of the Apache StringEscapeUtils implemented a function that only replaced characters if they had a named entity. This was extremely inefficient because over time developers demanded many more named entities (such as currency symbols, foreign language accents, math symbols, emoji, etc.), so this function became much more expensive than it needs to be (really only <, >, &, " and ' need to be escaped to safely embed a string within valid html). Also, because the single quote was not escaped, constructions such as the following JSP fragment:
<a href='#<%= StringEscapeUtils.escapeHtml(v.name) %>'><%= StringEscapeUtils.escapeHtml(v.title) %></a>
are vulnerable: if the name field is controlled by an attacker, they can break out of the href value by closing the single quote and then injecting their own attributes (for example, an event handler).
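For comparison, a minimal escaper that covers only the five characters listed above might look like the sketch below. This is my own illustration, not the Apache Commons implementation; the class name is hypothetical, and I use the numeric reference for the single quote because it does not depend on a named entity being supported.

```java
// Hypothetical minimal escaper; an illustration, not the Apache Commons code.
class MinimalHtmlEscaper {
    // Escapes only the five characters that matter for html text and
    // quoted attribute values: & < > " and '.
    static String escape(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // numeric reference, no named entity needed
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An attacker-controlled value that tries to close the single quote.
        String name = "x' onmouseover='alert(1)";
        System.out.println("<a href='#" + escape(name) + "'>link</a>");
    }
}
```

With the single quote encoded, the attacker's value stays inside the href attribute instead of terminating it.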
The independent browser vendors were the good guys here: they supported an unofficial html entity, &apos;. However, this entity was not part of the HTML standard and was not supported by Internet Explorer (until HTML 5 support came along), so authors couldn’t reliably use &apos; to escape the single quote. The widely used Apache Commons Lang library continued to leave the single quote unescaped in its html string escaping routines. Many other string escaping libraries, such as those used in Spring MVC and Wicket, copied the Apache string escape library or re-implemented it (sometimes in painstaking detail, including all the odd-ball smiley faces available) while continuing to ignore the single quote.
This led to a raft of vulnerabilities in which third-party code, in many languages, did not properly html escape user input. For example, EJS did not escape the single quote in its default html encoding until recently.
If you are a maintainer of a framework or JSP/JSF stack, please check the implementation of your html escaping functions and make sure that you are encoding the single quote. If you are a user of such a framework, please verify that the encoding is working correctly or else you will need to introduce your own encoding function. A cursory glance at open source code repositories reveals many widely used projects with vulnerable code, particularly in the middleware space. You must look at the implementation in order to determine whether the single quote is escaped, as there are many functions with slightly different names that escape slightly different sets of html entities (all a waste) but will not escape the single quote at all.
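One quick way to run that check is to feed the escaping function a probe string containing a single quote and look for a raw quote in the output. The sketch below assumes the function under test can be wrapped as a String-to-String operator; the class name and the two stand-in escapers are hypothetical, so substitute your framework's real function.

```java
import java.util.function.UnaryOperator;

// Hypothetical check; wrap your framework's escaping function in a lambda.
class SingleQuoteCheck {
    // Returns true when the escaper leaves no raw single quote in its output.
    static boolean escapesSingleQuote(UnaryOperator<String> escaper) {
        String escaped = escaper.apply("a'b");
        return escaped != null && escaped.indexOf('\'') < 0;
    }

    public static void main(String[] args) {
        // Stand-in for an escaper that only handles & < > and "
        UnaryOperator<String> incomplete = s -> s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
        // The same escaper with the single quote added as a numeric reference
        UnaryOperator<String> complete = s -> incomplete.apply(s).replace("'", "&#39;");

        System.out.println("incomplete escaper passes: " + escapesSingleQuote(incomplete));
        System.out.println("complete escaper passes: " + escapesSingleQuote(complete));
    }
}
```

This only tells you whether the single quote survives; it is not a substitute for reading the implementation.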
Note that with the introduction of HTML 5, a named entity was finally assigned to the single quote (the old &apos;). Nevertheless, to the best of my knowledge the Commons Lang libraries have not been updated, there is a large body of older libraries still in use, and of course it’s ridiculous to be making all the entity substitutions just to prevent an attacker from breaking out of a string. String escaping libraries need to prevent code from escaping from the string, and as they are often in the critical path they should do only that. If you choose the safer (and more expensive) option of encoding all characters, then you should be using numeric references anyway.
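If you do go the encode-everything route, a sketch along these lines is one way to do it. The class name is hypothetical, and the choice to pass ASCII letters and digits through unchanged is my assumption; everything else becomes a decimal numeric reference.

```java
// Hypothetical encoder; the allow-list of ASCII letters and digits is an assumption.
class NumericHtmlEncoder {
    static String encodeAll(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length() * 2);
        // Iterate by code point so a character outside the BMP
        // becomes one reference rather than two surrogate halves.
        input.codePoints().forEach(cp -> {
            boolean asciiAlnum = (cp >= 'a' && cp <= 'z')
                    || (cp >= 'A' && cp <= 'Z')
                    || (cp >= '0' && cp <= '9');
            if (asciiAlnum) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeAll("it's <b>"));
    }
}
```

No entity table is needed at all here, which is the point: the cost is one comparison and one append per character.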