HTML5 Parser

Gecko 2 introduces a new parser, based on HTML5. The HTML parser is one of the most complicated and sensitive pieces of a browser. It controls how your HTML source code is turned into web pages and, as such, changes to it are rare. The new parser is faster, complies with the HTML5 standard, and enables a lot of new functionality as well.

The new parser introduces these major improvements:

  • You can now use SVG and MathML inline in HTML5 pages, without XML namespace syntax.
  • Parsing is now done in a separate thread from Firefox’s main UI thread, improving overall browser responsiveness.
  • Calls to innerHTML are a lot faster.
  • Dozens of long-standing parser related bugs are now fixed.

The HTML5 specification provides a more detailed description than previous HTML standards of how to turn a stream of bytes into a DOM tree. This will result in more consistent behavior across browser implementations. In other words, in supporting HTML5, Gecko, WebKit, and Internet Explorer (IE) will behave more consistently with each other.

Changed parser behaviors

Some changes to the way that the Gecko 2 parser behaves, as compared to earlier versions of Gecko, may affect web developers, depending on how you've written your code in the past and what browsers you've tested it on.

Tokenization of left angle-bracket within a tag

Given the string <foo<bar>, the new parser reads it as one tag named foo<bar. This behavior is consistent with IE and Opera, and is different from Gecko 1.x and WebKit, which read it as two tags, foo and bar. If you previously tested your code in IE and Opera, then you probably don't have any tags like this. If you tested your site only with Gecko 1.x or WebKit (for example, Firefox-only intranets or WebKit-oriented mobile sites), then you might have tags that match this pattern, and they will behave differently with Gecko 2.

Calling document.write() during parsing

Prior to HTML5, Gecko and WebKit allowed calls to document.write() during parsing to insert content into the source stream. This behavior was inherently racy, as the content was inserted into a timing-dependent point in the source stream. If the call happened after the parser was done, the inserted content replaced the document. In IE, such calls are either ignored or imply a call to, replacing the document. In HTML5, document.write() can only be called from a script that is created by a <script> tag that does not have the async or defer attributes set. With the HTML5 parser, calls to document.write() in any other context either are ignored or replace the document.

Some contexts from which you should not call document.write() include:

If you use the same mechanism for loading script libraries for all browsers including IE, then your code probably will not be affected by this change. Scripts that serve racy code to Firefox, perhaps while serving safe code to IE, will see a difference due to this change. Firefox writes a warning to the JavaScript console when it ignores a call to document.write().

Lack of Reparsing

Prior to HTML5, parsers reparsed the document if they hit the end of the file within certain elements or within comments. For example, if the document lacked a </title> closing tag, the parser would reparse to look for the first '<' in the document, or if a comment was not closed, it would look for the first '>'. This behavior created a security vulnerability. If an attacker could force a premature end-of-file, the parser might change which parts of the document it considered to be executable scripts. In addition, supporting reparsing led to unnecessarily complex parsing code.

With HTML5, parsers no longer reparse documents. This change has the following consequences for web developers:

  • If you omit the closing tag for <title>, <style>, <textarea>, or <xmp>, the page will fail to be parsed. IE already fails to parse documents with a missing </title> tag, so if you test with IE, you probably don't have that problem.
  • If you forget to close a comment, the page will most likely fail to be parsed. However, unclosed comments often already break in existing browsers for one reason or another, so it's unlikely that you have this issue in sites that are tested in multiple browsers.
  • In an inline script, in order to use the literal strings <script</script>, and <!--, you should prevent them from being parsed literally by expressing them as \u003cscript,\u003c/script>, and \u003c!--. The older practice of escaping the string </script> by surrounding it with comment markers, while supported by HTML5, is problematic in cases where the closing comment marker is omitted (see preceding point). You can avoid such problems by using the character code for the initial '<' instead. (It is valid to use an escape character, e.g., <\/script>, but this strategy does not work for <script and <!--, because \s and \! are not valid JavaScript escapes; the character code strategy is more general-purpose.)

Inline SVG and MathML support

As a completely new parsing feature, HTML5 introduced support for inline SVG and MathML in text/html. This means that you can now use SVG and MathML inline in text/html similarly to what has previously been possible in application/xhtml+xml.

  • <svg></svg> is assigned to the SVG namespace in the DOM.
  • <math></math> is assigned to the MathML namespace in the DOM.
  • foreignObject and annotation-xml (and various less important elements) establish a nested HTML scope, so you can nest SVG, MathML and HTML as you’d expect to be able to nest them.
  • The parser case-corrects markup so <SVG VIEWBOX='0 0 10 10'> works in HTML source.
  • The DOM methods and CSS selectors behave case-sensitively, so you need to write your DOM calls and CSS selectors using the canonical case, which is camelCase for various parts of SVG such as viewBox.
  • The syntax <foo/> opens and immediately closes the foo element if it is a MathML or SVG element (i.e. not an HTML element).
  • Attributes are tokenized the same way they are tokenized in HTML, so you can omit quotes in the same situations where you can omit quotes in HTML (i.e. when the attribute value is not the empty string and does not contain whitespace, ", ', `, <, =, or >).
  • Warning: the two above features do not combine well due to the reuse of legacy-compatible HTML tokenization. If you omit quotes on the last attribute value, you must have a space before the closing slash. <circle fill=green /> is OK but <circle fill=red/> is not.
  • Attributes starting with xmlns have absolutely no effect on what namespace elements or attributes end up in, so you don’t need to use attributes starting with xmlns.
  • Attributes in the XLink namespace must use the prefix xlink (e.g. xlink:href).
  • Element names must not have prefixes or colons in them.
  • The content of SVG script elements is tokenized like they are tokenized in XML—not like the content of HTML script elements is tokenized.
  • When an SVG or MathML element is open <![CDATA[]]> sections work the way they do in XML. You can use this to hide text content from older browsers that don’t support SVG or MathML in text/html.
  • The MathML named characters are available for use in named character references everywhere in the document (also in HTML content).
  • To deal with legacy pages where authors have pasted partial SVG fragments into HTML (who knows why) or used a <math> tag for non-MathML purposes, attempts to nest various common HTML elements as children of SVG elements (without foreignObject) will immediately break out of SVG or MathML context. This may make some typos have surprising effects.

Performance improvement with speculative parsing

Unrelated to the requirements of HTML5 specification, the Gecko 2 parser uses speculative parsing, in which it continues parsing a document while scripts are being downloaded and executed. This results in improved performance compared to older parsers, because most of the time, Gecko can complete these tasks in parallel.

To best take advantage of speculative parsing, and help your pages to load as quickly as possible, ensure that when you call document.write(), you write a balanced sub-tree within that chunk of script. A balanced sub-tree is HTML code in which any elements that are opened are also closed, so that after the script, the elements left open are the same ones that were open before the script. The open and closing tags do not need to be written by the same document.write() call, as long as they are within the same <script> tag.

Please note that you shouldn't use end tags for void elements that don't have end tags: <area>, <base>, <br>, <col>, <command>, <embed>, <hr>, <img>, <input>, <keygen>, <link>, <meta>, <param>, <source> and <wbr>. (There are also some element whose end tags can be omitted in some cases, such as <p> in the example below, but it's simpler to always use end tags for those elements than to make sure that the end tags are only omitted when they aren't necessary.)

For example, the following code writes a balanced sub-tree:

  document.write("<p>Some content goes here.</p>");
<!-- Non-script HTML goes here. -->

In contrast, the following code contains two scripts with unbalanced sub-trees, which causes speculative parsing to fail and therefore the time to parse the document is longer.

<p>Some content goes here.</p>

For more information, see Optimizing your pages for speculative parsing.