mozilla

Revision 92643 of HTML5 Parser

  • Revision slug: HTML/HTML5/HTML5_Parser
  • Revision title: HTML5 Parser
  • Revision id: 92643
  • Created:
  • Creator: Hsivonen
  • Is current revision? No
  • Comment The SVG and MathML elements are in namespaces in the DOM; 2 words added, 1 words removed

Revision Content

{{ gecko_minversion_header("2") }}{{ draft() }}

Gecko 2 introduces a new parser, based on HTML5. The HTML parser is one of the most complicated and sensitive pieces of a browser. It controls how your HTML source code is turned into web pages and, as such, changes to it are rare. The new parser is faster, complies with the HTML5 standard, and enables a lot of new functionality as well.

The new parser introduces these major improvements:

  • You can now use SVG and MathML inline in HTML5 pages, without XML namespace syntax.
  • Parsing is now done in a separate thread from Firefox’s main UI thread, improving overall browser responsiveness.
  • Calls to innerHTML are a lot faster.
  • Dozens of long-standing parser related bugs are now fixed.

The HTML5 specification provides a more detailed description than previous HTML standards of how to turn a stream of bytes into a DOM tree. This will result in more consistent behavior across browser implementations. In other words, in supporting HTML5, Gecko, WebKit, and Internet Explorer (IE) will behave more consistently with each other.

Changed parser behaviors

Some changes to the way that the Gecko 2 parser behaves, as compared to earlier versions of Gecko, may affect web developers, depending on how you've written your code in the past and what browsers you've tested it on.

Tokenization of left angle-bracket within a tag

Given the string <foo<bar>, the new parser reads it as one tag named foo<bar. This behavior is consistent with IE and Opera, and is different from Gecko 1.x and WebKit, which read it as two tags, foo and bar. If you previously tested your code in IE and Opera, then you probably don't have any tags like this. If you tested your site only with Gecko 1.x or WebKit (for example, Firefox-only intranets or WebKit-oriented mobile sites), then you might have tags that match this pattern, and they will behave differently with Gecko 2.

Calling document.write() during parsing

Prior to HTML5, Gecko and WebKit allowed calls to document.write() during parsing to insert content into the source stream. This behavior was inherently racy, as the content was inserted into a timing-dependent point in the source stream. If the call happened after the parser was done, the inserted content replaced the document. In IE, such calls are either ignored or imply a call to document.open(), replacing the document. In HTML5, document.write() can only be called from a script that is created by a {{ HTMLElement("script") }} tag that does not have the async or defer attributes set. With the HTML5 parser, calls to document.write() in any other context either are ignored or replace the document.

Some contexts from which you should not call document.write() include:

If you use the same mechanism for loading script libraries for all browsers including IE, then your code probably will not be affected by this change. Scripts that serve racy code to Firefox, perhaps while serving safe code to IE, will see a difference due to this change. Firefox writes a warning to the JavaScript console when it ignores a call to document.write().

Lack of Reparsing

Prior to HTML5, parsers reparsed the document if they hit the end of the file within certain elements or within comments. For example, if the document lacked a </title> closing tag, the parser would reparse to look for the first '<' in the document, or if a comment was not closed, it would look for the first '>'. This behavior created a security vulnerability. If an attacker could force a premature end-of-file, the parser might change which parts of the document it considered to be executable scripts. In addition, supporting reparsing led to unnecessarily complex parsing code.

With HTML5, parsers no longer reparse documents. This change has the following consequences for web developers:

  • If you omit the closing tag for <title>, <style>, <textarea>, or <xmp>, the page will fail to be parsed. IE already fails to parse documents with a missing </title> tag, so if you test with IE, you probably don't have that problem.
  • If you forget to close a comment, the page will most likely fail to be parsed. However, unclosed comments often already break in existing browsers for one reason or another, so it's unlikely that you have this issue in sites that are tested in multiple browsers.
  • In an inline script, in order to use the literal strings </script> and <!--, you should prevent them from being parsed literally by expressing them as \u003c/script> and \u003c!--. The older practice of escaping the string </script> by surrounding it with comment markers, while supported by HTML5, is problematic in cases where the closing comment marker is omitted (see preceding point). You can avoid such problems by using the character code for the initial '<' instead.

Performance improvement with speculative parsing

Unrelated to the requirements of HTML5 specification, the Gecko 2 parser uses speculative parsing, in which it continues parsing a document while scripts are being downloaded and executed. This results in improved performance compared to older parsers, because most of the time, Gecko can complete these tasks in parallel.

To best take advantage of speculative parsing, and help your pages to load as quickly as possible, ensure that when you call document.write(), you write a balanced sub-tree within that chunk of script. A balanced sub-tree is HTML code in which any elements that are opened are also closed so that after the script, the elements left open are the same ones that were open before the script. The open and closing tags do not need to be written by the same document.write() call, as long as they are within the same <script> tag.

Please note that you shouldn't use end tags for void elements that don't have end tags: area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source and wbr. (There are also some element whose end tags can be omitted in some cases, such as p in the example below, but it's simpler to always use end tags for those elements than to make sure that the end tags are only omitted when they aren't necessary.)

For example, the following code writes a balanced sub-tree:

<script>
  document.write("<div>");
  document.write("<p>Some content goes here.</p>");
  document.write("</div>");
</script>
<!-- Non-script HTML goes here. -->

In contrast, the following code contains two scripts with unbalanced sub-trees, which causes speculative parsing to fail and therefore the time to parse the document is longer.

<script>document.write("<div>");</script>
<p>Some content goes here.</p>
<script>document.write("</div>");</script>

For more information, see Optimizing your pages for speculative parsing.

Revision Source

<p>{{ gecko_minversion_header("2") }}{{ draft() }}</p>
<p>Gecko 2 introduces a new parser, based on HTML5. The HTML parser is one of the most complicated and sensitive pieces of a browser. It controls how your HTML source code is turned into web pages and, as such, changes to it are rare. The new parser is faster, complies with the HTML5 standard, and enables a lot of new functionality as well.</p>
<p>The new parser introduces these major improvements:</p>
<ul> <li>You can now use SVG and MathML inline in HTML5 pages, without XML namespace syntax.</li> <li>Parsing is now done in a separate thread from Firefox’s main UI thread, improving overall browser responsiveness.</li> <li>Calls to <code>innerHTML</code> are a lot faster.</li> <li><a class=" external" href="http://tinyurl.com/html-parser-bugs" title="https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=substring&amp;status_whiteboard=[fixed%20by%20the%20HTML5%20parser]&amp;resolution=FIXED">Dozens of long-standing parser related bugs</a> are now fixed.</li>
</ul>
<p>The <a class=" external" href="http://www.w3.org/TR/html5/" title="http://www.w3.org/TR/html5/">HTML5 specification</a> provides a more detailed description than previous HTML standards of how to turn a stream of bytes into a DOM tree. This will result in more consistent behavior across browser implementations. In other words, in supporting HTML5, Gecko, WebKit, and Internet Explorer (IE) will behave more consistently with each other.</p>
<h2>Changed parser behaviors</h2>
<p>Some changes to the way that the Gecko 2 parser behaves, as compared to earlier versions of Gecko, may affect web developers, depending on how you've written your code in the past and what browsers you've tested it on.</p>
<h3>Tokenization of left angle-bracket within a tag</h3>
<p>Given the string <code>&lt;foo&lt;bar&gt;</code>, the new parser reads it as one tag named <code>foo&lt;bar</code>. This behavior is consistent with IE and Opera, and is different from Gecko 1.x and WebKit, which read it as two tags, <code>foo</code> and <code>bar</code>. If you previously tested your code in IE and Opera, then you probably don't have any tags like this. If you tested your site only with Gecko 1.x or WebKit (for example, Firefox-only intranets or WebKit-oriented mobile sites), then you might have tags that match this pattern, and they will behave differently with Gecko 2.</p>
<h3>Calling document.write() during parsing</h3>
<p>Prior to HTML5, Gecko and WebKit allowed calls to <a href="/en/DOM/document.write" title="en/DOM/document.write"><code>document.write()</code></a> <em>during parsing</em> to insert content into the source stream. This behavior was inherently <a class=" external" href="http://en.wikipedia.org/wiki/Race_condition" title="http://en.wikipedia.org/wiki/Race_condition">racy</a>, as the content was inserted into a timing-dependent point in the source stream. If the call happened after the parser was done, the inserted content replaced the document. In IE, such calls are either ignored or imply a call to <a href="/en/DOM/document.open" title="en/DOM/document.open"><code>document.open()</code></a>, replacing the document. In HTML5, <code>document.write()</code> can only be called from a script that is created by a {{ HTMLElement("script") }} tag that does not have the <code><a href="/En/HTML/Element/Script#attr_async" title="En/HTML/Element/Script#attr async">async</a></code> or <code><a href="/En/HTML/Element/Script#attr_defer" title="En/HTML/Element/Script#attr defer">defer</a></code> attributes set. With the HTML5 parser, calls to <code>document.write()</code> in any other context either are ignored or replace the document.</p>
<p>Some contexts from which you should <em>not</em> call <code>document.write()</code> include:</p>
<ul> <li>scripts created using <a href="/en/DOM/document.createElement" title="en/DOM/document.createElement">document.createElement()</a></li> <li>event handlers</li> <li><a href="/en/DOM/window.setTimeout" title="en/DOM/window.setTimeout">setTimeout()</a></li> <li><a href="/en/DOM/window.setInterval" title="en/DOM/window.setInterval">setInterval()</a></li> <li><code>&lt;script async src="..."&gt;</code></li> <li><code>&lt;script defer src="..."&gt;</code></li>
</ul>
<p>If you use the same mechanism for loading script libraries for all browsers including IE, then your code probably will not be affected by this change. Scripts that serve racy code to Firefox, perhaps while serving safe code to IE, will see a difference due to this change. Firefox writes a warning to the JavaScript console when it ignores a call to <code>document.write()</code>.</p>
<h3>Lack of Reparsing</h3>
<p>Prior to HTML5, parsers reparsed the document if they hit the end of the file within certain elements or within comments. For example, if the document lacked a <code>&lt;/title&gt;</code> closing tag, the parser would reparse to look for the first '&lt;' in the document, or if a comment was not closed, it would look for the first '&gt;'. This behavior created a security vulnerability. If an attacker could force a premature end-of-file, the parser might change which parts of the document it considered to be executable scripts. In addition, supporting reparsing led to unnecessarily complex parsing code.</p>
<p>With HTML5, parsers no longer reparse documents. This change has the following consequences for web developers:</p>
<ul> <li>If you omit the closing tag for &lt;title&gt;, &lt;style&gt;, &lt;textarea&gt;, or &lt;xmp&gt;, the page <em>will</em> fail to be parsed. IE already fails to parse documents with a missing &lt;/title&gt; tag, so if you test with IE, you probably don't have that problem.</li> <li>If you forget to close a comment, the page will most likely fail to be parsed. However, unclosed comments often already break in existing browsers for one reason or another, so it's unlikely that you have this issue in sites that are tested in multiple browsers.</li> <li>In an inline script, in order to use the literal strings <code>&lt;/script&gt;</code> and <code>&lt;!--</code>, you should prevent them from being parsed literally by expressing them as <code>\u003c/script&gt;</code> and <code>\u003c!--</code>. The older practice of escaping the string <code>&lt;/script&gt;</code> by surrounding it with comment markers, while supported by HTML5, is problematic in cases where the closing comment marker is omitted (see preceding point). You can avoid such problems by using the character code for the initial '&lt;' instead.</li>
</ul>
<h2>Performance improvement with speculative parsing</h2>
<p>Unrelated to the requirements of HTML5 specification, the Gecko 2 parser uses <em>speculative parsing</em>, in which it continues parsing a document while scripts are being downloaded and executed. This results in improved performance compared to older parsers, because most of the time, Gecko can complete these tasks in parallel.</p>
<p>To best take advantage of speculative parsing, and help your pages to load as quickly as possible, ensure that when you call <a href="/en/DOM/document.write" title="en/DOM/document.write">document.write()</a>, you write a <em>balanced sub-tree</em> within that chunk of script. A balanced sub-tree is HTML code in which any elements that are opened are also closed so that after the script, the elements left open are the same ones that were open before the script. The open and closing tags do not need to be written by the same <code>document.write()</code> call, as long as they are within the same <code>&lt;script&gt;</code> tag.</p>
<p>Please note that you shouldn't use end tags for void elements that don't have end tags: <code>area</code>, <code>base</code>, <code>br</code>, <code>col</code>, <code>command</code>, <code>embed</code>, <code>hr</code>, <code>img</code>, <code>input</code>, <code>keygen</code>, <code>link</code>, <code>meta</code>, <code>param</code>, <code>source</code> and <code>wbr</code>. (There are also some element whose end tags can be omitted in some cases, such as <code>p</code> in the example below, but it's simpler to always use end tags for those elements than to make sure that the end tags are only omitted when they aren't necessary.)</p>
<p>For example, the following code writes a balanced sub-tree:</p>
<pre>&lt;script&gt;
  document.write("&lt;div&gt;");
  document.write("&lt;p&gt;Some content goes here.&lt;/p&gt;");
  document.write("&lt;/div&gt;");
&lt;/script&gt;
&lt;!-- Non-script HTML goes here. --&gt;
</pre>
<p>In contrast, the following code contains two scripts with unbalanced sub-trees, which causes speculative parsing to fail and therefore the time to parse the document is longer.</p>
<pre>&lt;script&gt;document.write("&lt;div&gt;");&lt;/script&gt;
&lt;p&gt;Some content goes here.&lt;/p&gt;
&lt;script&gt;document.write("&lt;/div&gt;");&lt;/script&gt;
</pre>
<p>For more information, see <a href="/en/HTML/HTML5/Optimizing_Your_Pages_for_Speculative_Parsing" title="en/Optimizing Your Pages for Speculative Parsing">Optimizing your pages for speculative parsing</a>.</p>
Revert to this revision