Parser HTML5

O Gecko 2 introduz um novo analizador (parser), baseado no HTML5. O analizador (parser) HTML é uma das peças mais sensíveis e complicadas de um navegador. Ele controla como o código fonte HTML é transformado em páginas web e, também, mudanças para alguns casos de exceções. O novo analisador (parser) é mais rápido, cumpre com o padrão do HTML5, e habilita também várias novas funcionalidades.

O novo analisador (parser) introduz estas melhorias maiores:

  • Você pode agora usar o SVG e o MathML em linha em páginas HTML5, sem a sintaxe namespace XML.
  • O analizador (parsing) está completo em um processo separado do processo de interface de usuário do Firefox, melhorando a responsividade do navegador como um todo.
  • Chamadas para innerHTML estão muito mais rápidas.
  • Dúzias de bugs relacionados à análise de longa data (en) estão agora corrigidas.

A especificação do HTML5 (en) fornece descrições mais detalhadas do que os padrões anteriores do HTML de como transformar um fluxo de bytes em uma árvore DOM. Isto resultará em implementações mais consistentes do navegador. Em outras palavras, o suporte ao HTML5 no Gecko, WebKit e Internet Explorer (IE) se tornará mais consistente um com o outro.

Changed parser behaviors

Some changes to the way that the Gecko 2 parser behaves, as compared to earlier versions of Gecko, may affect web developers, depending on how you've written your code in the past and what browsers you've tested it on.

Tokenization of left angle-bracket within a tag

Given the string <foo<bar>, the new parser reads it as one tag named foo<bar. This behavior is consistent with IE and Opera, and is different from Gecko 1.x and WebKit, which read it as two tags, foo and bar. If you previously tested your code in IE and Opera, then you probably don't have any tags like this. If you tested your site only with Gecko 1.x or WebKit (for example, Firefox-only intranets or WebKit-oriented mobile sites), then you might have tags that match this pattern, and they will behave differently with Gecko 2.

Calling document.write() during parsing

Prior to HTML5, Gecko and WebKit allowed calls to document.write() during parsing to insert content into the source stream. This behavior was inherently racy, as the content was inserted into a timing-dependent point in the source stream. If the call happened after the parser was done, the inserted content replaced the document. In IE, such calls are either ignored or imply a call to, replacing the document. In HTML5, document.write() can only be called from a script that is created by a <script> (en-US) tag that does not have the async or defer attributes set. With the HTML5 parser, calls to document.write() in any other context either are ignored or replace the document.

Some contexts from which you should not call document.write() include:

If you use the same mechanism for loading script libraries for all browsers including IE, then your code probably will not be affected by this change. Scripts that serve racy code to Firefox, perhaps while serving safe code to IE, will see a difference due to this change. Firefox writes a warning to the JavaScript console when it ignores a call to document.write().

Lack of Reparsing

Prior to HTML5, parsers reparsed the document if they hit the end of the file within certain elements or within comments. For example, if the document lacked a </title> closing tag, the parser would reparse to look for the first '<' in the document, or if a comment was not closed, it would look for the first '>'. This behavior created a security vulnerability. If an attacker could force a premature end-of-file, the parser might change which parts of the document it considered to be executable scripts. In addition, supporting reparsing led to unnecessarily complex parsing code.

With HTML5, parsers no longer reparse documents. This change has the following consequences for web developers:

  • If you omit the closing tag for <title>, <style>, <textarea>, or <xmp>, the page will fail to be parsed. IE already fails to parse documents with a missing </title> tag, so if you test with IE, you probably don't have that problem.
  • If you forget to close a comment, the page will most likely fail to be parsed. However, unclosed comments often already break in existing browsers for one reason or another, so it's unlikely that you have this issue in sites that are tested in multiple browsers.
  • In an inline script, in order to use the literal strings </script> and <!--, you should prevent them from being parsed literally by expressing them as \u003c/script> and \u003c!--. The older practice of escaping the string </script> by surrounding it with comment markers, while supported by HTML5, is problematic in cases where the closing comment marker is omitted (see preceding point). You can avoid such problems by using the character code for the initial '<' instead.

Performance improvement with speculative parsing

Unrelated to the requirements of HTML5 specification, the Gecko 2 parser uses speculative parsing, in which it continues parsing a document while scripts are being downloaded and executed. This results in improved performance compared to older parsers, because most of the time, Gecko can complete these tasks in parallel.

To best take advantage of speculative parsing, and help your pages to load as quickly as possible, ensure that when you call document.write(), you write a balanced sub-tree within that chunk of script. A balanced sub-tree is HTML code in which any elements that are opened are also closed, so that after the script, the elements left open are the same ones that were open before the script. The open and closing tags do not need to be written by the same document.write() call, as long as they are within the same <script> tag.

Please note that you shouldn't use end tags for void elements that don't have end tags: <area> (en-US), <base> (en-US), <br> (en-US), <col> (en-US), <command>, <embed> (en-US), <hr> (en-US), <img> (en-US), <input> (en-US), <keygen> (en-US), <link> (en-US), <meta> (en-US), <param> (en-US), <source> (en-US) and <wbr> (en-US). (There are also some element whose end tags can be omitted in some cases, such as <p> (en-US) in the example below, but it's simpler to always use end tags for those elements than to make sure that the end tags are only omitted when they aren't necessary.)

For example, the following code writes a balanced sub-tree:

  document.write("<p>Some content goes here.</p>");
<!-- Non-script HTML goes here. -->

In contrast, the following code contains two scripts with unbalanced sub-trees, which causes speculative parsing to fail and therefore the time to parse the document is longer.

<p>Some content goes here.</p>

For more information, see Optimizing your pages for speculative parsing.