mozilla

Revision 118643 of Unicode

  • Revision slug: JavaScript/Guide/Obsolete_Pages/Unicode
  • Revision title: Unicode
  • Revision id: 118643
  • Created:
  • Creator: Mgjbot
  • Is current revision? No
  • Comment robot Adding: [[es:Guía JavaScript 1.5:Unicode]] <<langbot>>

Revision Content

Unicode

Unicode is a universal character-coding standard for the interchange and display of principal written languages. It covers the languages of the Americas, Europe, Middle East, Africa, India, Asia, and Pacifica, as well as historic scripts and technical symbols. Unicode allows for the exchange, processing, and display of multilingual texts, as well as the use of common technical and mathematical symbols. It hopes to resolve internationalization problems of multilingual computing, such as different national character standards. Not all modern or archaic scripts, however, are currently supported.

The Unicode character set can be used for all known encoding. Unicode is modeled after the ASCII (American Standard Code for Information Interchange) character set. It uses a numerical value and name for each character. The character encoding specifies the identity of the character and its numeric value (code position), as well as the representation of this value in bits. The 16-bit numeric value (code value) is defined by a hexadecimal number and a prefix U, for example, U+0041 represents A. The unique name for this value is LATIN CAPITAL LETTER A.

Unicode is not supported in versions of JavaScript prior to 1.3.

Unicode compatibility with ASCII and ISO

Unicode is fully compatible with the International Standard ISO/IEC 10646-1; 1993, which is a subset of ISO 10646.

Several encoding standards (including UTF-8, UTF-16 and ISO UCS-2) are used to physically represent Unicode as actual bits.

The UTF-8 encoding of Unicode is compatible with ASCII characters and is supported by many programs. The first 128 Unicode characters correspond to the ASCII characters and have the same byte value. The Unicode characters U+0020 through U+007E are equivalent to the ASCII characters 0x20 through 0x7E. Unlike ASCII, which supports the Latin alphabet and uses a 7-bit character set, UTF-8 uses between one and four octets for each character. ("Octet" meaning a byte, or 8 bits.) This allows for several million characters. An alternative encoding standard, UTF-16, uses two octets to represent Unicode characters. An escape sequence allows UTF-16 to represent the whole Unicode range by using four octets. The ISO UCS-2 (Universal Character Set) uses two octets.

JavaScript and Navigator support for UTF-8/Unicode means you can use non-Latin, international, and localized characters, plus special technical symbols in JavaScript programs. Unicode provides a standard way to encode multilingual text. Since the UTF-8 encoding of Unicode is compatible with ASCII, programs can use ASCII characters. You can use non-ASCII Unicode characters in the comments, string literals, identifiers, and regular expressions of JavaScript.

Unicode escape sequences

You can use the Unicode escape sequence in string literals, regular expressions, and identifiers. The escape sequence consists of six ASCII characters: \u and a four-digit hexadecimal number. For example, \u00A9 represents the copyright symbol. Every Unicode escape sequence in JavaScript is interpreted as one character.

The following code returns the copyright symbol and the string "Netscape Communications".

x="\u00A9 Netscape Communications"

The following table lists frequently used special characters and their Unicode value.

Category Unicode value Name Format name
White space values \u0009 Tab <TAB>
  \u000B Vertical Tab <VT>
  \u000C Form Feed <FF>
  \u0020 Space <SP>
Line terminator values \u000A Line Feed <LF>
  \u000D Carriage Return <CR>
Additional Unicode escape sequence values \u0008 Backspace <BS>
  \u0009 Horizontal Tab <HT>
  \u0022 Double Quote "
  \u0027 Single Quote '
  \u005C Backslash \

Table 2.2: Unicode values for special characters

The JavaScript use of the Unicode escape sequence is different from Java. In JavaScript, the escape sequence is never interpreted as a special character first. For example, a line terminator escape sequence inside a string does not terminate the string before it is interpreted by the function. JavaScript ignores any escape sequence if it is used in comments. In Java, if an escape sequence is used in a single comment line, it is interpreted as an Unicode character. For a string literal, the Java compiler interprets the escape sequences first. For example, if a line terminator escape character (e.g., \u000A) is used in Java, it terminates the string literal. In Java, this leads to an error, because line terminators are not allowed in string literals. You must use \n for a line feed in a string literal. In JavaScript, the escape sequence works the same way as \n.

Unicode characters in JavaScript files

Earlier versions of Gecko assumed the Latin-1 character encoding for JavaScript files loaded from XUL. Starting with Gecko 1.8, the character encoding is inferred from the XUL file's encoding. Please see International characters in XUL JavaScript for more information.

Displaying characters with Unicode

You can use Unicode to display the characters in different languages or technical symbols. For characters to be displayed properly, a client such as Mozilla Firefox or Netscape needs to support Unicode. Moreover, an appropriate Unicode font must be available to the client, and the client platform must support Unicode. Often, Unicode fonts do not display all the Unicode characters. Some platforms, such as Windows 95, provide partial support for Unicode.

To receive non-ASCII character input, the client needs to send the input as Unicode. Using a standard enhanced keyboard, the client cannot easily input the additional characters supported by Unicode. Sometimes, the only way to input Unicode characters is by using Unicode escape sequences.

For more information on Unicode, see the Unicode Home Page and The Unicode Standard, Version 2.0, published by Addison-Wesley, 1996.

{{template.PreviousNext("Core_JavaScript_1.5_Guide:Literals", "Core_JavaScript_1.5_Guide:Expressions")}}

{{ wiki.languages( { "es": "es/Gu\u00eda_JavaScript_1.5/Unicode", "fr": "fr/Guide_JavaScript_1.5/Unicode", "ja": "ja/Core_JavaScript_1.5_Guide/Unicode", "ko": "ko/Core_JavaScript_1.5_Guide/Unicode", "pl": "pl/Przewodnik_po_j\u0119zyku_JavaScript_1.5/Unicode" } ) }}

Revision Source

<p>
</p>
<h3 name="Unicode"> Unicode </h3>
<p>Unicode is a universal character-coding standard for the interchange and display of principal written languages. It covers the languages of the Americas, Europe, Middle East, Africa, India, Asia, and Pacifica, as well as historic scripts and technical symbols. Unicode allows for the exchange, processing, and display of multilingual texts, as well as the use of common technical and mathematical symbols. It hopes to resolve internationalization problems of multilingual computing, such as different national character standards. Not all modern or archaic scripts, however, are currently supported.
</p><p>The Unicode character set can be used for all known encoding. Unicode is modeled after the ASCII (American Standard Code for Information Interchange) character set. It uses a numerical value and name for each character. The character encoding specifies the identity of the character and its numeric value (code position), as well as the representation of this value in bits. The 16-bit numeric value (code value) is defined by a hexadecimal number and a prefix U, for example, U+0041 represents A. The unique name for this value is LATIN CAPITAL LETTER A.
</p><p><b>Unicode is not supported in versions of JavaScript prior to 1.3.</b>
</p>
<h4 name="Unicode_compatibility_with_ASCII_and_ISO"> Unicode compatibility with ASCII and ISO </h4>
<p>Unicode is fully compatible with the International Standard ISO/IEC 10646-1; 1993, which is a subset of ISO 10646.
</p><p>Several encoding standards (including UTF-8, UTF-16 and ISO UCS-2) are used to physically represent Unicode as actual bits.
</p><p>The UTF-8 encoding of Unicode is compatible with ASCII characters and is supported by many programs. The first 128 Unicode characters correspond to the ASCII characters and have the same byte value. The Unicode characters U+0020 through U+007E are equivalent to the ASCII characters 0x20 through 0x7E. Unlike ASCII, which supports the Latin alphabet and uses a 7-bit character set, UTF-8 uses between one and four octets for each character. ("Octet" meaning a byte, or 8 bits.) This allows for several million characters. An alternative encoding standard, UTF-16, uses two octets to represent Unicode characters. An escape sequence allows UTF-16 to represent the whole Unicode range by using four octets. The ISO UCS-2 (Universal Character Set) uses two octets.
</p><p>JavaScript and Navigator support for UTF-8/Unicode means you can use non-Latin, international, and localized characters, plus special technical symbols in JavaScript programs. Unicode provides a standard way to encode multilingual text. Since the UTF-8 encoding of Unicode is compatible with ASCII, programs can use ASCII characters. You can use non-ASCII Unicode characters in the comments, string literals, identifiers, and regular expressions of JavaScript.
</p>
<h4 name="Unicode_escape_sequences"> Unicode escape sequences </h4>
<p>You can use the Unicode escape sequence in string literals, regular expressions, and identifiers. The escape sequence consists of six ASCII characters: \u and a four-digit hexadecimal number. For example, \u00A9 represents the copyright symbol. Every Unicode escape sequence in JavaScript is interpreted as one character.
</p><p>The following code returns the copyright symbol and the string "Netscape Communications".
</p>
<pre>x="\u00A9 Netscape Communications"</pre>
<p>The following table lists frequently used special characters and their Unicode value.
</p>
<table class="fullwidth-table">
<tbody><tr>
<th>Category</th>
<th>Unicode value</th>
<th>Name</th>
<th>Format name</th>
</tr>
<tr>
<td>White space values</td>
<td>\u0009</td>
<td>Tab</td>
<td>&lt;TAB&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u000B</td>
<td>Vertical Tab</td>
<td>&lt;VT&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u000C</td>
<td>Form Feed</td>
<td>&lt;FF&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u0020</td>
<td>Space</td>
<td>&lt;SP&gt;</td>
</tr>
<tr>
<td>Line terminator values</td>
<td>\u000A</td>
<td>Line Feed</td>
<td>&lt;LF&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u000D</td>
<td>Carriage Return</td>
<td>&lt;CR&gt;</td>
</tr>
<tr>
<td>Additional Unicode escape sequence values</td>
<td>\u0008</td>
<td>Backspace</td>
<td>&lt;BS&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u0009</td>
<td>Horizontal Tab</td>
<td>&lt;HT&gt;</td>
</tr>
<tr>
<td> </td>
<td>\u0022</td>
<td>Double Quote</td>
<td>"</td>
</tr>
<tr>
<td> </td>
<td>\u0027</td>
<td>Single Quote</td>
<td>'</td>
</tr>
<tr>
<td> </td>
<td>\u005C</td>  
<td>Backslash</td>
<td>\</td>
</tr>
</tbody></table>
<p><small><b>Table 2.2: Unicode values for special characters</b></small>
</p><p>The JavaScript use of the Unicode escape sequence is different from Java. In JavaScript, the escape sequence is never interpreted as a special character first. For example, a line terminator escape sequence inside a string does not terminate the string before it is interpreted by the function. JavaScript ignores any escape sequence if it is used in comments. In Java, if an escape sequence is used in a single comment line, it is interpreted as an Unicode character. For a string literal, the Java compiler interprets the escape sequences first. For example, if a line terminator escape character (e.g., \u000A) is used in Java, it terminates the string literal. In Java, this leads to an error, because line terminators are not allowed in string literals. You must use \n for a line feed in a string literal. In JavaScript, the escape sequence works the same way as \n.
</p>
<h4 name="Unicode_characters_in_JavaScript_files"> Unicode characters in JavaScript files </h4>
<p>Earlier versions of <a href="en/Gecko">Gecko</a> assumed the Latin-1 character encoding for JavaScript files loaded from XUL. Starting with Gecko 1.8, the character encoding is inferred from the XUL file's encoding. Please see <a href="en/International_characters_in_XUL_JavaScript">International characters in XUL JavaScript</a> for more information.
</p>
<h4 name="Displaying_characters_with_Unicode"> Displaying characters with Unicode </h4>
<p>You can use Unicode to display the characters in different languages or technical symbols. For characters to be displayed properly, a client such as Mozilla Firefox or Netscape needs to support Unicode. Moreover, an appropriate Unicode font must be available to the client, and the client platform must support Unicode. Often, Unicode fonts do not display all the Unicode characters. Some platforms, such as Windows 95, provide partial support for Unicode.
</p><p>To receive non-ASCII character input, the client needs to send the input as Unicode. Using a standard enhanced keyboard, the client cannot easily input the additional characters supported by Unicode. Sometimes, the only way to input Unicode characters is by using Unicode escape sequences.
</p><p>For more information on Unicode, see the <a class="external" href="http://www.unicode.org/">Unicode Home Page</a> and The Unicode Standard, Version 2.0, published by Addison-Wesley, 1996.
</p><p>{{template.PreviousNext("Core_JavaScript_1.5_Guide:Literals", "Core_JavaScript_1.5_Guide:Expressions")}}
</p>
<div class="noinclude">
</div>
{{ wiki.languages( { "es": "es/Gu\u00eda_JavaScript_1.5/Unicode", "fr": "fr/Guide_JavaScript_1.5/Unicode", "ja": "ja/Core_JavaScript_1.5_Guide/Unicode", "ko": "ko/Core_JavaScript_1.5_Guide/Unicode", "pl": "pl/Przewodnik_po_j\u0119zyku_JavaScript_1.5/Unicode" } ) }}
Revert to this revision