
Revision 96900 of Reading textual data

  • Revision slug: Reading_textual_data
  • Revision title: Reading textual data
  • Revision id: 96900
  • Created:
  • Creator: Jungshik
  • Is current revision? No
  • Comment: charset -> character encoding (W3C/Unicode Consortium standard term). UCS-4 -> UTF-32 (almost synonymous, but UTF-32 is better when talking about character encodings)

Revision Content

Reading textual data from streams, files and sockets

Warning: This article uses unfrozen interfaces. These interfaces may change in newer Mozilla versions and your code may stop working.

To read textual data, you need to know which character encoding the data is in. Files and network sockets contain bytes, not characters; the character encoding is what gives those bytes a meaning.

Determining the character encoding of data

If you have a network channel (nsIChannel), you can try its contentCharset property. Note that not all channels know the character encoding of their data. In that case, you can fall back to the default character encoding stored in preferences (intl.charset.default, a string value).
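The fallback can be sketched as follows. The helper name pickCharset and the getDefaultCharset callback are hypothetical; the callback stands in for whatever code reads intl.charset.default from preferences. The try/catch is a precaution, on the assumption that a channel without a known encoding may throw instead of returning an empty string.

```javascript
// Hypothetical helper: choose a character encoding for a channel's data.
// 'channel' is expected to expose contentCharset (as nsIChannel does);
// 'getDefaultCharset' is a caller-supplied function that returns the value
// of the intl.charset.default preference.
function pickCharset(channel, getDefaultCharset) {
  var charset = "";
  try {
    charset = channel.contentCharset;
  } catch (e) {
    // Assumption: a channel that does not know its encoding may throw.
  }
  // An empty or missing value means the channel could not tell us.
  return charset ? charset : getDefaultCharset();
}
```

A caller would then pass the result as the character encoding argument when initializing the converter stream.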

When reading from a file, the question is harder to answer. Using the system character encoding may work (XXX insert text how to get it); otherwise, you can again fall back to the default character encoding from preferences.

Reading strings

Starting with Gecko 1.8 (SeaMonkey 1.0, Firefox 1.5), you can use nsIConverterInputStream to read strings from a stream. This work was done in bug 295047.

Usage:

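// 'fis' is the nsIInputStream to read from, for example an initialized nsIFileInputStream.
var charset = /* Need to find out what the character encoding is. Using UTF-8 for this example: */ "UTF-8";
var is = Components.classes["@mozilla.org/intl/converter-input-stream;1"]
                   .createInstance(Components.interfaces.nsIConverterInputStream);
// Arguments: source stream, character encoding, buffer size, and whether to
// replace invalid byte sequences with U+FFFD.
is.init(fis, charset, 1024, true);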

Now you can read a string from is:

var str = {};
var numChars = is.readString(4096, str);
if (numChars != 0 /* EOF */)
  var read_string = str.value;

To read the entire stream and do something with the data:

var str = {};
while (is.readString(4096, str) != 0) {
  processData(str.value);
}

Don't forget to close the stream when you're done with it (is.close()). Not doing so can cause problems if you try to rename or delete the file at a later time on some platforms.

Note that you may get fewer characters than you asked for, especially (but not only) at the end of the file or stream.
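Putting the loop and the call to close() together, a minimal sketch of a "read everything" helper might look like this. The name readAllText is hypothetical, and the helper assumes that is has already been initialized as shown earlier.

```javascript
// Hypothetical helper: read everything from an already-initialized
// nsIConverterInputStream and return it as one string.
function readAllText(is) {
  var str = {};
  var parts = [];
  try {
    // readString may return fewer characters than requested; only a
    // return value of 0 signals the end of the stream.
    while (is.readString(4096, str) != 0) {
      parts.push(str.value);
    }
  } finally {
    // Close even if readString throws, so the file is not left open.
    is.close();
  }
  return parts.join("");
}
```

Collecting the chunks in an array and joining them once avoids depending on how many characters each call happens to return.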

Unsupported byte sequences

Byte sequences that do not correspond to a valid character are replaced by U+FFFD REPLACEMENT CHARACTER if the last argument to init is true. Otherwise, readString fails with an error when it reaches the unsupported byte.
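From JavaScript, an XPCOM error surfaces as an exception, so code that passes false as the last argument to init needs a try/catch around readString. The following is a minimal sketch; the helper name readChecked and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: read up to 'count' characters from a converter
// stream that was initialized with false as the last argument to init.
// Returns { ok: true, text: ... } on success (text is "" at end of stream)
// or { ok: false, error: e } when readString fails.
function readChecked(is, count) {
  var str = {};
  try {
    var numChars = is.readString(count, str);
    return { ok: true, text: numChars != 0 ? str.value : "" };
  } catch (e) {
    // Most likely an unsupported byte sequence in the input.
    return { ok: false, error: e };
  }
}
```

The caller can then decide whether to abort or to retry with a different character encoding.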

Reading Lines

There is currently no easy, general way to read a Unicode line from a stream.

For the limited use case of reading lines from a local file, the following code works. This code will not work for character encodings that contain embedded nulls, such as UTF-16 and UTF-32.

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The character encoding you want, using UTF-8 here */ "UTF-8";

// This assumes that 'file' is a variable that contains the file you want to read, as an nsIFile
var fis = Components.classes["@mozilla.org/network/file-input-stream;1"]
                    .createInstance(Components.interfaces.nsIFileInputStream);
fis.init(file, -1, -1, 0);

var lis = fis.QueryInterface(Components.interfaces.nsILineInputStream);
var lineData = {};
var cont;
do {
  cont = lis.readLine(lineData);
  var line = converter.ConvertToUnicode(lineData.value);

  // Now you can do something with line

} while (cont);
fis.close();
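One way around the embedded-null limitation is to decode first and split into lines afterwards, using the nsIConverterInputStream described under "Reading strings". The following is only a sketch: the helper name forEachLine is hypothetical, it assumes is has already been initialized, and it leaves closing the stream to the caller.

```javascript
// Hypothetical helper: call onLine(line) for every line read from an
// already-initialized nsIConverterInputStream. Line terminators
// (\n, \r\n or a lone \r) are not passed to the callback.
function forEachLine(is, onLine) {
  var str = {};
  var pending = "";
  while (is.readString(4096, str) != 0) {
    pending += str.value;
    // Hold back a trailing \r: the next chunk may begin with the matching \n.
    var heldCR = pending.charAt(pending.length - 1) == "\r";
    var text = heldCR ? pending.slice(0, -1) : pending;
    var lines = text.split(/\r\n|\r|\n/);
    // The last piece is an incomplete line; keep it for the next chunk.
    pending = lines.pop() + (heldCR ? "\r" : "");
    for (var i = 0; i < lines.length; i++) {
      onLine(lines[i]);
    }
  }
  // Emit whatever is left once the stream is exhausted.
  if (pending.charAt(pending.length - 1) == "\r") {
    onLine(pending.slice(0, -1));
  } else if (pending != "") {
    onLine(pending);
  }
}
```

Because the splitting happens on decoded characters rather than raw bytes, this approach does not depend on the byte layout of the character encoding.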

Earlier versions

Earlier versions of Gecko do not provide an easy way to read Unicode data from a stream. You have to read a block of data manually and convert it using nsIScriptableUnicodeConverter.

For example:

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The character encoding you want, using UTF-8 here */ "UTF-8";

// Now, read from the stream
// This assumes istream is the stream you want to read from
var scriptableStream = Components.classes["@mozilla.org/scriptableinputstream;1"]
                                 .createInstance(Components.interfaces.nsIScriptableInputStream);
scriptableStream.init(istream);
var chunk = scriptableStream.read(4096);
var text = converter.ConvertToUnicode(chunk);

Note, however, that this method will not work for character encodings that have embedded null bytes, such as UTF-16 or UTF-32, because read() returns a string that is truncated at the first null byte.
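The example above reads a single block of up to 4096 bytes. To read a whole stream this way, one option is to collect all the bytes first and convert them in one call, which sidesteps the question of a multi-byte sequence being split at a block boundary. The following is a minimal sketch with a hypothetical helper name, readAndConvert; it assumes a blocking stream (such as a file input stream) whose read() returns an empty string once the data is exhausted.

```javascript
// Hypothetical helper: drain an initialized nsIScriptableInputStream and
// decode the collected bytes with a single call to the converter.
function readAndConvert(scriptableStream, converter) {
  var bytes = "";
  var chunk = scriptableStream.read(4096);
  // Assumption: an empty string from read() means there is no more data.
  while (chunk.length > 0) {
    bytes += chunk;
    chunk = scriptableStream.read(4096);
  }
  return converter.ConvertToUnicode(bytes);
}
```

This holds the whole input in memory, so it is only suitable for reasonably small streams.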

See also

Writing textual data
