Reading textual data

  • Revision slug: Reading_textual_data
  • Revision title: Reading textual data
  • Revision id: 96897
  • Created:
  • Creator: Biesi
  • Is current revision? No
  • Comment /* See also */ category: sample code

Revision Content

Reading textual data from streams, files and sockets

Warning: This article uses unfrozen interfaces. These interfaces may change in newer Mozilla versions and your code may stop working.

In order to read textual data, you need to know which character set the data is in. Files and network sockets contain bytes, not characters - to give these bytes a meaning, you need to know the character set.
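To make the difference between bytes and characters concrete, here is a minimal sketch in plain JavaScript that does not use any Mozilla interface. It assumes an environment that provides the standard TextDecoder; the helper name decodeLatin1 is hypothetical and exists only for this illustration.

```javascript
// The bytes 0xC3 0xA9 are the UTF-8 encoding of "é" (U+00E9).
var bytes = new Uint8Array([0xC3, 0xA9]);

// Interpreted as UTF-8, the two bytes form a single character.
var asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Interpreted as ISO-8859-1, every byte is one character whose code point
// equals the byte value, so the same data becomes two characters.
function decodeLatin1(byteArray) {
  var result = "";
  for (var i = 0; i < byteArray.length; i++) {
    result += String.fromCharCode(byteArray[i]);
  }
  return result;
}
var asLatin1 = decodeLatin1(bytes);
```

The same two bytes give one character under the first interpretation and two under the second, which is why the character set has to be known before the data can be read as text.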

Determining the character set of data

If you have a network channel (nsIChannel), you can try its contentCharset property. Note that not all channels know the charset of the data. In that case, you can fall back to the default charset stored in preferences (intl.charset.default, a string value).

When reading from a file, the question is harder to answer. Using the system character set may work (XXX insert text how to get it), or you can again fall back to the default charset from preferences.
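The fallback logic described in this section could be sketched as follows. This is only an illustration: chooseCharset and getDefaultCharset are hypothetical names, and the sketch assumes that an unknown charset shows up either as an empty contentCharset or as an exception when the property is read.

```javascript
// Hypothetical helper: returns the charset reported by the channel, or the
// result of getDefaultCharset() when the channel does not know it.
// 'channel' is expected to look like an nsIChannel (a contentCharset property);
// 'getDefaultCharset' is a caller-supplied function, for example one that
// reads intl.charset.default from preferences.
function chooseCharset(channel, getDefaultCharset) {
  var charset = "";
  try {
    charset = channel.contentCharset;
  } catch (e) {
    // Assumption: treat a failing property access as "charset unknown".
  }
  return charset ? charset : getDefaultCharset();
}
```

Passing the default as a function keeps the sketch independent of how the preference is actually read, and the preference is only consulted when the channel has no answer.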

Reading strings

Starting with Gecko 1.8 (SeaMonkey 1.0, Firefox 1.5), you can use nsIConverterInputStream to read strings from a stream. This work was done in bug 295047.

Usage:

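var charset = /* Need to find out what the charset is. Using UTF-8 for this example: */ "UTF-8";
// 'fis' is the input stream to read from, for example an initialized nsIFileInputStream
var is = Components.classes["@mozilla.org/intl/converter-input-stream;1"]
                   .createInstance(Components.interfaces.nsIConverterInputStream);
// The last argument is the character substituted for byte sequences that cannot be decoded
is.init(fis, charset, 1024, Components.interfaces.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER);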

Now you can read strings from is:

var str = {};
var numChars = is.readString(4096, str);
if (numChars != 0 /* EOF */)
  var read_string = str.value;

To read the entire stream and do something with the data:

var str = {};
while (is.readString(4096, str) != 0) {
  processData(str.value);
}

Don't forget to close the stream when you're done with it (is.close()). Not doing so can cause problems if you try to rename or delete the file at a later time on some platforms.

Note that you may get fewer characters than you asked for, especially (but not only) at the end of the file (stream).
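Because a single call can return fewer characters than requested, code that needs the whole text has to keep reading until readString returns 0. A minimal sketch of such a loop, which also closes the stream when it is done, might look like this. The helper name readAllText is hypothetical; it only relies on the readString and close methods used above.

```javascript
// Hypothetical helper: reads everything from 'converterStream' (an object
// with readString(count, outParam) and close(), like nsIConverterInputStream)
// and returns it as one string.
function readAllText(converterStream) {
  var parts = [];
  var str = {};
  try {
    // readString returns the number of characters read; 0 means end of stream.
    // A short read is not the end of the stream, so only 0 stops the loop.
    while (converterStream.readString(4096, str) != 0) {
      parts.push(str.value);
    }
  } finally {
    // Close the stream even if reading throws.
    converterStream.close();
  }
  return parts.join("");
}
```

Collecting the pieces in an array and joining them once at the end avoids repeated string concatenation for large streams.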

Reading Lines

There is currently no easy, general way to read a Unicode line from a stream.

For the limited use case of reading lines from a local file, the following code works. This code will not work for character sets that contain embedded null bytes, such as UTF-16 or UCS-4.

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The charset you want, using UTF-8 here */ "UTF-8";

// This assumes that 'file' is a variable that contains the file you want to read, as an nsIFile
var fis = Components.classes["@mozilla.org/network/file-input-stream;1"]
                    .createInstance(Components.interfaces.nsIFileInputStream);
fis.init(file, -1, -1, 0);

var lis = fis.QueryInterface(Components.interfaces.nsILineInputStream);
var lineData = {};
var cont;
do {
  cont = lis.readLine(lineData);
  var line = converter.ConvertToUnicode(lineData.value);

  // Now you can do something with line

} while (cont);
fis.close();
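The loop above can also be wrapped in a small helper that collects the converted lines. The following is only a sketch: readAllLines and its convert parameter are hypothetical names, and the helper relies on nothing beyond the readLine method shown above.

```javascript
// Hypothetical helper: collects all lines from 'lineStream' (an object with
// readLine(outParam), like nsILineInputStream) into an array, converting each
// raw line with the caller-supplied 'convert' function, for example
// function (bytes) { return converter.ConvertToUnicode(bytes); }.
function readAllLines(lineStream, convert) {
  var lines = [];
  var lineData = {};
  var more;
  do {
    // readLine returns false once the end of the stream is reached, but
    // lineData.value still holds the final line from that last call.
    more = lineStream.readLine(lineData);
    lines.push(convert(lineData.value));
  } while (more);
  return lines;
}
```

A do/while loop is used on purpose: testing the return value before processing the data would drop the last line. The caller remains responsible for closing the underlying file stream.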

Earlier versions

Earlier versions of Gecko do not provide an easy way to read Unicode data from a stream. You will have to manually read a block of data and convert it using nsIScriptableUnicodeConverter.

For example:

// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The charset you want, using UTF-8 here */ "UTF-8";

// Now, read from the stream
// This assumes istream is the stream you want to read from
var scriptableStream = Components.classes["@mozilla.org/scriptableinputstream;1"]
                                 .createInstance(Components.interfaces.nsIScriptableInputStream);
scriptableStream.init(istream);
var chunk = scriptableStream.read(4096);
var text = converter.ConvertToUnicode(chunk);

However, you must be aware that this method will not work for character sets that have embedded null bytes, such as UTF-16 or UCS-4.
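To see why such character sets are a problem, the sketch below encodes a short ASCII string as UTF-16 and shows what survives if the data is cut at the first zero byte. It is plain JavaScript and assumes that the failure comes from the byte data being handled as a null-terminated string; encodeUtf16le and cutAtFirstNull are hypothetical helpers written for this illustration.

```javascript
// Hypothetical helper: encodes a string as UTF-16LE bytes
// (low byte first, then high byte, for every 16-bit code unit).
function encodeUtf16le(text) {
  var bytes = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    bytes.push(unit & 0xFF, unit >> 8);
  }
  return bytes;
}

// Hypothetical helper: mimics an API that treats a zero byte as the end of the data.
function cutAtFirstNull(bytes) {
  var end = bytes.indexOf(0);
  return end == -1 ? bytes : bytes.slice(0, end);
}

var encoded = encodeUtf16le("Hi");       // [0x48, 0x00, 0x69, 0x00]
var surviving = cutAtFirstNull(encoded); // [0x48]
```

Every ASCII character in UTF-16 carries a zero byte, so under this assumption only a fragment of the first character would reach the converter.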

See also

Writing textual data

Revision Source

<h2 name="Reading_textual_data_from_streams.2C_files_and_sockets"> Reading textual data from streams, files and sockets </h2>
<div style="border:red 1px inset; background-color: #dd0;">
<p><i>Warning</i>: This article uses unfrozen interfaces. These interfaces may change
in newer Mozilla versions and your code may stop working.
</p>
</div>
<p>In order to read textual data, you need to know which <b><a href="en/Character_set">character set</a></b> the
data is in. Files and network sockets contain bytes, not characters - to give
these bytes a meaning, you need to know the character set.
</p>
<h3 name="Determining_the_character_set_of_data"> Determining the character set of data </h3>
<p>If you have a network channel (<a href="en/NsIChannel">nsIChannel</a>), you can try its contentCharset
property. Note that not all channels know the charset of the data.
In that case, you can fall back to the default charset stored in preferences (<code>intl.charset.default</code>, a string value).
</p><p>When reading from a file, the question is harder to answer. Using the system
character set may work (XXX insert text how to get it), or you can again fall back to the default charset from preferences.
</p>
<h3 name="Reading_strings"> Reading strings </h3>
<p>Starting with Gecko 1.8 (SeaMonkey 1.0, Firefox 1.5), you can use
<code>nsIConverterInputStream</code> to read strings from a stream. This work was done
in <a class="external" href="https://bugzilla.mozilla.org/show_bug.cgi?id=295047">bug 295047</a>.
</p><p>Usage:
</p>
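<pre class="eval">var charset = /* Need to find out what the charset is. Using UTF-8 for this example: */ "UTF-8";
// 'fis' is the input stream to read from, for example an initialized nsIFileInputStream
var is = Components.classes["@mozilla.org/intl/converter-input-stream;1"]
                   .createInstance(Components.interfaces.nsIConverterInputStream);
// The last argument is the character substituted for byte sequences that cannot be decoded
is.init(fis, charset, 1024, Components.interfaces.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER);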
</pre>
<p>Now you can read strings from <code>is</code>:
</p>
<pre class="eval">var str = {};
var numChars = is.readString(4096, str);
if (numChars != 0 /* EOF */)
  var read_string = str.value;
</pre>
<p>To read the entire stream and do something with the data:
</p>
<pre class="eval">var str = {};
while (is.readString(4096, str) != 0) {
  processData(str.value);
}
</pre>
<p>Don't forget to close the stream when you're done with it (<code>is.close()</code>). Not doing so can cause problems if you try to rename or delete the file at a later time on some platforms.
</p><p>Note that you may get fewer characters than you asked for, especially (but not only) at the end of the file (stream).
</p>
<h3 name="Reading_Lines"> Reading Lines </h3>
<p>There is currently no easy, general way to read a Unicode line from a stream.
</p><p>For the limited use case of reading lines from a local file, the following code works.
<b>This code will not work for character sets that contain embedded null bytes, such as UTF-16 or UCS-4.</b>
</p>
<pre class="eval">// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The charset you want, using UTF-8 here */ "UTF-8";

// This assumes that 'file' is a variable that contains the file you want to read, as an nsIFile
var fis = Components.classes["@mozilla.org/network/file-input-stream;1"]
                    .createInstance(Components.interfaces.nsIFileInputStream);
fis.init(file, -1, -1, 0);

var lis = fis.QueryInterface(Components.interfaces.nsILineInputStream);
var lineData = {};
var cont;
do {
  cont = lis.readLine(lineData);
  var line = converter.ConvertToUnicode(lineData.value);

  // Now you can do something with line

} while (cont);
fis.close();
</pre>
<h3 name="Earlier_versions"> Earlier versions </h3>
<p>Earlier versions of Gecko do not provide an easy way to read Unicode data from a stream.
You will have to manually read a block of data and convert it using nsIScriptableUnicodeConverter.
</p><p>For example:
</p>
<pre class="eval">// First, get and initialize the converter
var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                          .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
converter.charset = /* The charset you want, using UTF-8 here */ "UTF-8";
</pre>
<pre class="eval">// Now, read from the stream
// This assumes istream is the stream you want to read from
var scriptableStream = Components.classes["@mozilla.org/scriptableinputstream;1"]
                                 .createInstance(Components.interfaces.nsIScriptableInputStream);
scriptableStream.init(istream);
var chunk = scriptableStream.read(4096);
var text = converter.ConvertToUnicode(chunk);
</pre>
<p>However, you must be aware that this method <b>will not work</b> for character sets that have embedded null bytes, such as UTF-16 or UCS-4.
</p>
<h2 name="See_also"> See also </h2>
<p><a href="en/Writing_textual_data">Writing textual data</a>
</p>