Unicode signature

Discussion of the xtools Toolkit

Post by undavide »

Hi,
from Xbytor code:

Stdlib.writeXMLFile = function(file, xml) {
  if (!(xml instanceof XML)) {
    Error.runtimeError(19, "xml"); // "Bad XML parameter";
  }
  file.encoding = "UTF8";
  file.open("w", "TEXT", "????");

  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  // file.write("\uFEFF");
  file.lineFeed = "unix";
  file.write(xml.toXMLString());
  file.close();
};

What is the purpose of the unicode signature? It appears to be invisible when opening the file in a text editor - I assume it's good practice to include it, but I'm curious to know a bit more about it.

Thank you!

Davide Barranca
http://www.davidebarranca.com

Post by Mikaeru »

undavide wrote:
  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  // file.write("\uFEFF");

What is the purpose of the unicode signature? It appears to be invisible when opening the file in a text editor - I assume it's good practice to include it, but I'm curious to know a bit more about it.

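It is called a Byte Order Mark (BOM) and is used to signal the endianness (byte order) of a text file or stream. In UTF-8, where byte order is not an issue, it only acts as a signature that identifies the encoding.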
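
To make the byte sequences concrete, here is a minimal sketch in plain JavaScript (it does not touch the ExtendScript File object) that encodes U+FEFF by hand. The helper names utf8Bytes and hex are made up for this illustration.

```javascript
// Encode a single code point below 0x10000 as UTF-8 bytes.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Format an array of byte values as upper-case hex pairs.
function hex(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var h = bytes[i].toString(16).toUpperCase();
    out.push(h.length < 2 ? "0" + h : h);
  }
  return out.join(" ");
}

var BOM = 0xFEFF;
var utf8 = hex(utf8Bytes(BOM));               // "EF BB BF"
var utf16be = hex([BOM >> 8, BOM & 0xFF]);    // "FE FF" (big-endian)
var utf16le = hex([BOM & 0xFF, BOM >> 8]);    // "FF FE" (little-endian)
```

The same character therefore shows up as different bytes depending on the encoding, which is what lets a reader guess the encoding from the first few bytes.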

Quoted from the JavaScript Tools Guide:
Unicode I/O
The data of some of the common file formats (UCS-2, UCS-4, UTF-8, UTF-16) starts with a special byte order
mark (BOM) character (\uFEFF). The File.open method reads a few bytes of a file looking for this
character. If it is found, the corresponding encoding is set automatically and the character is skipped. If
there is no BOM character at the beginning of the file, open() reads the first 2 KB of the file and checks
whether the data might be valid UTF-8 encoded data, and if so, sets the encoding to UTF-8.
To write 16-bit Unicode files in UTF-16 format, use the encoding UCS-2. This encoding uses whatever
byte-order format the host platform supports.
When using UTF-8 encoding or 16-bit Unicode, always write the BOM character "\uFEFF" as the first
character of the file.

File object functions
open ()
The method attempts to detect the encoding of the open file. It reads a few bytes at the current
location and tries to detect the Byte Order Mark character 0xFFFE. If found, the current position is
advanced behind the detected character and the encoding property is set to one of the strings
UCS-2BE, UCS-2LE, UCS4-BE, UCS-4LE, or UTF-8. If the marker character is not found, it checks for
zero bytes at the current location and makes an assumption about one of the above formats (except
UTF-8). If everything fails, the encoding property is set to the system encoding.
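
The BOM check described in the guide can be sketched roughly like this. It is only an illustration of the idea, not Adobe's implementation: sniffBom is a hypothetical name, it works on a plain array of byte values, and the zero-byte and UTF-8 validity heuristics mentioned in the guide are left out.

```javascript
// Hypothetical helper: look at the first bytes and report which BOM,
// if any, they match. Returns null when no BOM is recognised.
function sniffBom(bytes) {
  function startsWith(sig) {
    if (bytes.length < sig.length) return false;
    for (var i = 0; i < sig.length; i++) {
      if (bytes[i] !== sig[i]) return false;
    }
    return true;
  }

  if (startsWith([0xEF, 0xBB, 0xBF])) return { encoding: "UTF-8", length: 3 };
  // The 4-byte signatures must be tested before the 2-byte ones,
  // because FF FE 00 00 also begins with the UCS-2LE signature FF FE.
  if (startsWith([0x00, 0x00, 0xFE, 0xFF])) return { encoding: "UCS-4BE", length: 4 };
  if (startsWith([0xFF, 0xFE, 0x00, 0x00])) return { encoding: "UCS-4LE", length: 4 };
  if (startsWith([0xFE, 0xFF])) return { encoding: "UCS-2BE", length: 2 };
  if (startsWith([0xFF, 0xFE])) return { encoding: "UCS-2LE", length: 2 };
  return null;
}

var detected = sniffBom([0xEF, 0xBB, 0xBF, 0x3C, 0x3F]);
// detected.encoding is "UTF-8", detected.length is 3
```

The length field is the number of bytes a reader would skip before the real content starts.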

HTH,

--Mikaeru

Post by xbytor »

What Mikaeru said.

I had a very XML-heavy project where this was fairly important, which is why that bit of text is there. However, since the XML header is '<?xml version="1.0" encoding="utf-8"?>' and we set the file encoding to UTF8, we don't really need the BOM.

Ooops! You have an older version of xtools.

The current version of that function looks like this:
Stdlib.writeXMLFile = function(fptr, xml) {
  var rc;
  if (!(xml instanceof XML)) {
    Error.runtimeError(19, "xml"); // "Bad XML parameter";
  }

  var file = Stdlib.convertFptr(fptr);
  file.encoding = "UTF8";

  rc = file.open("w", "TEXT", "????");
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  //file.write("\uFEFF");
  file.lineFeed = "unix";

  file.writeln('<?xml version="1.0" encoding="utf-8"?>');

  rc = file.write(xml.toXMLString());
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  rc = file.close();
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  return file;
};


BTW, the ExtendScript editor has options for writing a BOM into your jsx files.
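
If you want the BOM to be a per-call choice rather than a commented-out line, something along these lines might do. This is a sketch, not part of xtools: writeUtf8Text and its withBom flag are hypothetical, and the file argument is assumed to behave like an ExtendScript File object (encoding, lineFeed, open, write, close).

```javascript
// Hypothetical helper: write text as UTF-8, optionally preceded by a BOM.
// Returns true only if open, write and close all reported success.
function writeUtf8Text(file, text, withBom) {
  file.encoding = "UTF8";
  if (!file.open("w")) {
    return false;
  }
  file.lineFeed = "unix";
  // "\uFEFF" is converted to the bytes EF BB BF by the UTF8 encoding.
  var ok = file.write((withBom ? "\uFEFF" : "") + text);
  // Always close the file, even if the write failed.
  return file.close() && ok;
}
```

Passing false for withBom gives the same output as the current xtools function body, minus the error handling.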

Post by undavide »

Thank you both!
So did I get it correctly if I say that writing the BOM and setting file.encoding is redundant?
X, I extracted that Stdlib function from a chunk of code I found somewhere... ehm, I can't remember where, I guess in the bottomless pit of the PS-Scripts forums!
Thanks again,

Davide

Post by xbytor »

undavide wrote:
So did I get it correctly if I say that writing the BOM and setting file.encoding is redundant?

As long as you set file.encoding and write the '<?xml version="1.0" encoding="utf-8"?>' header, you're set and can safely ignore the BOM. The only time the BOM becomes relevant here is if you are doing something swishy with the file length and the file has a BOM, since in UTF-8 it adds three bytes (EF BB BF) at the start. I've got chunks of code lying around somewhere to deal with this.
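
For the case where the BOM does get in the way, a minimal sketch of the usual workaround is below. It is not the xtools code: hasBom and stripBom are hypothetical names, and it assumes the text was read in a way that left U+FEFF as the first character instead of skipping it.

```javascript
// Number of bytes a BOM occupies at the start of a UTF-8 file (EF BB BF).
var UTF8_BOM_BYTES = 3;

// True if the string begins with the BOM character U+FEFF.
function hasBom(text) {
  return text.charCodeAt(0) === 0xFEFF;
}

// Return the text without a leading BOM; other text is returned unchanged.
function stripBom(text) {
  return hasBom(text) ? text.substring(1) : text;
}

var raw = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?>";
var clean = stripBom(raw);
// clean starts with "<?xml" and is one character shorter than raw
```

When comparing against the file length on disk, remember that this single character accounts for UTF8_BOM_BYTES bytes in a UTF-8 file.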

Post by undavide »

Got it, thank you!

Davide