Unicode signature

Posted: Fri Sep 20, 2013 10:28 am
by undavide
Hi,
from Xbytor code:

Code:
Stdlib.writeXMLFile = function(file, xml) {
  if (!(xml instanceof XML)) {
    Error.runtimeError(19, "xml"); // "Bad XML parameter";
  }
  file.encoding = "UTF8";
  file.open("w", "TEXT", "????");

  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  // file.write("\uFEFF");
  file.lineFeed = "unix";
  file.write(xml.toXMLString());
  file.close();
};

What is the purpose of the unicode signature? It appears to be invisible when opening the file in a text editor - I assume it's good practice to include it, but I'm curious to know a bit more about it.

Thank you!

Davide Barranca
http://www.davidebarranca.com

Posted: Fri Sep 20, 2013 11:44 am
by Mikaeru
undavide wrote:
Code:
  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  // file.write("\uFEFF");

What is the purpose of the unicode signature? It appears to be invisible when opening the file in a text editor - I assume it's good practice to include it, but I'm curious to know a bit more about it.

It is called a Byte Order Mark (BOM) and is used to signal the endianness (byte order) of a text file or stream.
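
To make the "EF BB BF" comment in the code concrete, here is a minimal sketch that encodes the BOM code point U+FEFF by hand. The helper names (utf8Bytes, utf16Bytes, toHex) are made up for this illustration and are not part of xtools or the File API. It only handles code points below 0x10000, which is enough for the BOM. Note that UTF-8 has no byte order, so there the three bytes act purely as a signature.

```javascript
// Encode one code point below 0x10000 as UTF-8 bytes (hypothetical helper).
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  return [
    0xE0 | (codePoint >> 12),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F)
  ];
}

// Split one 16-bit code unit into two bytes in the requested byte order.
function utf16Bytes(codeUnit, littleEndian) {
  var hi = (codeUnit >> 8) & 0xFF;
  var lo = codeUnit & 0xFF;
  return littleEndian ? [lo, hi] : [hi, lo];
}

// Format a byte array as space-separated uppercase hex pairs.
function toHex(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var h = bytes[i].toString(16).toUpperCase();
    out.push(h.length < 2 ? "0" + h : h);
  }
  return out.join(" ");
}

var BOM = 0xFEFF;
var bomUtf8 = toHex(utf8Bytes(BOM));            // "EF BB BF"
var bomUtf16BE = toHex(utf16Bytes(BOM, false)); // "FE FF"
var bomUtf16LE = toHex(utf16Bytes(BOM, true));  // "FF FE"
```

The two UTF-16 results differ only in byte order, which is exactly what a reader uses to tell big-endian from little-endian data.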

Quoted from JavaScript Tools Guide:
Unicode I/O
The data of some of the common file formats (UCS-2, UCS-4, UTF-8, UTF-16) starts with a special byte order
mark (BOM) character (\uFEFF). The File.open method reads a few bytes of a file looking for this
character. If it is found, the corresponding encoding is set automatically and the character is skipped. If
there is no BOM character at the beginning of the file, open() reads the first 2 KB of the file and checks
whether the data might be valid UTF-8 encoded data, and if so, sets the encoding to UTF-8.
To write 16-bit Unicode files in UTF-16 format, use the encoding UCS-2. This encoding uses whatever
byte-order format the host platform supports.
When using UTF-8 encoding or 16-bit Unicode, always write the BOM character "\uFEFF" as the first
character of the file.

File object functions
open ()
The method attempts to detect the encoding of the open file. It reads a few bytes at the current
location and tries to detect the Byte Order Mark character 0xFFFE. If found, the current position is
advanced behind the detected character and the encoding property is set to one of the strings
UCS-2BE, UCS-2LE, UCS4-BE, UCS-4LE, or UTF-8. If the marker character is not found, it checks for
zero bytes at the current location and makes an assumption about one of the above formats (except
UTF-8). If everything fails, the encoding property is set to the system encoding.
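
The detection step described in the guide can be sketched roughly as below. This is an illustration only, not how File.open is actually implemented. The function names and the encoding labels (UTF-16BE and so on) are assumptions of this sketch and differ from the strings the guide lists. It takes the first bytes of a file as a plain array of numbers and checks the 4-byte signatures first, because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE one.

```javascript
// True if the byte array begins with the given prefix.
function startsWithBytes(bytes, prefix) {
  if (bytes.length < prefix.length) {
    return false;
  }
  for (var i = 0; i < prefix.length; i++) {
    if (bytes[i] !== prefix[i]) {
      return false;
    }
  }
  return true;
}

// Return { encoding, length } for a recognized BOM, or null if none is found.
// Longer signatures are listed first so UTF-32LE is not mistaken for UTF-16LE.
function detectBom(bytes) {
  var table = [
    { name: "UTF-32BE", bom: [0x00, 0x00, 0xFE, 0xFF] },
    { name: "UTF-32LE", bom: [0xFF, 0xFE, 0x00, 0x00] },
    { name: "UTF-8",    bom: [0xEF, 0xBB, 0xBF] },
    { name: "UTF-16BE", bom: [0xFE, 0xFF] },
    { name: "UTF-16LE", bom: [0xFF, 0xFE] }
  ];
  for (var i = 0; i < table.length; i++) {
    if (startsWithBytes(bytes, table[i].bom)) {
      return { encoding: table[i].name, length: table[i].bom.length };
    }
  }
  return null;
}

// Example: the first bytes of a UTF-8 file that starts with a BOM and "<?".
var detected = detectBom([0xEF, 0xBB, 0xBF, 0x3C, 0x3F]);
```

The returned length is the number of bytes to skip before the real content begins, which corresponds to the "current position is advanced behind the detected character" part of the guide.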

HTH,

--Mikaeru

Posted: Sat Sep 21, 2013 5:03 am
by xbytor
What Mikaeru said.

I had a very XML-heavy project where this was fairly important, which is why that bit of text is there. However, since the XML header is '<?xml version="1.0" encoding="utf-8"?>' and we set the file encoding to UTF8, we don't really need the BOM.

Ooops! You have an older version of xtools.

The current version of that function looks like this:
Code:
Stdlib.writeXMLFile = function(fptr, xml) {
  var rc;
  if (!(xml instanceof XML)) {
    Error.runtimeError(19, "xml"); // "Bad XML parameter";
  }

  var file = Stdlib.convertFptr(fptr);
  file.encoding = "UTF8";

  rc = file.open("w", "TEXT", "????");
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  // unicode signature, this is UTF16 but will convert to UTF8 "EF BB BF"
  // optional
  //file.write("\uFEFF");
  file.lineFeed = "unix";

  file.writeln('<?xml version="1.0" encoding="utf-8"?>');

  rc = file.write(xml.toXMLString());
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  rc = file.close();
  if (!rc && Stdlib.IOEXCEPTIONS_ENABLED) {
    Error.runtimeError(Stdlib.IO_ERROR_CODE, Stdlib.fileError(file));
  }

  return file;
};


BTW, the ExtendScript editor has options for writing a BOM into your jsx files.

Posted: Wed Sep 25, 2013 7:22 am
by undavide
Thank you both!
So did I get it correctly if I say that writing the BOM and setting file.encoding is redundant?
X, I extracted that Stdlib function from a chunk of code I found... ehm, can't remember where, I guess in the bottomless pit of the PS-Scripts forums!
Thanks again,

Davide

Posted: Wed Sep 25, 2013 2:23 pm
by xbytor
undavide wrote:So did I get it correctly if I say that writing the BOM and setting file.encoding is redundant?

As long as you set the file.encoding and write the '<?xml version="1.0" encoding="utf-8"?>' header, you're set and you can safely ignore the BOM. The only time the BOM becomes relevant in this case is if you are doing something swishy with the file length and there is a BOM. I've got chunks of code lying around somewhere to deal with this.
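
For the file-length case, something along these lines might do. This is a hedged sketch, not the xtools code mentioned above: hasBom, stripBom and payloadLength are hypothetical names, and the 3-byte constant assumes the file is UTF-8. It covers text that arrives with the U+FEFF character still at the front, for example when it was read as binary or came from another tool.

```javascript
// Number of bytes a BOM occupies in a UTF-8 file (EF BB BF).
var UTF8_BOM_BYTES = 3;

// True if the string begins with the BOM character U+FEFF.
function hasBom(text) {
  return text.length > 0 && text.charCodeAt(0) === 0xFEFF;
}

// Return the text without a leading BOM; text without one is returned as-is.
function stripBom(text) {
  return hasBom(text) ? text.substring(1) : text;
}

// Byte length of the content in a UTF-8 file, not counting a leading BOM.
function payloadLength(fileLength, bomPresent) {
  return bomPresent ? fileLength - UTF8_BOM_BYTES : fileLength;
}

var raw = "\uFEFF<root/>";
var clean = stripBom(raw);                   // "<root/>"
var bytes = payloadLength(10, hasBom(raw));  // 7
```

Only a BOM at the very start is removed; a U+FEFF that appears later in the text is left untouched.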

Posted: Wed Sep 25, 2013 3:12 pm
by undavide
Got it, thank you!

Davide