On Sat, Nov 14, 2015 at 02:37:17AM -0500, Karl Dahlke wrote:
> The Windows port has raised the issue of the byte order mark,
> which is prevalent in windows files, but virtually nonexistent in unix.
> If we do choose to support this, I would read the BOM,
> convert the file to utf8 for internal use, then convert it back with its BOM
> if that file or any portion of it was written to disk.

As someone who doesn't use Windows I don't really mind whether we support this
or not, but I suspect we probably should since we're getting a Windows port.

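For concreteness, a minimal sketch of the "read the BOM" step, assuming we only
care about the standard Unicode signatures. The enum and the function name
detect_bom are hypothetical, not existing edbrowse code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum bom_type {
	BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
};

/*
 * Look at the first bytes of a buffer and report which BOM, if any,
 * it starts with. *skip receives the BOM length in bytes (0 if none).
 */
static enum bom_type detect_bom(const unsigned char *buf, size_t len,
				size_t *skip)
{
	assert(buf != NULL && skip != NULL);
	*skip = 0;
	if (len >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3)) {
		*skip = 3;
		return BOM_UTF8;
	}
	/*
	 * UTF-32LE is tested before UTF-16LE because both begin with FF FE.
	 * A UTF-16LE file whose first character is NUL is misread as
	 * UTF-32LE here; that ambiguity is inherent in BOM sniffing.
	 */
	if (len >= 4 && !memcmp(buf, "\xFF\xFE\x00\x00", 4)) {
		*skip = 4;
		return BOM_UTF32LE;
	}
	if (len >= 4 && !memcmp(buf, "\x00\x00\xFE\xFF", 4)) {
		*skip = 4;
		return BOM_UTF32BE;
	}
	if (len >= 2 && !memcmp(buf, "\xFF\xFE", 2)) {
		*skip = 2;
		return BOM_UTF16LE;
	}
	if (len >= 2 && !memcmp(buf, "\xFE\xFF", 2)) {
		*skip = 2;
		return BOM_UTF16BE;
	}
	return BOM_NONE;
}
```

The caller would remember the returned type alongside the buffer, so the same
signature can be written back (or dropped) when the file is saved.
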
> There is a precedent for this.
> An iso8859 file is converted to utf8, then converted back upon write.
> Try it and see.
> But only iso8859-1, and even this we may not support for long,
> as unix / linux is almost 100% utf8 at this point.
> Anyway there is some machinery in place.

I was actually wondering whether this machinery could be switched to use iconv
for the conversion. We could then read all sorts of character sets and not have
to worry too much.
Maybe this, and the BOM, should be user commands, i.e.:
- Have some way of detecting the character encoding (libmagic, maybe) and set
  pcre appropriately, assuming we can do this.
- Have a command to convert the character set (with appropriate pcre mode
  changes etc).
- Also have a toggle to add a BOM (i.e. if we switch to a unicode charset), but
  don't add one automatically.
- The exception to the above would be the case where we have a BOM already, in
  which case a message is printed and the default state of the BOM toggle is
  changed to on.
This would also allow edbrowse to be used to remove a BOM when it's not needed.
This also means no internal conversion to utf8; that doesn't always behave
quite as expected, so it's probably best to allow the user to specify this.
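
A minimal sketch of what an iconv-based conversion might look like, assuming
the POSIX iconv(3) interface is available. The helper name convert_charset and
its signature are hypothetical, not existing edbrowse code, and the charset
names accepted depend on the local iconv implementation.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert inlen bytes of `in` from charset `from` to charset `to`.
 * Returns a malloc'd, NUL-terminated buffer with its length in *outlen,
 * or NULL on failure (unknown charset, invalid input, out of memory).
 */
static char *convert_charset(const char *from, const char *to,
			     const char *in, size_t inlen, size_t *outlen)
{
	assert(in != NULL && outlen != NULL);
	iconv_t cd = iconv_open(to, from);	/* note the order: to, from */
	if (cd == (iconv_t)-1)
		return NULL;

	size_t cap = inlen * 4 + 16;	/* first guess, doubled on E2BIG */
	char *out = malloc(cap + 1);
	char *inp = (char *)in;		/* iconv() wants char **, input is not modified */
	size_t inleft = inlen, used = 0;
	int flushing = 0;

	while (out) {
		char *outp = out + used;
		size_t outleft = cap - used;
		/* A NULL input asks iconv to emit any pending shift sequence. */
		size_t rc = flushing
		    ? iconv(cd, NULL, NULL, &outp, &outleft)
		    : iconv(cd, &inp, &inleft, &outp, &outleft);
		used = (size_t)(outp - out);
		if (rc != (size_t)-1) {
			if (flushing)
				break;		/* input consumed and flushed */
			flushing = 1;
			continue;
		}
		if (errno != E2BIG) {		/* EILSEQ or EINVAL: give up */
			free(out);
			out = NULL;
			break;
		}
		cap *= 2;			/* output full: grow and retry */
		char *bigger = realloc(out, cap + 1);
		if (!bigger)
			free(out);
		out = bigger;
	}
	iconv_close(cd);
	if (!out)
		return NULL;
	out[used] = '\0';
	*outlen = used;
	return out;
}
```

With something like this, the existing iso8859-1 round trip would be
convert_charset("ISO-8859-1", "UTF-8", ...) on read and the reverse call on
write, and a user-level convert command could pass whatever charset name was
asked for.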

> The real key for me is the search and substitute commands.
> These are under control of pcre, which runs in utf8 mode.
> /ni.o/ will match niño, with the dot matching
> the 2 byte utf8 char n tilde.
> So if everything is utf8 inside then all the searches and substitutes
> will work the way our international users would want and expect.

Unless they're using some strange charset which doesn't quite transliterate in the way we expect.
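
For reference, the byte-level detail behind the /ni.o/ example: in utf8 the
n tilde is the two bytes C3 B1, so niño is 5 bytes but 4 characters, whereas in
iso8859-1 it is the single byte F1. A matcher that treats the buffer as raw
bytes would need two dots to get across it. A small sketch of the difference
(utf8_length is a hypothetical helper, and it assumes valid utf8 input):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count characters (code points) in a NUL-terminated utf8 string by
 * skipping continuation bytes, which all have the form 10xxxxxx.
 */
static size_t utf8_length(const char *s)
{
	size_t n = 0;
	assert(s != NULL);
	for (; *s; ++s)
		if (((unsigned char)*s & 0xC0) != 0x80)
			++n;
	return n;
}
```

Here strlen("ni\xC3\xB1o") is 5 while utf8_length("ni\xC3\xB1o") is 4, which
is the gap a utf8-aware pcre closes for us.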

> This is thinking ahead, I don't expect to implement BOM tomorrow.

Agreed, which means we've got time to evaluate where we are with this stuff,
which is good.

Cheers,
Adam.