edbrowse-dev - development list for edbrowse
* [Edbrowse-dev] BOM
@ 2015-11-14  7:37 Karl Dahlke
  2015-11-14 10:17 ` Adam Thompson
From: Karl Dahlke @ 2015-11-14  7:37 UTC (permalink / raw)
  To: Edbrowse-dev

The Windows port has raised the issue of the byte order mark,
which is prevalent in Windows files, but virtually nonexistent in Unix.
If we do choose to support this, I would read the BOM,
convert the file to utf8 for internal use, then convert it back with its BOM
if that file or any portion of it was written to disk.
There is a precedent for this.
An iso8859 file is converted to utf8, then converted back upon write.
Try it and see.
But only iso8859-1, and even this we may not support for long,
as Unix / Linux is almost 100% utf8 at this point.
Anyway there is some machinery in place.
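
The round trip described above could be sketched roughly as follows. This is
only an illustration in Python, not edbrowse code: the names load_buffer and
save_buffer are hypothetical, and treating a file without a BOM as utf8 is an
assumption of the sketch.

```python
import codecs

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def load_buffer(raw: bytes):
    """Return (utf8_bytes, original_encoding, bom) for raw file contents."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8"), encoding, bom
    # No BOM: assume the file is already utf8 (an assumption of this sketch).
    return raw, "utf-8", b""

def save_buffer(utf8_bytes: bytes, encoding: str, bom: bytes) -> bytes:
    """Convert the internal utf8 buffer back to its on-disk form, BOM included."""
    return bom + utf8_bytes.decode("utf-8").encode(encoding)

# Usage: a UTF-16 LE file with a BOM survives the round trip unchanged.
on_disk = codecs.BOM_UTF16_LE + "ni\u00f1o\n".encode("utf-16-le")
internal, encoding, bom = load_buffer(on_disk)
restored = save_buffer(internal, encoding, bom)
```

Remembering the encoding and the BOM alongside the buffer is what lets the
write path restore the file to the form it had on disk.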

The real key for me is the search and substitute commands.
These are under control of pcre, which runs in utf8 mode.
/ni.o/ will match niño, with the dot matching
the 2 byte utf8 char n tilde.
So if everything is utf8 inside then all the searches and substitutes
will work the way our international users would want and expect.
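
To illustrate the difference between character-aware and byte-wise matching,
here is a small sketch. It uses Python's re module as a stand-in for pcre, so
it shows the general behaviour rather than pcre itself.

```python
import re

word = "ni\u00f1o"              # niño
utf8 = word.encode("utf-8")     # n tilde becomes the two bytes C3 B1

# Character-aware matching: '.' consumes the whole n tilde.
char_match = re.fullmatch(r"ni.o", word)

# Byte-wise matching: '.' consumes one byte, so the pattern falls short.
byte_match = re.fullmatch(rb"ni.o", utf8)

# Two dots are needed when the engine only sees raw bytes.
two_byte_match = re.fullmatch(rb"ni..o", utf8)
```

Here char_match and two_byte_match succeed while byte_match fails, which is
the behaviour a user searching for /ni.o/ would trip over without utf8 mode.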

This is thinking ahead, I don't expect to implement BOM tomorrow.

Karl Dahlke


* Re: [Edbrowse-dev] BOM
  2015-11-14  7:37 [Edbrowse-dev] BOM Karl Dahlke
@ 2015-11-14 10:17 ` Adam Thompson
From: Adam Thompson @ 2015-11-14 10:17 UTC (permalink / raw)
  To: Karl Dahlke; +Cc: Edbrowse-dev


On Sat, Nov 14, 2015 at 02:37:17AM -0500, Karl Dahlke wrote:
> The Windows port has raised the issue of the byte order mark,
> which is prevalent in Windows files, but virtually nonexistent in Unix.
> If we do choose to support this, I would read the BOM,
> convert the file to utf8 for internal use, then convert it back with its BOM
> if that file or any portion of it was written to disk.

As someone who doesn't use Windows I don't really mind if we support this or
not, but suspect we probably should since we're getting a Windows port.

> There is a precedent for this.
> An iso8859 file is converted to utf8, then converted back upon write.
> Try it and see.
> But only iso8859-1, and even this we may not support for long,
> as Unix / Linux is almost 100% utf8 at this point.
> Anyway there is some machinery in place.

I was actually wondering if this machinery could be switched to use iconv for
the conversion? We could then read all sorts of character sets and not have to
worry too much.
Maybe this, and the BOM, should be under user control, i.e.:
- Have some way of detecting the character encoding and set pcre appropriately
  (assuming we can do this, maybe with libmagic).
- Have a command to convert the character set (with appropriate pcre mode
  changes etc).
- Also have a toggle to add a BOM (i.e. if we switch to a unicode charset),
  but don't add one automatically.
- The exception to the above would be the case where we already have a BOM,
  in which case a message is printed and the default state of the BOM toggle
  is changed to on.
This would also allow edbrowse to be used to remove a BOM when it's not needed.
It also means no internal conversion to utf8; that doesn't always behave
quite as expected, so it's probably best to let the user specify it.
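
A rough sketch of how the commands above might fit together, written in Python
rather than C and using Python's codecs in place of iconv. The Buffer type and
the function names are hypothetical, and only the utf8 BOM is handled to keep
the example short.

```python
import codecs
from dataclasses import dataclass

@dataclass
class Buffer:
    data: bytes        # bytes exactly as stored, no implicit conversion
    charset: str       # charset the user (or a detector) has declared
    write_bom: bool    # the BOM toggle

def open_buffer(raw: bytes, charset: str = "utf-8") -> Buffer:
    """Strip a leading utf8 BOM if present and default the toggle to on."""
    if raw.startswith(codecs.BOM_UTF8):
        print("BOM found, BOM toggle set to on")
        return Buffer(raw[len(codecs.BOM_UTF8):], "utf-8", True)
    return Buffer(raw, charset, False)

def convert(buf: Buffer, target: str) -> None:
    """Explicit user command: re-encode the buffer into another charset."""
    buf.data = buf.data.decode(buf.charset).encode(target)
    buf.charset = target

def serialize(buf: Buffer) -> bytes:
    """Bytes to write to disk, honouring the BOM toggle."""
    is_utf8 = codecs.lookup(buf.charset).name == "utf-8"
    bom = codecs.BOM_UTF8 if buf.write_bom and is_utf8 else b""
    return bom + buf.data

# Usage: open a file that has a BOM, then switch the toggle off to drop it.
buf = open_buffer(codecs.BOM_UTF8 + b"hello\n")
buf.write_bom = False
without_bom = serialize(buf)
```

Nothing is converted unless the user asks for it, and the BOM is written only
when the toggle is on, which is the behaviour proposed above.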

> The real key for me is the search and substitute commands.
> These are under control of pcre, which runs in utf8 mode.
> /ni.o/ will match niño, with the dot matching
> the 2 byte utf8 char n tilde.
> So if everything is utf8 inside then all the searches and substitutes
> will work the way our international users would want and expect.

Unless they're using some strange charset which doesn't quite transliterate in the way we expect.

> This is thinking ahead, I don't expect to implement BOM tomorrow.

Agreed, which means we've got time to evaluate where we are with this stuff,
which is good.

Cheers,
Adam.

