From: Adam Thompson
To: Karl Dahlke
Cc: Edbrowse-dev@lists.the-brannons.com
Date: Sat, 14 Nov 2015 10:17:40 +0000
Subject: Re: [Edbrowse-dev] BOM
Message-ID: <20151114101740.GC2985@toaster.adamthompson.me.uk>
In-Reply-To: <20151014023717.eklhad@comcast.net>

On Sat, Nov 14, 2015 at 02:37:17AM -0500, Karl Dahlke wrote:
> The Windows port has raised the issue of the byte order mark,
> which is prevalent in windows files, but virtually nonexistent in unix.
> If we do choose to support this, I would read the BOM,
> convert the file to utf8 for internal use, then convert it back with its BOM
> if that file or any portion of it was written to disk.

As someone who doesn't use Windows I don't really mind if we support this or
not, but suspect we probably should since we're getting a Windows port.

> There is a precedent for this.
> An iso8859 file is converted to utf8, then converted back upon write.
> Try it and see.
> But only iso8859-1, and even this we may not support for long,
> as unix / linux is almost 100% utf8 at this point.
> Anyway there is some machinery in place.
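A rough sketch of the read / convert / write-back cycle described in the quoted text, written in Python rather than edbrowse's C and meant purely as an illustration. The names read_buffer and write_buffer are made up for this example, and treating a file without a BOM as already being utf8 is an assumption of the sketch.

```python
import codecs

# Known byte order marks. The UTF-32 LE mark starts with the UTF-16 LE mark,
# so the longer marks are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_buffer(raw: bytes):
    """Return (utf8_bytes, bom, encoding); bom is b"" if the file had none."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            # Strip the mark, then convert the rest to utf8 for internal use.
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8"), bom, encoding
    # No BOM found: assumed (hypothetically) to be utf8 already.
    return raw, b"", "utf-8"


def write_buffer(utf8_bytes: bytes, bom: bytes, encoding: str) -> bytes:
    """Convert the internal utf8 buffer back and restore the original BOM."""
    return bom + utf8_bytes.decode("utf-8").encode(encoding)
```

Remembering the mark and encoding next to the buffer is what lets the write step reproduce the file as it was read.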
I was actually wondering if this machinery could be switched to use iconv for
the conversion? We could then read all sorts of character sets and not have to
worry too much.

Maybe this, and the BOM, should be a user command, i.e.:
- Have some way of detecting the character encoding and set pcre up
  appropriately (assuming we can do this, libmagic maybe),
- Have a command to convert the character set (with appropriate pcre mode
  changes etc),
- Also have a toggle to add a BOM (i.e. if we switch to a unicode charset) but
  don't add one automatically,
- The exception to the above would be the case where we have a BOM already,
  in which case a message is printed and the default state of the BOM toggle
  is changed to on.

This would also allow edbrowse to be used to remove a BOM when it's not needed.
This also means no automatic internal conversion to utf8; that doesn't always
behave quite as expected, so it's probably best to let the user specify it.

> The real key for me is the search and substitute commands.
> These are under control of pcre, which runs in utf8 mode.
> /ni.o/ will match niño, with the dot matching
> the 2 byte utf8 char n tilde.
> So if everything is utf8 inside then all the searches and substitutes
> will work the way our international users would want and expect.

Unless they're using some strange charset which doesn't quite transliterate in
the way we expect.

> This is thinking ahead, I don't expect to implement BOM tomorrow.

Agreed, which means we've got time to evaluate where we are with this stuff,
which is good.

Cheers,
Adam.
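To make the quoted /ni.o/ example concrete, here is a small sketch using Python's re module as a stand-in for pcre (an assumption; the two engines are not identical). Matching on decoded text, or on iso8859-1 bytes, lets the dot cover the n tilde, whereas a byte-oriented match over utf8 does not, because the character occupies two bytes. This is why the pcre mode would have to follow the charset if there is no internal conversion.

```python
import re

word = "ni\u00f1o"  # n, i, n tilde, o

# Character-aware matching: the dot consumes the whole n tilde.
char_match = re.search(r"ni.o", word)

# Byte-oriented matching over utf8: n tilde is two bytes (0xC3 0xB1),
# so a single dot cannot cover it and the search fails.
utf8_byte_match = re.search(rb"ni.o", word.encode("utf-8"))

# Byte-oriented matching over iso8859-1: n tilde is one byte (0xF1),
# so the same pattern matches without any utf8 mode.
latin1_byte_match = re.search(rb"ni.o", word.encode("iso-8859-1"))

print(bool(char_match), bool(utf8_byte_match), bool(latin1_byte_match))
```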