From: Karl Dahlke
To: Edbrowse-dev@lists.the-brannons.com
Date: Sat, 14 Nov 2015 02:37:17 -0500
Subject: [Edbrowse-dev] BOM

The Windows port has raised the issue of the byte order mark, which is prevalent in Windows files but virtually nonexistent in Unix.
If we do choose to support this, I would read the BOM, convert the file to utf8 for internal use, then convert it back, with its BOM, if that file or any portion of it is written to disk.

There is a precedent for this. An iso8859 file is converted to utf8, then converted back upon write. Try it and see. But this covers only iso8859-1, and even that we may not support for long, as unix / linux is almost 100% utf8 at this point. Anyway, there is some machinery in place.

The real key for me is the search and substitute commands. These are under control of pcre, which runs in utf8 mode. /ni.o/ will match niño, with the dot matching the 2 byte utf8 char n tilde. So if everything is utf8 inside, then all the searches and substitutes will work the way our international users would want and expect.

This is thinking ahead; I don't expect to implement BOM tomorrow.

Karl Dahlke
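
For illustration only, here is a rough sketch of the round trip described above: detect the BOM, convert to utf8 for internal use, search in utf8, then convert back with the BOM on write. It is written in Python rather than edbrowse's C, and it is not edbrowse code. The names BOMS, read_buffer and write_buffer are made up for this example, Python's re module stands in for pcre in utf8 mode, and a file without a BOM is simply assumed to be utf8 already.

```python
import codecs
import re

# (BOM bytes, Python codec name). The UTF-32 entries come first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
# This table and the function names below are hypothetical, for illustration.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_buffer(raw):
    """Return (utf8_bytes, bom, codec); bom is b"" when the file has none."""
    for bom, codec in BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(codec)
            return text.encode("utf-8"), bom, codec
    # Assumption: no BOM means the bytes are already utf8.
    return raw, b"", "utf-8"


def write_buffer(utf8_bytes, bom, codec):
    """Convert the internal utf8 buffer back to the file's original form."""
    return bom + utf8_bytes.decode("utf-8").encode(codec)


# A UTF-16 LE file with a BOM, as a Windows editor might produce.
original = codecs.BOM_UTF16_LE + "niño\n".encode("utf-16-le")
internal, bom, codec = read_buffer(original)

# Character-aware search: the dot matches the whole n tilde.
match = re.search("ni.o", internal.decode("utf-8"))

# Byte-wise search on the same buffer fails: n tilde is 2 bytes in utf8,
# and the dot only consumes one of them.
bytewise = re.search(b"ni.o", internal)

# Writing converts back and restores the BOM.
round_trip = write_buffer(internal, bom, codec)
```

The point of keeping the BOM and codec alongside the buffer is that the write path can restore the file's original form without guessing, much as the existing iso8859-1 handling converts back upon write.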