From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <6e35c0620803031548v583c051flb4d9ace76e220998@mail.gmail.com>
Date: Mon, 3 Mar 2008 14:48:01 -0900
From: "Jack Johnson"
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
Subject: Re: [9fans] awk, not utf aware...
In-Reply-To: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>
Topicbox-Message-UUID: 6de463fa-ead3-11e9-9d60-3106f5b1d025

On Thu, Feb 28, 2008 at 6:10 AM, erik quanstrom wrote:
> perhaps it would be more effective to break down the concept
> a bit. instead of a general locale hammer, why not expose some
> operations that could go into a locale? for example, have a
> base-character folding switch that allows regexps to fold codepoints
> into base codepoints so that íïìîi -> i. this information is in the
> unicode tables. perhaps the language-dependent character mapping
> should be specified explicitly. &c.

Loosely-related tangent:

http://www.mail-archive.com/rsync@lists.samba.org/msg20395.html

> On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
>
> I was very astonished, when I copied a mac-filename, pasted into a
> texteditor and looked at the file:
>
> In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> means the letter "a" followed by a $0308. (Combining diacritical marks)
> So the Mac combines the letter a with the two points above it instead
> using the E4 letter
> Now the things are clear: The filenames are different, in spite of
> looking equally.

So, if folding codepoints is a reasonable tactic, how many
representations do you need to fold? How many binary representations
are needed to fold íïìîi -> i?

-Jack
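
To make the two encodings in the quoted rsync message concrete, here is a minimal sketch in Python using the standard unicodedata module. It is not from the thread; the helper name fold_to_base and the variable names are hypothetical, chosen only for illustration. It reproduces the byte sequences $C3A4 and $61CC88 and shows one possible way to fold accented letters to their base letter: decompose to NFD, then drop the combining marks. Under that approach each accented letter has at least two code point sequences to handle, the precomposed one and the base letter plus combining mark, and both fold to the same result.

```python
import unicodedata


def fold_to_base(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


precomposed = "\u00e4"  # U+00E4 as a single code point (the Linux filename)
decomposed = "a\u0308"  # "a" + U+0308 COMBINING DIAERESIS (the Mac filename)

print(precomposed.encode("utf-8").hex())  # c3a4
print(decomposed.encode("utf-8").hex())   # 61cc88

# The strings differ code point by code point, although they render alike.
print(precomposed == decomposed)  # False

# Normalizing to one form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Folding U+00ED U+00EF U+00EC U+00EE and a plain "i" to the base letter.
print(fold_to_base("\u00ed\u00ef\u00ec\u00eei"))  # iiiii

# Both representations of the same letter fold to the same base letter.
print(fold_to_base(precomposed) == fold_to_base(decomposed))  # True
```

This only covers letters that have a canonical decomposition in the Unicode tables; characters without one (for example ø) pass through unchanged and would need an explicit mapping.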