From: erik quanstrom
Date: Thu, 28 Feb 2008 10:10:29 -0500
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
Message-ID: <6e3f8d391c4219c9b0acb323d74603d6@quanstro.net>

i had to dig this off 9fans.net/archive. htmlfmt does some very bad things
with non-ascii characters. i hope i put them back correctly.

> Yes, and then there is locale: does [a-z] include ĳ when you run it
> in Holland (it should)? Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)? And what happens when
> you evaluate "è" < "o" (it depends)?
>
> Fixing awk is much harder than anyone thinks. I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run? ☺ how do you write
a program that processes byte streams from a dutch user and from
a norwegian? how does one deal with a multi-language file?

i see some problems with localized regexps. as with pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is. two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file, which makes multi-language files
difficult.

perhaps it would be more effective to break down the concept a bit.
instead of a general locale hammer, why not expose some operations
that could go into a locale? for example, have a base-character
folding switch that allows regexps to fold codepoints into base
codepoints so that íïìîi -> i. this information is in the unicode tables.
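
a rough sketch of the folding idea, not anything awk or plan 9 actually
ships: it assumes python and its standard unicodedata module, and the name
fold_base is made up for illustration. it uses the canonical decomposition
from the unicode tables and drops the combining marks, so a plain [a-z]
class matches the folded text.

```python
import re
import unicodedata


def fold_base(text: str) -> str:
    """Fold each codepoint to its base codepoint (hypothetical helper).

    Decompose to NFD using the Unicode tables, then drop every codepoint
    with a nonzero canonical combining class (the accents).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    print(fold_base("íïìîi"))
    # With folding switched on, an ASCII-only class matches accented input.
    print(bool(re.fullmatch(r"[a-z]+", fold_base("naïve"))))
    # Only canonical decompositions are used: å folds to a, ø is left alone
    # because the tables give it no decomposition.
    print(fold_base("øå"))
```

this only covers what the tables encode as base plus combining mark. whether
ø or ĳ should fold is exactly the kind of language-dependent choice that
would still have to be stated somewhere.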
perhaps the language-dependent character mapping should be specified
explicitly. &c.

- erik