From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
From: erik quanstrom
Date: Wed, 10 Oct 2007 08:22:42 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
In-Reply-To: <6e35c0620710092317x5d704601k70090c8ed2cbff0e@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: cdb1c72e-ead2-11e9-9d60-3106f5b1d025

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in. Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
>
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this. or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have nñ together so [a-z] would
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
>
> But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α.

i don't think one can go this route. you can't have a magic environment
variable that changes everything. testing is a nightmare in such a world.
you have to go through every combination of (data cs, locale) to see if
things are working.

a better solution is to use the properties of unicode. ñ is noted in the table as

    00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield. it would not be
hard to build a table quickly mapping a codepoint to its base codepoint σ.
but it would probably be most useful to also have a mapping from base
codepoints to all composed forms ξ.

suppose, for lack of creativity, we use » to mean all codepoints whose base
codepoint matches the next item, so »a matches ä, as does »[a-z].
so » of a letter c can be grepped for by taking ξσ(c), which results in
a character class.

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data. it would
be easy to adapt them to generate ξ and σ. (the tables would be
pretty big.)

>
> What an ugly problem.

it can be made ugly quickly. but i'm not convinced that all approaches
to this problem are bad.

- erik
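
The σ and ξ tables and the » expansion described in the message can be sketched in Python. This is only an illustration under stated assumptions, not the rc scripts or the Plan 9 C library code the message mentions. It assumes Python's standard unicodedata module as the source of field 6 (the decomposition), follows the first subfield repeatedly until no decomposition is left, skips compatibility decompositions (the ones tagged like <compat>), and covers only the Basic Multilingual Plane by default. The names sigma, build_xi and char_class are hypothetical.

```python
import re
import unicodedata


def sigma(c):
    """Return the base character of c (the σ mapping in the message)."""
    while True:
        decomp = unicodedata.decomposition(c)
        # Stop when there is no decomposition, or only a compatibility one
        # such as "<compat> 0020 0308" (skipping those is an assumption).
        if not decomp or decomp.startswith("<"):
            return c
        # The first subfield of the decomposition is the next base candidate.
        c = chr(int(decomp.split()[0], 16))


def build_xi(limit=0x10000):
    """Return ξ: a dict from each base character to every character below
    `limit` that σ maps onto it (the base character itself included)."""
    xi = {}
    for cp in range(limit):
        c = chr(cp)
        xi.setdefault(sigma(c), []).append(c)
    return xi


def char_class(chars, xi):
    """Expand »chars into a regex character class: the union of ξσ(c)
    over every c in chars. `chars` is assumed to be non-empty."""
    members = set()
    for c in chars:
        members.add(c)
        members.update(xi.get(sigma(c), []))
    # re.escape keeps characters like ']' or '-' from ending or
    # reshaping the class.
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"
```

With xi = build_xi(), char_class("a", xi) gives a class that matches ä, and passing the letters a through z gives one that also matches ñ. Because sigma follows the chain of first subfields, ἄ reduces to α as well. The table is built once and is large, in line with the remark that the tables would be pretty big.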