From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Wed, 10 Oct 2007 00:02:38 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
In-Reply-To: <6e35c0620710092030u1187029dhf54f67e48a62b85c@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Topicbox-Message-UUID: cd87f2c8-ead2-11e9-9d60-3106f5b1d025

> Yes, old thread, sorry. Blame Uriel.
>
> On 9/18/07, Douglas A. Gwyn wrote:
> > erik quanstrom wrote:
> > > suppose Linux user a and user b grep the same "text" file for the same string.
> > > results will depend on the users' locales.
> >
> > But if they're trying to match an alphabetic character class, the
> > result *should* depend on the locale.
>
> This baffles me. Can anyone think of examples where one might want
> differing results depending on your locale?
>
> -Jack

i think i see what the reasoning is.  the thought is that, e.g.,
in spanish [a-z] should match ñ.

the problem is that this means grep(regexp, data) now returns
a set of results, one for each locale.

so on the one hand, one would like [a-z] to do the Right Thing, depending
on language.  and on the other hand, one wants grep(regexp, data)
to return a single result.

i think the way to see through this issue is to notice that
the reason we want ñ to be in [a-z] is visual similarity.
what if we were dealing with chinese?  i think it's pretty clear
that [a-z] should map to a contiguous set of unicode codepoints.

if you want to deal with ñ, the unicode tables do note that
ñ is n+combining ~, so one could come up with a new denotation
for base codepoint.  unfortunately, combining that with existing
regexp would be a bit painful.

- erik
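
a rough sketch of the "base codepoint" idea, in python rather than anything from the plan 9 tools.  it assumes NFD decomposition from the standard unicodedata module is an acceptable stand-in for "the unicode tables", and base_codepoints and grep_base are hypothetical names made up for illustration.  [a-z] stays a plain contiguous codepoint range; the locale never enters into it.

```python
import re
import unicodedata


def base_codepoints(text):
    """Reduce each character to its base codepoint.

    NFD decomposition splits ñ into n + U+0303 (combining tilde);
    dropping the combining marks leaves only the base codepoints.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def grep_base(pattern, lines):
    """Hypothetical helper: match pattern against the base-codepoint
    form of each line, but return the original lines."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(base_codepoints(line))]


if __name__ == "__main__":
    # [a-z] alone is a contiguous codepoint range, so ñ is outside it.
    print(re.fullmatch("[a-z]", "ñ"))
    # Matching on base codepoints gives one result, independent of locale.
    print(grep_base("^[a-z]+$", ["año", "ano", "中文"]))
```

this only filters whole lines.  match offsets in the base-codepoint form no longer line up with the original text, which is one place the "bit painful" part shows up.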