9fans - fans of the OS Plan 9 from Bell Labs
From: erik quanstrom <quanstro@quanstro.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] simplicity
Date: Wed, 10 Oct 2007 08:22:42 -0400
Message-ID: <d805c5d583d15a45010664ab8e838326@quanstro.net>
In-Reply-To: <6e35c0620710092317x5d704601k70090c8ed2cbff0e@mail.gmail.com>

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in.  Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
> 
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this.  or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have n and ñ adjacent so [a-z] would
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
> 
> But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic environment
variable that changes everything.  testing is a nightmare in such a world.
you have to go through every combination of (data character set, locale)
to see if things are working.

a better solution is to use the properties of unicode.  ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would not be hard
to quickly build a table σ mapping a codepoint to its base codepoint.
but it would probably be most useful to also have a mapping ξ from
base codepoints to all their composed forms.
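
a rough sketch of building both tables, in python rather than rc or c.  it
assumes input lines in the unicode data format quoted above; the names
build_tables and SAMPLE are made up for illustration.  skipping the tagged
(compatibility) decompositions and following chains of decompositions down
to the final base are assumptions of this sketch, not something the
proposal above specifies.

```python
from collections import defaultdict

# the record quoted above, in the unicode data field layout
SAMPLE = [
    "00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;"
    "latin small letter n tilde;;00d1;;00d1",
]

def build_tables(lines):
    """Return (sigma, xi): codepoint -> base, and base -> set of composed forms."""
    sigma = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5]:
            continue                      # no decomposition recorded
        parts = fields[5].split()
        if parts[0].startswith("<"):
            continue                      # assumption: skip tagged decompositions
        # the first subfield of field 6 is the base codepoint
        sigma[int(fields[0], 16)] = int(parts[0], 16)

    # assumption: a base that itself decomposes is followed to its own base
    for cp in list(sigma):
        base = sigma[cp]
        while base in sigma:
            base = sigma[base]
        sigma[cp] = base

    # xi is sigma inverted
    xi = defaultdict(set)
    for cp, base in sigma.items():
        xi[base].add(cp)
    return sigma, dict(xi)

sigma, xi = build_tables(SAMPLE)
print(hex(sigma[0xF1]), [hex(c) for c in xi[0x6E]])
```

fed the full data file instead of SAMPLE, the same function would produce
the complete tables.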

suppose, for lack of creativity, we use » to mean "match every character
whose base codepoint matches the next item", so »a matches ä, as does »[a-z].
then » applied to a letter c can be grepped by computing ξ(σ(c)), which
yields a character class.
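
a minimal sketch of that expansion, again in python.  it leans on python's
own unicodedata module in place of hand-built tables, and the names sigma,
build_xi and expand are hypothetical.  limiting the scan to codepoints below
0x10000 and putting the base letter itself into the class are assumptions
made here to keep the example small and useful.

```python
import re
import unicodedata
from collections import defaultdict

def sigma(ch):
    """Follow canonical decompositions until a character with none is reached."""
    while True:
        d = unicodedata.decomposition(ch)
        if not d or d.startswith("<"):    # none, or a tagged (compatibility) one
            return ch
        ch = chr(int(d.split()[0], 16))   # first subfield is the base

def build_xi(limit=0x10000):
    """Invert sigma: base character -> set of characters that reduce to it."""
    xi = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        base = sigma(ch)
        if base != ch:
            xi[base].add(ch)
    return xi

XI = build_xi()

def expand(c):
    """Character class for »c: the base of c plus every composed form of it."""
    base = sigma(c)
    members = sorted({base} | XI.get(base, set()))
    return "[" + "".join(re.escape(m) for m in members) + "]"

# »n spliced into an ordinary pattern finds both spellings
pattern = re.compile("ni" + expand("n") + "o")
print(bool(pattern.search("el niño")), bool(pattern.search("el nino")))
```

»[a-z] could be handled the same way by taking the union of expand(c) over
every c in the range.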

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data.  it would
be easy to adapt them to generate ξ and σ.  (the tables would be pretty big.)

> 
> What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all approaches
to this problem are bad.

- erik

