9fans - fans of the OS Plan 9 from Bell Labs
From: erik quanstrom <quanstro@quanstro.net>
To: 9fans@9fans.net
Subject: Re: [9fans] Woes of New Language Support
Date: Sun, 26 Jul 2009 20:28:00 -0400
Message-ID: <4b18bb6a5d0b7cedb5651c485ddae489@quanstro.net>
In-Reply-To: <6e35c0620907261139u610c0431rbc3ecff6b16def29@mail.gmail.com>

On Sun Jul 26 14:40:56 EDT 2009, knapjack@gmail.com wrote:
> If I'm reading you right, you're saying it might be easier if
> everything were encoded as combining (or maybe more aptly
> non-combining) codes, regardless of language?
>
> So, we might encode 'Waffles' as w+upper a f f l e s and let the
> renderer (if there is one) handle the presentation of the case shift
> and the potential ligature, but things like grep get noticeably easier
> with no overlap of ő and o+umlaut.
>
> Again, oversimplified, with no real understanding on my part of the
> depth or breadth of the problem space.

you understand.  except, i was taking the opposite position.

if you did for english what is done for indic languages, then
when you typed 'this is a sentence.' the 't' would be capitalized
as soon as you typed the '.'.  there's no hint that this rule
needs to be applied; the renderer would just have to know
it.  in ak's example a certain combination of codepoints yields
a specific 'letter'.  (i hope i have that right.)  the renderer is
just supposed to know this.  so for consistency, and to reduce
the need for complicated language-specific rules (how do we know
that the text represented is actually from the language we think
it is?), i would force the producer to declare the combinations.

btw, the search problem is not at all solved by standardizing
(or is that standardising?) the combiners.  consider
the following bits of unicode fun:

; grep 'zero width' /lib/unicode
200b	zero width space
200c	zero width non-joiner
200d	zero width joiner
feff	zero width no-break space

i'm sure that someone more conversant in unicode could
point out other points of real difficulty.

how do you tell unicode from uni\ufeffcode?  not only
is that an annoyance, but it could be a pretty interesting
security problem.  and what a gift for spammers!
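
a minimal sketch of the point above, in python rather than anything
from plan 9.  it assumes NFC normalization as a stand-in for
"standardizing the combiners"; ZERO_WIDTH and strip_zero_width are
hypothetical names made up for this illustration, and the set holds
only the four code points from the /lib/unicode listing, not every
invisible character.

```python
import unicodedata

# the four code points from the grep of /lib/unicode above
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def strip_zero_width(s):
    """Drop the four zero width code points (hypothetical helper)."""
    return "".join(c for c in s if c not in ZERO_WIDTH)


plain = "unicode"
sneaky = "uni\ufeffcode"  # renders the same as plain in most fonts

# standardizing the combiners: after NFC, o + combining diaeresis
# compares equal to the precomposed letter
composed = "\u00f6"
decomposed = "o\u0308"
combiners_match = unicodedata.normalize("NFC", decomposed) == composed

# normalization leaves the zero width code point in place,
# so the two strings still differ and a plain substring search misses
still_differs = unicodedata.normalize("NFC", sneaky) != plain
found_naive = plain in sneaky

# only an explicit filter makes the search succeed
found_stripped = plain in strip_zero_width(sneaky)

print(combiners_match, still_differs, found_naive, found_stripped)
```

stripping is not a general fix: the joiner and non-joiner carry
meaning in some scripts (indic conjuncts, persian), so a filter like
this is only reasonable for comparison keys, not for stored text.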

- erik



