Re: Unicode stuff (was: Re: Specifying BibTeX engine)

From: "Mojca Miklavec" <mojca.miklavec.lists@gmail.com>
Cc: Jonathan Kew <jonathan_kew@sil.org>,
	Philipp Reichmuth <reichmuth@web.de>
Subject: Re: Unicode stuff (was: Re: Specifying BibTeX engine)
Date: Thu, 9 Nov 2006 17:47:31 +0100
Message-ID: <6faad9f00611090847l67df5b49w9a3ad313bad7cf2b@mail.gmail.com>
In-Reply-To: <eiieio$b7s$1@sea.gmane.org>

On 11/4/06, Philipp Reichmuth wrote:
> I've been starting to reuse some of this work in a script to do active
> character assignment for XeTeX depending on what glyphs are present in
> an OpenType font, so that those characters for which the font doesn't
> have a glyph are generated by ConTeXt.  Basically I want to produce
> something like this:
>
> \ifnum\XeTeXcharglyph"010D=0
>      \catcode`č=\active \def č{\ccaron}
> \else
>      \catcode`č=\letter
> \fi % ConTeXt knows this letter -> better hyphenation
>
> \ifnum\XeTeXcharglyph"1E0D=0
>      \catcode`ḍ=\active \def ḍ{\b{d}}
> \else
>      \catcode`ḍ=\letter
> \fi % ConTeXt doesn't know this letter

There is no reason not to add it.

> (with \other, respectively, for non-letters).  Being somewhat of a
> novice to TeX programming, I'm not sure if this will work, though, and
> I'm also not sure if it's better to generate static scripts that do this
> for every font (so the resulting TeX file is a font-specific big list of
> \catcode`$CHARACTERs) or to do this dynamically on every font change,
> maybe limited to selectable Unicode ranges (which is more general but
> also a lot slower).

Generating this for every single font would be stupid. This should be
part of low-level XeTeX (Jonathan has promised to look into it at some
point). In my opinion the best way to deal with it would be the ability
to define a fallback definition for *every* missing letter in a font.
So if "ddotbelow" is missing from your font, XeTeX would ask ConTeXt
whether a fallback definition has been provided for that glyph. If
yes, it would fall back to it ("\b{d}"); if the glyph is present in
the font, XeTeX would simply use it.
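
Until something like that exists in the engine, the same decision can be
approximated at the macro level. The following is only a rough sketch,
not a tested solution: it assumes the current font is a native
(OpenType/AAT) font, since \XeTeXcharglyph is only meaningful for those,
and it checks the font each time the character is used rather than once
per font. The fallback branch may still interfere with hyphenation.

  % Sketch only: pick glyph or fallback each time the character is used.
  % \XeTeXcharglyph gives glyph 0 when the current font lacks U+1E0D.
  \catcode`ḍ=\active
  \defḍ{\ifnum\XeTeXcharglyph"1E0D=0
          \b{d}%      no glyph: let ConTeXt build the letter
        \else
          \char"1E0D  % glyph present: typeset it from the font
        \fi}

Using \char in the second branch, instead of writing ḍ again, avoids
having the active character call itself.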

> > I'd prefer to see a context encoding added to GNU recode for the
> > benefit of future archeologists trying to decipher ancient documents.
>
> That would be better I guess, but isn't ConTeXt encoding a moving target
> in that characters can still get added?  Or is the list fixed to AGL
> glyph names and nothing else?

No, it's certainly not fixed to AGL. But I wouldn't object to adding it
to GNU recode (on top of "(La)TeX", which also recognizes \v, \b, ...)
if someone decided to do a good revision of it, if more people thought
it would be useful, and if the developers were open to the idea. I try
to use Unicode when writing sources whenever possible.

Mojca

PS for Philipp: I didn't try out your definitions, but here is an
excerpt from an older conversation as an example of what certainly
doesn't work under XeTeX ;)
(The answer was written by Jonathan Kew.) I was trying to write a few
macros to support the old tfm-based fonts, but figured out that that
was the wrong starting point (and for a different reason than yours).

> \catcode`ð=\active \defð{^^f0}
> \starttext
> Testing ... ð
> \stoptext
>
> and it seems to enter some infinite loop when ð is encountered (I can
> define any other letter as well, but only ^^f0 is causing problems).

No, this seems to me like it's the wrong way to define the character!
And I think you would have the same problem with other letters if
trying to define them as their own codes; the ones that work for you
must be getting defined as *different* codes from the original input.

The ^^xx notation is converted to a literal character by TeX's input
scanning routine, so it behaves exactly as if it were that character
itself. And ^^f0 in Latin-1 (or Unicode) is the ð character. So this
definition works exactly the same as if you were to say

  \catcode`ð=\active \defð{ð}

which is clearly recursive.

Given that you don't need to remap ð in the input to some other
Unicode character for printing, there should be no need for this at
all. The only reason to use a definition like this would be if the
input text used a *different* character where you want to print eth;
or you want to print something *other* than character F0 for the
input ð.

In general, a "safe" form of the definition would be to use \chardef:

  \catcode`ð=\active \chardefð="F0

This makes ð a character constant that typesets character "F0 without
any further expansion; there is an important difference between this
and ^^f0, which actually "becomes" the character ð itself as the input
is read (and therefore inherits its catcode, definition, etc).
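
For reference, this is the failing test file from above with the
\chardef form substituted for the recursive \def. It is a sketch that
assumes the file is read as UTF-8 by XeTeX and that the current font
has a glyph at "F0:

  % ð is active, but its meaning is a character constant, not a macro,
  % so nothing is expanded when it is used and no loop can occur.
  \catcode`ð=\active \chardefð="F0
  \starttext
  Testing ... ð
  \stoptext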