Digraphs ij, nj, lj, ch, ... (was: new upload)

From: "Mojca Miklavec" <mojca.miklavec.lists@gmail.com>
To: "mailing list for ConTeXt users" <ntg-context@ntg.nl>
Subject: Digraphs ij, nj, lj, ch, ... (was: new upload)
Date: Mon, 27 Aug 2007 03:57:36 +0200
Message-ID: <6faad9f00708261857i1c6f1213yce11d8596e85da7d@mail.gmail.com>

On 8/26/07, Hans Hagen wrote:

> anyhow, ... new upload to play with

Thanks! "Lj"igatures work much better now ;), but \word lost its
functionality, it seems.

>  > - LM doesn't have any lj, nj, dz, dž, ... (probably another request
>  > for the Polish guys)
>
> hm, just write a small proposal ...

I already did.

> however, dealing with non present chars is to be dealt with anyway
>
>  > - It would be great if MK IV did the transformation from digraphs to
>  > normal letters in case those digraphs are not present in the font
>  > itself (for ij, lj, nj, dz, dž, ... just as it would be great if
>  > ccaron was automatically composed out of c and caron if the letter
>  > wasn't present in that font).
>
> \definefontfeature
> [test][mode=node,language=dflt,script=latn,complement=yes]
>
> {\font\test = lmtypewriter8-regular*test at 12.3pt \test ǉubǉana
> ǈubǉana  ǇUBǇANA }
>
> currently the complement only replaces LATIN/compat combinations (see
> char-def.lua)

(encoding in email screwed up a bit, but no problem)

Great! This works perfectly!

It works as expected for fonts with no such glyphs (and could/should
be added to default font features in my opinion). When tested with
fonts including original glyphs (I was testing with IJ), the original
glyph was used, so that was perfectly OK.
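
To make the idea concrete outside of ConTeXt: a minimal sketch in Python of
the fallback described above, assuming the replacement simply follows the
Unicode compatibility decomposition of a character. The function name
complement and the two "font coverage" sets are made up for illustration;
this is not how MkIV implements the feature (that works from char-def.lua).

```python
import unicodedata

def complement(text, available):
    """Replace characters missing from `available` (a hypothetical set of
    characters the font covers) by their compatibility decomposition,
    but only when every resulting character is itself available."""
    out = []
    for ch in text:
        if ch in available:
            out.append(ch)          # the font has the glyph: use it
            continue
        # NFKD splits e.g. U+01C9 (lj digraph) into "l" + "j"
        parts = unicodedata.normalize("NFKD", ch)
        if parts != ch and all(p in available for p in parts):
            out.append(parts)
        else:
            out.append(ch)          # nothing better to offer, keep as is
    return "".join(out)

# Hypothetical font coverage: plain ASCII letters and the space only.
ascii_font = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ")
# Hypothetical font that additionally contains the IJ/ij digraphs.
dutch_font = ascii_font | {"\u0132", "\u0133"}

print(complement("\u01c9ub\u01c9ana \u01c8ub\u01c9ana \u01c7UB\u01c7ANA", ascii_font))
print(complement("\u0133sselmeer", ascii_font))
print(complement("\u0133sselmeer", dutch_font))
```

With the ASCII-only set the digraphs come out as two plain letters; with the
second set the original ij digraph is kept, which matches what I saw when
testing with IJ.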

>  > Visually there is probably no difference in plain text, except in
>  > exactly the cases for which you're sending the tests (that's casing
>  > and spacing). See http://en.wikipedia.org/wiki/Gaj's_Latin_alphabet
>  > how the word "MJENJACNICA" is split into letters.
>  > Normal people still type n+j in text, not the digraph "ǌ" (nj), but in
>  > case you get some text with those digraphs which are valid Unicode
>  > letters, it would be nice if they were processed ...
>
> dealing with n+j in text is too dangerous to catch, unless we start
> implementing complex language dependent replacements, and even then it's
> messy (what to do when one really wants a nj (two char)) ... so, those
> old docs can best be converted to proper utf then

Hmmm ... pdfTeX with LM already replaces all occurrences of ij with an
ij ligature (noticed when I took a look at the strange kerning between
the two letters - not present with CM). :-Z

Consider the desired output of
    \Word{ijsselmeer} % example taken from Wikipedia, I don't speak Dutch yet ;-)
when writing in Dutch. (\mainlanguage[nl], not when writing in English)
One might like to treat every ij as a single letter and then convert
both I and J to uppercase when asked for that.
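
A rough sketch of that language-dependent behaviour, in Python rather than
ConTeXt. The function capitalize_word and its lang argument are hypothetical,
and I assume here that the rest of the word is lowercased:

```python
def capitalize_word(word, lang):
    """Uppercase the first letter of `word` and lowercase the rest.
    In Dutch ("nl") a leading "ij" counts as one letter, so both
    characters are uppercased."""
    lowered = word.lower()
    if lang == "nl" and lowered.startswith("ij"):
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]

print(capitalize_word("ijsselmeer", "nl"))  # IJsselmeer
print(capitalize_word("ijsselmeer", "en"))  # Ijsselmeer
```

The same kind of rule would be needed for lj, nj and dž in Croatian, except
that there only the first character of the pair is uppercased (Lj, Nj, Dž).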

I read that the Dutch keyboard includes the ij digraph (ĳ), but the
Croatian/Serbian keyboards don't include those digraphs and none of
the cp1250 and iso-8859-2 encodings have it, so everyone writes with
"plain latin letters" - I doubt that anyone uses digraphs at all. (It
would be almost as obscure and inconvenient to write them as if
someone tried to write with "fi" unicode ligatures.)

Yet a third group of people are the Czechs/Slovaks with their digraph
"ch". (You probably remember that one since you had to implement
sorting rules for them for every variant and version of the sorting
mechanism you have ever written.) Unicode doesn't even have place for
it (http://unicode.org/faq/ligature_digraph.html).

In Croatian, nj is always considered to be a single letter. (Even in
foreign words, you would see "Isak Njutn" [Nj-u-t-n] or "Ajnštajn", so
basically no worries about exceptions. Even if there would be some,
one could always say {\language[en] a foreign word with nj})

>  > Another few observations:
>  > - \word doesn't work in XeTeX
>
> no, neither in pdftex i think; new

Currently it doesn't even work in luatex any more :-(

>  > - What exactly is \Words supposed to do (with non-first letters in a
>  > word)?
>
> make first chars uppercase but only when the next is a char; (i changed
> it a bit, defs were not seen (overloaded later by macros)
>
>  > An extra challenge would be to get this work (but unless some Croats
>  > ask you for that or unless you have too much time left, don't bother
>  > about that - it needs slightly more than only lccode and uccode of a
>  > letter since there are three forms: one for lowercase [ljubljana ->
>  > lj], one for all-uppercase words [LJUBLJANA -> LJ] and one for the
>  > first letter of a word starting with an uppercase [Ljubljana -> Lj]):
>  >
>  > In Unicode:
>  >
>  > \word{ǉubǉana} -> ǉubǉana
>  > \Word{ǉubǉana} -> ǈubǉana
>  > \WORD{ǉubǉana} -> ǇUBǇANA
>  >
>  > \word{ǈubǉana} -> ǉubǉana
>  > \Word{ǈubǉana} -> ǈubǉana
>  > \WORD{ǈubǉana} -> ǇUBǇANA
>  >
>  > \word{ǇUBǇANA} -> ǉubǉana
>  > \Word{ǇUBǇANA} -> ǈubǉana
>  > \WORD{ǇUBǇANA} -> ǇUBǇANA
>
> as long as we have utf it's already taken care of

It's not.

On the contrary (written in Latin without ligatures):
\WORD{Lj} -> LJ (let's say it's OK)
\WORD{LJ} -> Lj (wrong)

\word lost its functionality, so I cannot check.

The main problem is that:
- ligature lj is always lowercase
- ligature Lj is uppercase, but only at the beginning of a word where
other letters are lowercase (Ljubljana)
- ligature LJ is uppercase, but only at the beginning of a word where
other letters are uppercase (LJUBLJANA)

This works as long as L and J are two separate letters, but fails in
the example where we have ligatures/digraphs (even the basic
functionality is currently broken).

See http://unicode.org/charts/case/index.html (search for 01C7 on that
page - while most letters only have lowercase and uppercase, those few
also have some kind of "middle" case.)
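
The three forms can be inspected directly from the Unicode data. A small
sketch in Python (not ConTeXt), using only the standard string methods and
the unicodedata module; the helper three_cases is just a name I made up:

```python
import unicodedata

# U+01C7 (LJ), U+01C8 (Lj) and U+01C9 (lj): general category and the
# lowercase, titlecase and uppercase mapping of each of them.
for ch in ("\u01c7", "\u01c8", "\u01c9"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}"
          f" lower=U+{ord(ch.lower()):04X}"
          f" title=U+{ord(ch.title()):04X}"
          f" upper=U+{ord(ch.upper()):04X}")

def three_cases(word):
    """Return the lowercase, titlecase and uppercase form of `word`."""
    return word.lower(), word.title(), word.upper()

# The same three results are expected whatever the casing of the input.
for form in three_cases("\u01c7UB\u01c7ANA"):
    print(form)
```

The "middle" case is what Unicode calls titlecase (general category Lt), and
it is exactly the form that \Word would have to produce for the first letter.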

But again: I have no idea how many people use \Word and \WORD to
capitalize words (I used it in some cases where I modified the macro
to get really fancy beginning of words). Most would probably solve the
problem "manually" anyway. (So don't bother about implementing until
someone really requests it.)

>  > In Latin transcript (in case you have problems seeing some Unicode
>  > letters):
>  >
>  > \word{ljubljana} -> ljubljana
>  > \Word{ljubljana} -> Ljubljana
>  > \WORD{ljubljana} -> LJUBLJANA
>  >
>  > \word{Ljubljana} -> ljubljana
>  > \Word{Ljubljana} -> Ljubljana
>  > \WORD{Ljubljana} -> LJUBLJANA
>  >
>  > \word{LJUBLJANA} -> ljubljana
>  > \Word{LJUBLJANA} -> Ljubljana
>  > \WORD{LJUBLJANA} -> LJUBLJANA

>  >> {\setcharacterkerning[extrakerning]\input zapf\endgraf }
>
> hm, i'm not going to backport everything; keep in mind that these
> features are not font related; actually future mkiv versions will also
> do dynamic feature change so ...

(This has recently been added to XeTeX as well. So it doesn't
necessarily mean "backport the lua functionality", but more "map this
keyword to that XeTeX feature". But I would need to check. Don't worry
about it. If anyone asks, then maybe ...)

Thanks a lot,
    Mojca