From: Joey McCollum
Newsgroups: gmane.comp.tex.context
Subject: Fwd: Unicode normalization and Hebrew in ConTeXt
Date: Tue, 28 Apr 2020 12:16:34 -0400
Reply-To: mailing list for ConTeXt users
To: ntg-context@ntg.nl

Thank you for the prompt and thorough response!

If the reorderings have to be done for each pair of characters in different combining classes that are not in the expected typographical order, then there will be a lot (probably hundreds) of substitution rules. I am not very familiar with coding in Lua, but if there is a way to add substitution features for specific classes of points, then that would require far fewer rules.

Unicode's canonical ordering of Hebrew marks is based on their combining classes: in canonical order, characters in higher combining classes are sorted after those in lower combining classes. The typographically recommended ordering of certain characters is found in Table 1 (p. 12) of https://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf. The following list of character classes, with information about their Unicode combining classes (which I retrieved from the Lua script https://raw.githubusercontent.com/michal-h21/uninormalize/master/char-def-with-ccc.lua), is numbered to match the character classes described in that table:

1. The consonants (Unicode points 05D0-05EA) have no combining class (that is, class 0) and are never reordered; this is typographically correct.
2. Shin dot and sin dot (05C1-05C2) should be next, but Unicode places them in combining classes 24 and 25, after the characters in recommended classes 3-5 and many of the characters in recommended class 6.
3. Dagesh / mapiq (05BC) should be next, but Unicode assigns it a combining class of 21. This means that after Unicode normalization it will be incorrectly ordered before characters in recommended class 2 and after characters in recommended class 5 and those characters in recommended class 6 that have combining classes 10-18 and 20.
4. Rafe (05BF) should be next, but Unicode assigns it a combining class of 23. Thus, after Unicode normalization it will be correctly placed after characters in recommended class 3, but incorrectly placed before characters in recommended class 2.
5. The holam and holam haser vowel points (05B9-05BA) should be next, but Unicode places them in combining class 19. This means that after Unicode normalization they will be placed incorrectly before characters in recommended classes 2-4 and after those characters in recommended class 6 that have combining classes 10-18.
6. The characters in 0591, 0596, 059B, 05A2-05A7, 05AA, 05B0-05B8, 05BB, 05BD, 05C5, 05C7 should be treated as being in the same class, but Unicode places them in combining classes 10-18, 20, 22, and 220.
7. The prepositive marks yetiv and dehi (059A, 05AD) should be next; Unicode places them in combining class 222, so they should correctly come after all characters in recommended classes 1-6.
8. The characters 0307, 0593-0595, 0597-0598, 059C-05A1, 05A8, 05AB-05AC, 05AF, 05C4 should be treated as being in the same class; Unicode places them in combining class 230, so they should correctly come after all characters in recommended classes 1-7.
9. The postpositive marks segolta, pashta, telisha qetana, and zinor (0592, 0599, 05A9, 05AE) should be next; Unicode places them in combining class 230, so they will need to be reordered after the characters in recommended class 8.
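
To make the list above concrete, here is a rough sketch in plain Lua of the class-based approach I have in mind. Please treat it only as an illustration: the names `ranges`, `recommended`, and `reorder_marks` are my own inventions, the code works on plain arrays of code points rather than on ConTeXt's fonts or node lists, and I have not tried to hook it into ConTeXt. It expands recommended classes 2-9 into a lookup table and then stably sorts each run of marks by recommended class, so a handful of class entries would stand in for the many pairwise rules.

```lua
-- Illustration only: every name here is hypothetical, not ConTeXt API.
-- Recommended classes 2-9 (Table 1 of the SBL Hebrew manual), given as
-- inclusive code point ranges. Class 1 (the consonants) is never reordered.
local ranges = {
    [2] = { {0x05C1, 0x05C2} },
    [3] = { {0x05BC, 0x05BC} },
    [4] = { {0x05BF, 0x05BF} },
    [5] = { {0x05B9, 0x05BA} },
    [6] = { {0x0591, 0x0591}, {0x0596, 0x0596}, {0x059B, 0x059B},
            {0x05A2, 0x05A7}, {0x05AA, 0x05AA}, {0x05B0, 0x05B8},
            {0x05BB, 0x05BB}, {0x05BD, 0x05BD}, {0x05C5, 0x05C5},
            {0x05C7, 0x05C7} },
    [7] = { {0x059A, 0x059A}, {0x05AD, 0x05AD} },
    [8] = { {0x0307, 0x0307}, {0x0593, 0x0595}, {0x0597, 0x0598},
            {0x059C, 0x05A1}, {0x05A8, 0x05A8}, {0x05AB, 0x05AC},
            {0x05AF, 0x05AF}, {0x05C4, 0x05C4} },
    [9] = { {0x0592, 0x0592}, {0x0599, 0x0599}, {0x05A9, 0x05A9},
            {0x05AE, 0x05AE} },
}

-- Expand the ranges into a flat lookup: code point -> recommended class.
local recommended = {}
for class = 2, 9 do
    for _, r in ipairs(ranges[class]) do
        for cp = r[1], r[2] do
            recommended[cp] = class
        end
    end
end

-- Return a copy of the code point array in which every run of marks is
-- sorted by recommended class. An insertion sort is used because it is
-- stable: marks in the same class keep their input order. Any code point
-- without an entry (a consonant, a space, ...) ends the current run.
local function reorder_marks(cps)
    local out = {}
    for i = 1, #cps do out[i] = cps[i] end
    local start = nil
    for i = 1, #out do
        local class = recommended[out[i]]
        if class then
            start = start or i
            local j = i
            while j > start and recommended[out[j - 1]] > class do
                out[j], out[j - 1] = out[j - 1], out[j]
                j = j - 1
            end
        else
            start = nil
        end
    end
    return out
end

-- bet + segol + dagesh + final nun, i.e. Unicode canonical order
local fixed = reorder_marks({ 0x05D1, 0x05B6, 0x05BC, 0x05DF })
for _, cp in ipairs(fixed) do
    io.write(string.format("%04X ", cp))
end
print() -- prints: 05D1 05BC 05B6 05DF
```

With the example word from my earlier message, the dagesh (class 3) moves in front of the segol (class 6), which is the typographically recommended order. Whether something like this can be expressed through the font feature mechanism, or would have to run at some other stage, is exactly what I do not know.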

This is a lot of information, and I've probably not presented it as clearly as I could, so if anything is confusing, please let me know, and I can try to explain better. If there is any other information you need, please let me know.

Thanks again!

On Tue, Apr 28, 2020 at 9:17 AM Hans Hagen wrote:

> On 4/28/2020 1:59 PM, Joey McCollum wrote:
> > \definefontfeature[f:pointedhebrew][default][
> >     ccmp=yes,
> >     mark=yes,
> >     script=hebr
> > ]
> > \definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]
> > %Set the body font:
> > \setupbodyfont[hebrew]
> > %Set up right-to-left alignment:
> > \setupalign[r2l]
> > \starttext
> >     %Characters after normalization, in Unicode canonical order (bet + segol + dagesh + final nun):
> >     בֶּן
> >
> >     %A word with characters in typographically recommended order (bet + dagesh + segol + final nun):
> >     בֶּן
> > \stoptext
>
> \startluacode
>     fonts.handlers.otf.addfeature {
>         name    = "normalizehebrew",
>         type    = "chainsubstitution",
>         prepend = 1,
>         lookups = {
>             {
>                 type = "multiple",
>                 data = {
>                     [0x5B6] = { 0x5BC, 0x5B6 },
>                 },
>             },
>         },
>         data = {
>             rules = {
>                 {
>                     current = { { 0x5B6 }, { 0x5BC } },
>                     lookups = { 1, 0 },
>                 },
>             },
>         },
>     }
> \stopluacode
>
> \definefontfeature
>    [f:pointedhebrew]
>    [hebrew]
>    [normalizehebrew=yes]
>
> \definefontfamily[hebrew] [rm] [SBL Hebrew] [features=f:pointedhebrew]
>
> \setupbodyfont[hebrew]
>
> \setupalign[r2l]
>
> \starttext
>     בֶּן \quad בֶּן \par
> \stoptext
>
> How many such reorderings are there? (I saw some document about that
> font and it sounds like a bit messy wrt all these input variants.)
>
> (there are several mechanisms in context to deal with such issues, it's
> all about getting specs from users i.e. tex is all about control so in
> principle it should be doable)
>
> Hans
>
> -----------------------------------------------------------------
>                                           Hans Hagen | PRAGMA ADE
>               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
>        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
> -----------------------------------------------------------------