* active strings in luatex?
From: Idris Samawi Hamid @ 2007-12-28 18:29 UTC
To: mailing list for ConTeXt users

Dear Hans and Taco,

This may be of general interest to the Europeans (and indirectly relates
to the \sh@ft email):

I need the following:

LATIN CAPITAL LETTER L WITH TILDE;004C 0303
LATIN SMALL LETTER L WITH TILDE;006C 0303

The proposal is still under consideration for Lithuanian and not yet in
Unicode.

In luatex can I make a definition such that the string

U004C U0303 (l ̃)

is always treated as l with tilde above, taking into account italics and
without using \~l (which does not work in, e.g., footnotes)?

Best wishes
Idris
--
Professor Idris Samawi Hamid, Editor-in-Chief
International Journal of Shi`i Studies
Department of Philosophy
Colorado State University
Fort Collins, CO 80523

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________

* Re: active strings in luatex?
From: Arthur Reutenauer @ 2008-01-13 3:31 UTC
To: Mailing list for ConTeXt users

[-- Attachment #1: Type: text/plain, Size: 2942 bytes --]

	Hello Idris,

  I didn't see any reply to this e-mail you sent two weeks ago, so I
wanted to give it a try:

> In luatex can I make a definition such that the string
>
> U004C U0303 (l ̃)
>
> is always treated as l with tilde above, taking into account italics and
> without using \~l (which does not work in, e.g., footnotes)?

  What you want here is to support the Unicode combining characters,
which isn't straightforward in TeX because according to the Standard,
they come after the base letter they modify, while TeX's accent commands
are, of course, typed before.  So you can't simply make the combining
characters active and equivalent to the appropriate accent macros.

  In traditional TeX, it would have been tempting to make the base
letter active instead, but this has a lot of drawbacks, and LuaTeX
offers many other possibilities.  Here I've used a set of macros that
Taco wrote a couple of months ago in response to a question by
Thomas Schmitz (see
http://www.ntg.nl/pipermail/ntg-context/2007/027095.html).  The attached
file implements the transformation of the sequence <LATIN SMALL LETTER
L, COMBINING TILDE> into "\buildtextaccent\texttilde l", which I hope
gives the expected result in every circumstance.  I've done it only for
the small letter, but of course it's easy to adapt it to handle the
capital letter as well.

  Finally, I wish to clarify a small misunderstanding: you quoted the
two lines below:

LATIN CAPITAL LETTER L WITH TILDE;004C 0303
LATIN SMALL LETTER L WITH TILDE;006C 0303

with the comment "The proposal is still under consideration for
Lithuanian and not yet in Unicode".  Actually it is already encoded in
Unicode; that is, all the characters you need are present with the
appropriate semantics, and you can accurately represent a small l with
tilde in Unicode; only, you have to use two characters (U+006C followed
by U+0303).  The only thing that will be added to Unicode in that
respect is the *name* of those strings (I guess you took those two lines
from the data files for Unicode version 5.1.0, in beta stage).

  The corresponding precomposed characters, though, will not be added to
Unicode, according to a decision which was made several years ago (I
could trace it back to a discussion at the Unicode Technical Committee
in October 1999, but I don't know the details).  The idea is that the
letter can already be represented as a sequence of characters, and the
Unicode Consortium does not wish to make the set of alphabetic
characters explode with diacritics.  In spite of this, Unicode still
wishes to acknowledge that some unencoded accented letters are important
in some languages, and provides names for the character sequences
representing them, as it does for all the encoded characters.  The
relevant document that explains this is Unicode Standard Annex #34
(http://www.unicode.org/reports/tr34/).

	Arthur

[-- Attachment #2: combining_tilde.tex --]
[-- Type: text/x-tex, Size: 2715 bytes --]

% engine=luatex

% Macros to handle a particular combining sequence of Unicode characters
% in ConTeXt Mark IV by modifying the token list.
% © A. Reutenauer, January 2008.
% This file is distributed under the terms of the WTF Public License
% (http://sam.zoy.org/wftpl/)

\usetypescript[iwona]
\setupbodyfont[iwona, 14pt]

% Convert the sequence <U+006C LATIN SMALL LETTER L, U+0303 COMBINING TILDE>
% to an appropriate ConTeXt representation (\buildtextaccent\texttilde l).
% Strongly influenced by macros by Taco.
% See http://www.ntg.nl/pipermail/ntg-context/2007/027095.html

\def\handletokens[#1][#2]{\ctxlua{collectors.handle("#1", #2)}}
\def\startcombining{\ctxlua{collectors.install("combining", "stopcombining")}}

\startluacode
-- The actual conversion function: we loop over the characters in 'str'.
function convert_combining(str)
  -- l is true if we have just read an 'l'.
  -- t is the list of tokens read thus far.
  local l, t = false, { }
  -- The following should check if we read ‘l’ and ‘combining tilde’
  -- consecutively.  A lot of overhead; it would be much prettier to
  -- implement a finite automaton :-)
  for _, v in ipairs(str) do
    if not l then
      if v[2] == 0x6c -- v is LATIN SMALL LETTER L: set l to true, and hold
      then l = true
      else t[#t+1] = v -- Otherwise, append v to the token list
      end
    else -- l is true
      if v[2] == 0x0303 then -- v is COMBINING TILDE
        -- Found!  Append the ConTeXt sequence for “l with tilde” to t!
        t[#t+1] = token.create('buildtextaccent')
        t[#t+1] = token.create('texttilde')
        t[#t+1] = token.create(0x6c, 11)
        l = false -- Don't forget to set l back to false
      else -- This is annoying: we need to check if v is ‘l’ again.
        t[#t+1] = token.create(0x6c, 11) -- First append the previous ‘l’
        if v[2] == 0x6c -- v is LATIN SMALL LETTER L: start all over again
        then l = true
        else t[#t+1] = v
             l = false
        end
      end -- of "if v is COMBINING TILDE"
    end -- of "if not l"
  end -- of for loop
  -- If the input ended on a held ‘l’, flush it so that it is not lost.
  if l then t[#t+1] = token.create(0x6c, 11) end
  return t
end -- of function
\stopluacode

\def\stopcombining
  {\handletokens[combining][convert_combining]
   \flushtokens[combining]}

% Now we can use \startcombining ... \stopcombining below.

\starttext

% There are two “l with tilde”: one on the second ‘l’ of “Hell̃o”, and the
% other one on “kal̃bame”.  (No, Idris, the Lithuanian radical kalb-
% doesn't mean “dog” ;-)
\startcombining
Hell̃o, world!

Mẽs visì kal̃bame lietùviškai.
\stopcombining

\stoptext

[-- Attachment #3: combining_tilde.pdf --]
[-- Type: application/pdf, Size: 3872 bytes --]

* Re: active strings in luatex?
From: Taco Hoekwater @ 2008-01-13 11:18 UTC
To: Mailing list for ConTeXt users

Arthur Reutenauer wrote:
> 	Hello Idris,
>
>   I didn't see any reply to this e-mail you sent two weeks ago, so I
> wanted to give it a try:
>
>> In luatex can I make a definition such that the string
>>
>> U004C U0303 (l ̃)
>>
>> is always treated as l with tilde above, taking into account italics and
>> without using \~l (which does not work in, e.g., footnotes)?
>
>   What you want here is to support the Unicode combining characters,
> which isn't straightforward in TeX because according to the Standard,
> they come after the base letter they modify,

Which is a fairly annoying syntax for our purpose.

>   -- The following should check if we read ‘l’ and ‘combining tilde’
>   -- consecutively.  A lot of overhead; it would be much prettier to
>   -- implement a finite automaton :-)

Thanks for the reminder. We have been thinking about creating an
lpeg variant that operates on tokens and/or nodes instead of
simple data strings, but that will take quite a bit of work.

It would be possible to simplify the loop logic by storing 'v' in
a local variable, so that t[] always lags behind one value:

function convert_combining(str)
  local l, t = { }, { }   -- l holds the previous token, not yet in t
  for _, v in ipairs(str) do
    if v[2] == 0x0303 and l[2] == 0x6c then
      t[#t+1] = token.create('buildtextaccent')
      t[#t+1] = token.create('texttilde')
      t[#t+1] = l
      l = { }             -- the combining tilde itself is dropped
    else
      if l[2] then
        t[#t+1] = l
      end
      l = v
    end
  end
  if l[2] then            -- flush the last held token
    t[#t+1] = l
  end
  return t
end

Best wishes,
Taco
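
The "lag behind by one value" idea can be tried outside LuaTeX. The following is a minimal Python sketch, not the LuaTeX code itself: it works on plain code points instead of token tables, the names `held` and `render` are made up for illustration, and it assumes that the combining mark should be consumed once the accent commands have been emitted.

```python
SMALL_L = 0x006C          # LATIN SMALL LETTER L
COMBINING_TILDE = 0x0303  # COMBINING TILDE

def convert_combining(text):
    """Rewrite <l, combining tilde> as accent commands placed before the l."""
    out = []      # finished items: code points (int) or command names (str)
    held = None   # the one code point that 'out' lags behind
    for cp in map(ord, text):
        if cp == COMBINING_TILDE and held == SMALL_L:
            out += ["\\buildtextaccent", "\\texttilde", held]
            held = None           # assumption: the mark itself is consumed
        else:
            if held is not None:
                out.append(held)
            held = cp
    if held is not None:          # flush the last held code point
        out.append(held)
    return out

def render(items):
    """Join the result into a readable string (hypothetical helper)."""
    return "".join(i + " " if isinstance(i, str) else chr(i) for i in items)

print(render(convert_combining("kal\u0303bame")))
```

Because only one item is ever held back, no separate flag for "just read an l" is needed, and a trailing l is flushed by the final check.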

* Re: active strings in luatex?
From: Mojca Miklavec @ 2008-01-13 14:38 UTC
To: mailing list for ConTeXt users

I only wanted to add a note:

XeTeX always converts, say, c + combining caron into a ccaron whenever
one exists in the font (and does that at a really low level). If
ccaron doesn't exist (or if there's no such combination in Unicode),
it simply requests both glyphs from the font (and only modern fonts
have those combining glyphs, I assume). In the case of LM, the font has
combining characters with zero width with the accent shifted to the
left, so that it looks OK on an average glyph (but in general, TeX
does a better job with combining characters) unless one requests two
accents.

In my opinion, LuaTeX should also be able to handle such combining
characters somewhere in the early stages (I have never followed the
low-level details of LuaTeX - very often "low level" means "mkiv" for
LuaTeX, so probably this still means that mkiv should handle that).

So, either ccaron or c+combining caron (or l+combining tilde) should
behave the same way:
- if there's such a glyph in the font, use it
- if there is no such glyph, combine the character from c and a caron
(but probably not the combining one! - different fonts have different
ideas of what a combining character should be)

Also, {\v x} and other strange combinations don't work in ConTeXt (I
guess they do in plain TeX), since ConTeXt MK II uses a clever way to
figure out if such characters exist in the font encoding, but
combinations of letters and accents that are not explicitly defined
as working are ruled out, which is a pity.

Mojca

* Re: active strings in luatex?
From: Arthur Reutenauer @ 2008-01-14 0:20 UTC
To: Mailing list for ConTeXt users

> XeTeX always converts, say, c + combining caron into a ccaron whenever
> one exists in the font

  Does it really?  I had understood from the last discussion on the
XeTeX list that it did not, with the example of capital alpha +
combining breathing which was not set correctly.  But maybe it's LaTeX's
fault?

>                                 In the case of LM, the font has
> combining characters with zero width with the accent shifted to the
> left, so that it looks OK on an average glyph

  That's a nice trick, but in the case of 'l', it looks really ugly.

>                                               (but in general, TeX
> does a better job with combining characters) unless one requests two
> accents.

  Sure.

> So, either ccaron or c+combining caron (or l+combining tilde) should
> behave the same way:

  Yes, of course.  This is Unicode canonical equivalence, explained in
the chapters of the Unicode Standard that Idris linked to (chapter 2 is
"General Structure", chapter 3 is "Conformance", and chapter 5,
"Implementation Guidelines", may concern us too).

> Also, {\v x} and other strange combinations don't work in ConTeXt (I
> guess it does in plain TeX) since ConTeXt MK II uses a clever way to
> figure out if such characters exist in the font encoding

  Mapping sequences like "\v c" to the appropriate slot in the current
font encoding is quite legitimate; LaTeX does the same with its own font
encodings.  I didn't know it meant things like "\v x" couldn't be
displayed, though.

  That said, it is something different from supporting combining
characters.

	Arthur
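
Canonical equivalence, as discussed above, can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper name `canonically_equivalent` is made up here. It shows that c + combining caron and the precomposed ccaron are equivalent, and that l + combining tilde has no precomposed form to compose into.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

c_caron_precomposed = "\u010D"   # LATIN SMALL LETTER C WITH CARON
c_caron_sequence = "c\u030C"     # c + COMBINING CARON
l_tilde_sequence = "l\u0303"     # l + COMBINING TILDE

# The two spellings of ccaron are canonically equivalent.
print(canonically_equivalent(c_caron_precomposed, c_caron_sequence))  # True

# NFC composes c + caron into a single code point ...
print(len(unicodedata.normalize("NFC", c_caron_sequence)))            # 1

# ... but leaves l + tilde as two, since no precomposed character exists.
print(len(unicodedata.normalize("NFC", l_tilde_sequence)))            # 2
```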

* Re: active strings in luatex?
From: Arthur Reutenauer @ 2008-01-13 18:51 UTC
To: Mailing list for ConTeXt users

> Thanks for the reminder. We have been thinking about creating an
> lpeg variant that operates on tokens and/or nodes instead of
> simple data strings, but that will take quite a bit of work.

  That sure would be nice.

> It would be possible to simplify the loop logic by storing 'v' in
> a local variable, so that t[] always lags behind one value:

  OK, thanks (but actually I was quite proud to get my code working
already on the second try, so I kept it that non-optimal way ;-)

	Arthur

* Re: active strings in luatex?
From: Hans Hagen @ 2008-01-13 22:59 UTC
To: Mailing list for ConTeXt users

Arthur Reutenauer wrote:
> 	Hello Idris,
>
>   I didn't see any reply to this e-mail you sent two weeks ago, so I
> wanted to give it a try:
>
>> In luatex can I make a definition such that the string
>>
>> U004C U0303 (l ̃)
>>
>> is always treated as l with tilde above, taking into account italics and
>> without using \~l (which does not work in, e.g., footnotes)?
>
>   What you want here is to support the Unicode combining characters,
> which isn't straightforward in TeX because according to the Standard,
> they come after the base letter they modify, while TeX's accent commands
> are, of course, typed before.  So you can't simply make the combining
> characters active and equivalent to the appropriate accent macros.

if i know the precise specs i can build it into the utf collapser, which
is way faster than dealing with tokens (mkiv will not have a token
parser for the main input, at most for dedicated tasks)

Hans

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------

* Re: active strings in luatex?
From: Idris Samawi Hamid @ 2008-01-13 23:25 UTC
To: mailing list for ConTeXt users

On Sun, 13 Jan 2008 15:59:27 -0700, Hans Hagen <pragma@wxs.nl> wrote:

>>> In luatex can I make a definition such that the string
>>>
>>> U004C U0303 (l ̃)
>>>
>>> is always treated as l with tilde above, taking into account italics
>>> and
>>> without using \~l (which does not work in, e.g., footnotes)?
>>
>>   What you want here is to support the Unicode combining characters,
>> which isn't straightforward in TeX because according to the Standard,
>> they come after the base letter they modify, while TeX's accent commands
>> are, of course, typed before.  So you can't simply make the combining
>> characters active and equivalent to the appropriate accent macros.
>
> if i know the precise specs i can build it into the utf collapser, which
> is way faster than dealing with tokens (mkiv will not have a token
> parser for the main input, at most for dedicated tasks)

This may be a good place to start:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

see pages 48--54

don't know if this is precise enough... See also pp.~109--117 of

http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

which seems even more precise. See also

http://www.unicode.org/versions/Unicode5.0.0/UnicodeBookIX.pdf

under "combining"

Best wishes
Idris

* Re: active strings in luatex?
From: Arthur Reutenauer @ 2008-01-13 23:30 UTC
To: Mailing list for ConTeXt users

> if i know the precise specs i can build it into the utf collapser

  I can work that out for you, but we need to think about how to treat
all this consistently, in particular with respect to the questions Mojca
raised:

  · Equivalent sequences need to be treated the same way (c + combining
    caron == ccaron).
  · If we need to compose a glyph out of other glyphs, it may be
    accounted for by some OpenType feature (in particular GPOS 'mark'
    and 'mkmk').
  · If nothing else is available, the good ol' TeX way using \accent is
    still valid, but we need to preserve the original Unicode data in
    the PDF (for searching, etc.).

	Arthur
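
The three points above amount to a fallback chain, which can be sketched in Python as follows. Everything here is hypothetical and only models the decision: the `Font` class, its fields, and the strategy labels are invented for illustration and do not correspond to any LuaTeX or ConTeXt interface.

```python
import unicodedata
from dataclasses import dataclass, field

@dataclass
class Font:
    """Hypothetical stand-in for real font data."""
    glyphs: set = field(default_factory=set)  # code points that have a glyph
    has_mark_positioning: bool = False        # GPOS 'mark'/'mkmk' available

def plan(cluster, font):
    """Pick a rendering strategy for one base-plus-marks cluster."""
    # 1. Equivalent sequences are normalized first, so they are treated alike.
    composed = unicodedata.normalize("NFC", cluster)
    if len(composed) == 1 and ord(composed) in font.glyphs:
        return ("precomposed", composed)
    # 2. Otherwise let the font position the marks, if it can.
    decomposed = unicodedata.normalize("NFD", cluster)
    if font.has_mark_positioning and all(ord(ch) in font.glyphs for ch in decomposed):
        return ("gpos", decomposed)
    # 3. Last resort: a TeX-style accent; the decomposed text is returned
    #    so that the original Unicode data can still be recorded.
    return ("accent", decomposed)

text_font = Font(glyphs={0x63, 0x6C, 0x10D, 0x303, 0x30C}, has_mark_positioning=True)
print(plan("c\u030C", text_font))   # uses the precomposed ccaron glyph
print(plan("l\u0303", text_font))   # falls through to mark positioning
```

Normalizing before the lookup is what makes c + combining caron and ccaron take the same path regardless of how the input was typed.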

* Re: active strings in luatex?
From: Hans Hagen @ 2008-01-14 8:59 UTC
To: Mailing list for ConTeXt users

Arthur Reutenauer wrote:
>> if i know the precise specs i can build it into the utf collapser
>
>   I can work that out for you, but we need to think about how to treat
> all this consistently, in particular with respect to the questions Mojca
> raised:
>
>   · Equivalent sequences need to be treated the same way (c + combining
>     caron == ccaron).
>   · If we need to compose a glyph out of other glyphs, it may be
>     accounted for by some OpenType feature (in particular GPOS 'mark'
>     and 'mkmk').
>   · If nothing else is available, the good ol' TeX way using \accent is
>     still valid, but we need to preserve the original Unicode data in
>     the PDF (for searching, etc.).

(1) mkiv already has (actually it was one of the first things
    implemented) a utf composition handler; this one is initialized
    using the big char table, which has information about the formal
    composition sequences

    an option is to add more to this (like the lcaron); for those who
    want to play with it i added a command (beta upload)

    \definecomposedutf 318 108 126 % lcaron

    keep in mind that this acts on the input, so it may mess up
    definitions that contain l~ sequences; any input processing, or
    token processing (a later stage), is kind of dangerous

(2) it is possible (but there is no handy interface yet, i may make it a
    'context' font feature) to complete a font with all its composed
    chars using virtual font trickery (see mk.pdf), which resolves the
    missing glyph issue

(3) later this year (after mplib) we will pick up a 'glyph not present
    in font' callback that's on our agenda

(4) another option is to deal with it in the node list handlers, but if
    possible i want to avoid this (the more passes, the slower)

[there is already quite some framework present in mkiv, but not always
interfaced; much of this is also used in performance testing and such
and some is reported in mk.pdf]

-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
     tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
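
The warning about input-level composition can be made concrete with a small Python model. This is a sketch under assumptions, not the mkiv collapser: the table layout and the `collapse` function are invented, and the argument order of `\definecomposedutf 318 108 126` is assumed to be composed code point, base, mark.

```python
# (base, mark) -> composed code point. The single entry mirrors the numbers
# in "\definecomposedutf 318 108 126" (assumed order: composed, base, mark).
COMPOSE = {(108, 126): 318}

def collapse(text, table=COMPOSE):
    """Replace every listed (base, mark) pair in the raw input by one character."""
    out = []
    i = 0
    while i < len(text):
        pair = (ord(text[i]), ord(text[i + 1])) if i + 1 < len(text) else None
        if pair in table:
            out.append(chr(table[pair]))
            i += 2                 # both input characters are consumed
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(collapse("kal~bame"))        # the pair becomes U+013E, as intended
print(collapse(r"\def\x{l~m}"))    # but the same happens inside a definition
```

The second call shows the risk: because the replacement runs on the raw input, an `l~` that was meant as a letter followed by a tie inside a macro definition is rewritten as well.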

end of thread (2008-01-14 8:59 UTC)

Thread overview: 10+ messages
2007-12-28 18:29 active strings in luatex? Idris Samawi Hamid
2008-01-13  3:31 ` Arthur Reutenauer
2008-01-13 11:18   ` Taco Hoekwater
2008-01-13 14:38     ` Mojca Miklavec
2008-01-14  0:20       ` Arthur Reutenauer
2008-01-13 18:51     ` Arthur Reutenauer
2008-01-13 22:59   ` Hans Hagen
2008-01-13 23:25     ` Idris Samawi Hamid
2008-01-13 23:30     ` Arthur Reutenauer
2008-01-14  8:59       ` Hans Hagen