From: "Adam Lindsay"
Newsgroups: gmane.comp.tex.context
Subject: Re: (Con)TeX(t), Unicode and accented characters
Date: Tue, 21 Dec 2004 10:01:24 +0000
Message-ID: <20041221100124.1568@news.comp.lancs.ac.uk>
References: <47395.62.251.0.62.1103615800.squirrel@62.251.0.62>
Reply-To: mailing list for ConTeXt users

r.ermers@hccnet.nl said this at Tue, 21 Dec 2004 08:56:40 +0100:

>In Latex the combination \"{a} can mean two things:
>1. in most fonts: show the character at a given numerical position,
>which means that there is one character ä.
>
>2. in some other fonts \"{a} means: combine " with a and make an ä. This
>means that " is combined with the character on the numerical position of
>a. TeX does this very well and thus construes very acceptable diacritical
>signs like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

Robert,

That's a helpful explanation. I'll try to expand on that for the ConTeXt
case, in case people are curious or are led into thinking it works just
the same:

In ConTeXt, the combination \"{a} means one thing: \adiaeresis (see
enco-acc). This \adiaeresis can mean one of two things, depending on the
encoding:
1. A numerical position, or
2. The fallback case (defined in enco-def), where a diaeresis/umlaut is
   placed atop an 'a' glyph.

The hyphenation implications are as Hans described.
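
To make the two cases above concrete, here is a minimal Python sketch. It is not how ConTeXt is implemented: the table names, the function `resolve`, and the table contents are all hypothetical stand-ins. It only mirrors the idea that the accent command always resolves to one glyph name, and that the name then resolves either to a slot in the current encoding or to a composed fallback.

```python
# Hypothetical tables for illustration only; these are not ConTeXt's real data.
# Accent command plus base letter -> internal glyph name (the role of enco-acc).
ACCENT_COMMANDS = {('"', "a"): "adiaeresis"}

# Glyph name -> (accent, base) to stack when the encoding has no slot
# for the glyph (the role of the fallbacks in enco-def).
FALLBACKS = {"adiaeresis": ("dieresis", "a")}


def resolve(accent, base, encoding_slots):
    """Resolve an accent command to a slot or to a composed fallback.

    encoding_slots is a hypothetical mapping from glyph name to numerical
    position for the encoding currently in use.
    """
    name = ACCENT_COMMANDS[(accent, base)]
    if name in encoding_slots:
        # Case 1: the encoding has a precomposed glyph at a numerical position.
        return ("slot", name, encoding_slots[name])
    # Case 2: no slot, so the accent is placed over the base glyph.
    return ("composed", name, FALLBACKS[name])


# A toy encoding that has the glyph at position 228, and one that lacks it.
print(resolve('"', "a", {"adiaeresis": 228}))
print(resolve('"', "a", {}))
```

Either way the caller only ever deals with the name `adiaeresis`; the choice between slot and composition is made late, per encoding.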

The interesting/helpful thing about ConTeXt is that internally, that
glyph is given a consistent name, no matter how it is input or output.
So, if you type ä in your given input regime, and that encoding is
properly set, that numerical ä (e.g., character #228 in the windows
regime) is mapped to \adiaeresis.

Wanna know what happens in UTF-8? Here's my 'simplified' explanation:
In a UTF-8 bytestream, that character "ä" is signified by two bytes:
0xC3, 0xA4. That first byte triggers a conversion of both bytes into two
different bytes, the actual Unicode number, 0x00 0xE4 (or: 0, 228).
ConTeXt then looks into internal hashes set up (in this case, the
unic-000 vector), looks at the 228th element, and sees that it's
\adiaeresis. Things then proceed as normal. :)

(It's also interesting to note that for PostScript and TrueType fonts,
that number > name > number (glyph) mapping happens yet again in the
driver. But all that is outside of TeX proper, so to say any more would
be confusing.)
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Adam T. Lindsay, Computing Dept.     atl@comp.lancs.ac.uk
 Lancaster University, InfoLab21        +44(0)1524/510.514
 Lancaster, LA1 4WA, UK             Fax:+44(0)1524/510.492
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
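
The UTF-8 step described in the message can be checked with a little arithmetic. The Python sketch below is only an illustration: the function names and the `VECTORS` table are hypothetical, and the table is a one-entry stand-in for the unic-000 vector, not ConTeXt's actual data structure. It decodes the two bytes 0xC3, 0xA4 into the Unicode number, splits that number into the byte pair (0, 228), and looks up the glyph name.

```python
def decode_two_byte_utf8(b1, b2):
    """Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) to a code point."""
    # Lead bytes 0xC2..0xDF start a valid two-byte sequence; the second
    # byte must be a continuation byte of the form 10yyyyyy.
    if not (0xC2 <= b1 <= 0xDF) or (b2 & 0xC0) != 0x80:
        raise ValueError("not a valid two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


# Hypothetical stand-in for the lookup vectors: high byte -> {low byte -> name}.
# Only the single entry discussed in the message is filled in.
VECTORS = {0x00: {0xE4: "adiaeresis"}}


def glyph_name(b1, b2):
    """Map two UTF-8 bytes to a glyph name via the (high, low) byte pair."""
    code_point = decode_two_byte_utf8(b1, b2)
    high, low = divmod(code_point, 256)  # 0x00E4 -> (0, 228)
    return VECTORS[high][low]


print(decode_two_byte_utf8(0xC3, 0xA4))  # 228
print(glyph_name(0xC3, 0xA4))            # adiaeresis
```

Python's own decoder agrees with the hand computation: `ord(bytes([0xC3, 0xA4]).decode("utf-8"))` is also 228.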