From: Simon Pepping
Newsgroups: gmane.comp.tex.context
Subject: Re: utf 8 / test file
Date: Sun, 8 Dec 2002 21:38:34 +0100
Message-ID: <20021208203834.GA642@scaprea>
References: <5.1.0.14.1.20021207123223.0254e040@server-1>
In-Reply-To: <5.1.0.14.1.20021207123223.0254e040@server-1>
Reply-To: ntg-context@ntg.nl
Original-To: ntg-context@ntg.nl
List-Id: mailing list for ConTeXt users

Hans,

I have looked at how Emacs and Unicode browser deal with Unicode and
fonts. Unicode browser is an application on the CD-ROM that comes with
the Unicode 3.0 book. They both use font sets, i.e., collections of
fonts that are put together so as to cover a large part of the Unicode
range.

The Unicode browser scans the fonts in the order listed in its
configuration file. When it finds a font that provides the sought
character, it uses the glyph from that font. It is possible to refine
the configuration: one can indicate that a font only contributes a
certain range, and one can exclude a range from a font.

I believe this is a strategy that could be used by other applications.
For ConTeXt this might be worked out as follows:

Each font family must be in a known encoding. When a font family is
loaded, the encoding and the associated font family are added to a
table of loaded encodings. When a Unicode character is sought, the
loaded encodings are scanned in the order in which they appear in the
table, until an encoding is found that provides a glyph for that
character.

It is possible that two font families are loaded that overlap in the
range covered. Then the glyphs in the overlap area are taken from the
font family loaded first. This behaviour can be changed by configuring
a font to contribute only a certain range of characters, or by
excluding a certain range of characters from a font. This is a
refinement that might be added later on.

The NFSS in LaTeX provides a default encoding for a character (not to
be confused with ConTeXt's default encoding, which is a different
thing). When the character is not found in the current encoding, it is
taken from this default encoding.
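
As a rough illustration of the table scan described above, here is a
minimal Python sketch. It is not ConTeXt or NFSS code: the names
LoadedEncoding, provides and find_family, the two example families and
their code point ranges are all made up for this example. It only shows
the idea that the first loaded encoding providing a character wins, and
that "only" and "exclude" ranges can narrow what a family contributes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class LoadedEncoding:
    """One row of the table of loaded encodings (hypothetical structure)."""
    family: str                                          # font family name
    chars: Set[int]                                      # code points with a glyph
    only: List[range] = field(default_factory=list)      # contribute only these ranges
    exclude: List[range] = field(default_factory=list)   # never contribute these ranges

    def provides(self, cp: int) -> bool:
        """True if this family may supply a glyph for code point cp."""
        if cp not in self.chars:
            return False
        if self.only and not any(cp in r for r in self.only):
            return False
        return not any(cp in r for r in self.exclude)


def find_family(table: List[LoadedEncoding], cp: int) -> Optional[str]:
    """Scan the table in load order; the first family that provides cp wins."""
    for enc in table:
        if enc.provides(cp):
            return enc.family
    return None  # no loaded family has a glyph: report a missing font


# Two invented families whose coverage overlaps in the ASCII range.
latin = LoadedEncoding("latin-family", set(range(0x20, 0x250)))
greek = LoadedEncoding(
    "greek-family", set(range(0x20, 0x7F)) | set(range(0x370, 0x400))
)
table = [latin, greek]
```

With this table, an ASCII letter is taken from the family loaded first,
a Greek letter from the second family, and a character that no family
covers yields None, which is the point where an error message about a
missing font would be issued. Giving the first family an exclude range
for ASCII would hand those characters to the second family instead.
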
Such a default-encoding strategy may be more efficient than going
through the list of loaded encodings.

The above strategy may be efficient for a text that mainly consists of
ASCII characters. For a text that mainly consists of non-ASCII
characters, e.g. a Chinese text, it requires much processing. Such a
situation may be dealt with like encodings: when you are writing in a
West European language, it is more efficient to use Latin-1 than UTF-8.
Similarly, when one is writing in Chinese, a more efficient setup with
a more limited coverage of characters may be used.

I prefer to use font families rather than fonts. This makes it easy to
switch from one font family to another while keeping the other font
parameters, such as shape and weight, constant. I like the way this is
done in LaTeX's NFSS. I do not (yet) know much about the way ConTeXt
organizes its fonts.

One should be aware of the difference between character and glyph.
Unicode is about characters; typesetters like TeX are about glyphs. It
is quite possible that one font provides several variant glyphs for one
and the same Unicode character. The user must have some way to express
a preference for one or the other.

I think the user should load the appropriate input regime, since only
the user knows the encoding of the input file. For XML files it is
different; in DocbookInContext I will try to load the appropriate input
regime automatically from the encoding mentioned in the XML
declaration.

Configuring an appropriate font set is difficult. Perhaps font sets
should be preconfigured, and fonts should be loaded as available. Good
error messages when no font provides a glyph for a character in the
text document should alert the user to missing fonts.

These are my thoughts.

Simon

On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
> Hi,
>
> I posted
>
> http://www.pragma-ade.com/temp/titus.pdf
>
> now, one thing with unicode (utf) is that support needs to have an
> associated font / language switch.
>
> Traditionally, tex font mechanisms have been complicated by the fact that
> there are many shapes per font and math has to be dealt with.
>
> If we're dealing with say sanskrit, is it then safe to assume that
>
> (1) we can switch to the language (if not yet done) when we encounter a
> unicode from the associated char/glyph range
>
> (2) can we assume that a relatively simple font mechanism is used
> (normal,bold,slanted)
>
> (3) can we assume that only a few (possibly derived from unicode) fonts are
> used, or at least one main type of font per language
>
> (4) can we standardize on utf-8 [and assume some preprocessor if not]
>
> [let's try to deal with the practical, so what's the practical usage]
>
> Hans
> -------------------------------------------------------------------------
> Hans Hagen | PRAGMA ADE | pragma@wxs.nl
> Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
> tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
> -------------------------------------------------------------------------
> information: http://www.pragma-ade.com/roadmap.pdf
> documentation: http://www.pragma-ade.com/showcase.pdf
> -------------------------------------------------------------------------
>
> _______________________________________________
> ntg-context mailing list
> ntg-context@ntg.nl
> http://www.ntg.nl/mailman/listinfo/ntg-context

--
Simon Pepping
email: spepping@scaprea.hobby.nl