Re: utf 8 / test file

From: Simon Pepping <spepping@scaprea.hobby.nl>
Subject: Re: utf 8 / test file
Date: Sun, 8 Dec 2002 21:38:34 +0100
Message-ID: <20021208203834.GA642@scaprea>
In-Reply-To: <5.1.0.14.1.20021207123223.0254e040@server-1>

Hans,

I have looked at how Emacs and the Unicode browser deal with Unicode
and fonts. The Unicode browser is an application on the CD-ROM that
comes with the Unicode 3.0 book. Both use font sets, i.e., collections
of fonts that are put together so as to cover a large part of the
Unicode range. The Unicode browser scans the fonts in the order listed
in its configuration file. When it finds a font that provides the
sought character, it uses the glyph from that font. The configuration
can be refined: one can indicate that a font contributes only a
certain range, or one can exclude a range from a font. I believe this
is a strategy that could be used by other applications.
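
A minimal sketch of such a font-set lookup, in Python for
illustration only. The class, the font names and the coverage ranges
are all hypothetical; this is not the Unicode browser's actual
configuration format or code.

```python
from dataclasses import dataclass, field

@dataclass
class FontEntry:
    name: str
    coverage: set                                  # code points with a glyph
    only: list = field(default_factory=list)       # (lo, hi); empty = no limit
    exclude: list = field(default_factory=list)    # (lo, hi) ranges to skip

    def provides(self, cp):
        """True if this entry may supply a glyph for code point cp."""
        if cp not in self.coverage:
            return False
        if self.only and not any(lo <= cp <= hi for lo, hi in self.only):
            return False
        return not any(lo <= cp <= hi for lo, hi in self.exclude)

def find_font(font_set, char):
    """Scan the fonts in configuration order; the first provider wins."""
    cp = ord(char)
    for entry in font_set:
        if entry.provides(cp):
            return entry.name
    return None

# Hypothetical fonts: the Greek font also has ASCII glyphs, but is
# configured to contribute only the Greek block.
latin = FontEntry("latin-font", set(range(0x20, 0x250)))
greek = FontEntry("greek-font",
                  set(range(0x20, 0x7F)) | set(range(0x370, 0x400)),
                  only=[(0x370, 0x3FF)])
font_set = [greek, latin]

print(find_font(font_set, "a"))        # latin-font
print(find_font(font_set, "\u03b1"))   # greek-font
```

Although the Greek font is listed first, the range restriction keeps
it from supplying the ASCII letters.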

For ConTeXt this might be worked out as follows: Each font family must
be in a known encoding. When a font family is loaded, the encoding and
the associated font family are added to a table of loaded
encodings. When a Unicode character is sought, the loaded encodings
are scanned in the order in which they appear in the table, until an
encoding is found that provides a glyph for that character.

It is possible that two font families are loaded that overlap in the
range covered. Then the glyphs in the overlap area are taken from the
font loaded first. This behaviour can be changed by configuring a font
to contribute only a certain range of characters, or by excluding a
certain range of characters from it. This is a refinement that
might be added later on.
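
The table of loaded encodings and the first-loaded-wins rule could be
sketched as follows, again in Python for illustration. The encoding
names, family names and slot numbers are invented; nothing here
reflects existing ConTeXt internals.

```python
# Table of loaded encodings, in loading order. Each entry is
# (encoding name, font family, {code point: glyph slot}).
loaded_encodings = []

def load_family(encoding, family, slots):
    """Register a font family and its encoding at load time."""
    loaded_encodings.append((encoding, family, dict(slots)))

def find_glyph(char):
    """Scan the table in loading order; the first provider wins."""
    cp = ord(char)
    for encoding, family, slots in loaded_encodings:
        if cp in slots:
            return encoding, family, slots[cp]
    return None

# Hypothetical data: both families provide "A" (the overlap), but
# only the second one provides Greek alpha.
load_family("enc-latin", "family-one", {0x41: 65, 0xE9: 233})
load_family("enc-greek", "family-two", {0x41: 65, 0x3B1: 97})

print(find_glyph("A"))        # ('enc-latin', 'family-one', 65)
print(find_glyph("\u03b1"))   # ('enc-greek', 'family-two', 97)
```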

The NFSS in LaTeX provides a default encoding for a character (not to
be confused with ConTeXt's default encoding, which is a different
thing). When the character is not found in the current encoding, it is
taken from this default encoding. Such a strategy may be more
efficient than going through the list of loaded encodings.
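
As a rough illustration of a per-character default encoding: one
lookup in the current encoding, and on a miss one more in the default
recorded for that character. The encoding names and their contents
are placeholders, not the real LaTeX tables.

```python
# Placeholder coverage tables: encoding name -> set of code points.
encodings = {
    "main": set(range(0x20, 0x7F)),
    "symbols": {0xA9, 0x20AC},
}

# Per-character default encoding, consulted only on a miss.
default_encoding = {0xA9: "symbols", 0x20AC: "symbols"}

def resolve(char, current):
    """Return the encoding that supplies char, or None."""
    cp = ord(char)
    if cp in encodings[current]:
        return current
    fallback = default_encoding.get(cp)
    if fallback is not None and cp in encodings[fallback]:
        return fallback
    return None

print(resolve("a", "main"))        # main
print(resolve("\u20ac", "main"))   # symbols
print(resolve("\u03b1", "main"))   # None
```

The cost is at most two lookups per character, however many encodings
are loaded.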

The above strategy may be efficient for a text that mainly consists of
ASCII characters. For a text that mainly consists of non-ASCII
characters, e.g. a Chinese text, it requires much processing. Such a
situation may be dealt with in the same way as input encodings: When
you are writing in a West European language, it is more efficient to
use Latin-1 than UTF-8. Similarly, when one is writing in Chinese, a
more efficient setup with a more limited coverage of characters may
be used.
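
To make the cost argument concrete, here is a small sketch that
counts table probes under an ordered scan. The three coverage tests
and both table orders are assumptions chosen for the example, and
counting probes is only a crude stand-in for real processing cost.

```python
def probes(text, table):
    """Count the coverage tests an ordered scan needs for text."""
    total = 0
    for ch in text:
        for covers in table:
            total += 1
            if covers(ord(ch)):
                break
    return total

# Hypothetical coverage tests, one per loaded encoding.
def latin(cp):
    return cp < 0x250

def greek(cp):
    return 0x370 <= cp <= 0x3FF

def cjk(cp):
    return 0x4E00 <= cp <= 0x9FFF

general = [latin, greek, cjk]     # general-purpose order
chinese_setup = [cjk, latin]      # limited setup for Chinese text

chinese = "\u4e2d\u6587\u6392\u7248"    # four CJK ideographs

print(probes("text", general))          # 4
print(probes(chinese, general))         # 12
print(probes(chinese, chinese_setup))   # 4
```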

I prefer to use font families rather than fonts. This makes it easy to
switch from one font family to another, while keeping constant the
other font parameters such as shape and weight. I like the way this is
done in LaTeX's NFSS. I do not (yet) know much about the way ConTeXt
organizes its fonts.

One should be aware of the difference between character and
glyph. Unicode is about characters, typesetters like TeX are about
glyphs. It is quite possible that one font provides several
variant glyphs for one and the same Unicode character. The user must
have some way to express preference for one or the other.
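
One way to picture this: the font maps a single code point to several
named glyphs, and the user's preference picks among them. The glyph
names and preference keys below are hypothetical.

```python
# Hypothetical font table: one character, several glyphs.
glyph_variants = {
    ord("a"): {"default": "a", "smallcaps": "a.sc", "swash": "a.swash"},
    ord("b"): {"default": "b"},
}

def select_glyph(char, preference="default"):
    """Pick the preferred variant, falling back to the default glyph."""
    variants = glyph_variants.get(ord(char))
    if variants is None:
        return None
    return variants.get(preference, variants["default"])

print(select_glyph("a"))               # a
print(select_glyph("a", "smallcaps"))  # a.sc
print(select_glyph("b", "smallcaps"))  # b
```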

I think the user should load the appropriate input regime, since only
the user knows the encoding of the input file. For XML files it is
different; in DocbookInContext I will try to load the appropriate
input regime automatically from the encoding mentioned in the XML
declaration.
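
A sketch of reading the encoding from the XML declaration, in Python
rather than TeX. It only handles ASCII-compatible encodings (no byte
order mark, no UTF-16), and the regime names in the mapping are
assumptions, not checked against ConTeXt. This is not the
DocbookInContext code.

```python
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

# Hypothetical mapping from XML encoding names to input regime names.
REGIMES = {"utf-8": "utf", "iso-8859-1": "latin1"}

def declared_encoding(data, default="utf-8"):
    """Return the declared encoding in lower case, or the default."""
    match = _DECL.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

def regime_for(data):
    """Return the regime to load, or None if the encoding is unknown."""
    return REGIMES.get(declared_encoding(data))

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><article/>'
print(declared_encoding(doc))   # iso-8859-1
print(regime_for(doc))          # latin1
```

A file without an encoding declaration is treated as UTF-8 here,
which is what XML prescribes when there is no byte order mark either.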

Configuring an appropriate font set is difficult. Perhaps font sets
should be preconfigured, and fonts should be loaded as available. Good
error messages when no font provides a glyph for a character in the
text document should alert the user to missing fonts.
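
Such messages might look like the output of the sketch below, which
lists each uncovered character once, with its code point and Unicode
name. The wording of the message and the coverage set are made up for
the example.

```python
import unicodedata

def missing_glyph_report(text, coverage):
    """Return one message per distinct character without a glyph."""
    missing = sorted({ch for ch in text
                      if ord(ch) not in coverage and not ch.isspace()})
    return ["no loaded font provides U+%04X (%s)"
            % (ord(ch), unicodedata.name(ch, "unnamed"))
            for ch in missing]

# Hypothetical coverage: printable ASCII only.
coverage = set(range(0x20, 0x7F))

for line in missing_glyph_report("caf\u00e9 \u03b1", coverage):
    print(line)
# no loaded font provides U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
# no loaded font provides U+03B1 (GREEK SMALL LETTER ALPHA)
```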

These are my thoughts.

Simon

On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
> Hi,
> 
> I posted
> 
>   http://www.pragma-ade.com/temp/titus.pdf
> 
> now, one thing with unicode (utf) is that support needs to have an 
> associated font / language switch.
> 
> Traditionally, tex font mechanisms have been complicated by the fact that 
> there are many shapes per font and math has to be dealt with.
> 
> If we're dealing with say sanskrit, is it then safe to assume that
> 
> (1) we can switch to the language (if not yet done) when we encounter a 
> unicode from the associated char/glyph range
> 
> (2) can we assume that a relatively simple font mechanism is used 
> (normal,bold,slanted)
> 
> (3) can we assume that only a few (possibly derived from unicode) fonts are 
> used, or at least one main type of font per language
> 
> (4) can we standardize on utf-8 [and assume some preprocessor if not]
> 
> [let's try to deal with the practical, so what's the practical usage]
> 
> Hans
> -------------------------------------------------------------------------
>                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
>                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
>  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
> -------------------------------------------------------------------------
>                        information: http://www.pragma-ade.com/roadmap.pdf
>                     documentation: http://www.pragma-ade.com/showcase.pdf
> -------------------------------------------------------------------------
> 
> _______________________________________________
> ntg-context mailing list
> ntg-context@ntg.nl
> http://www.ntg.nl/mailman/listinfo/ntg-context

-- 
Simon Pepping
email: spepping@scaprea.hobby.nl