ntg-context - mailing list for ConTeXt users
* utf 8 / test file
@ 2002-12-07 11:38 Hans Hagen
  2002-12-08 20:38 ` Simon Pepping
  2002-12-09 20:44 ` Simon Pepping
  0 siblings, 2 replies; 9+ messages in thread
From: Hans Hagen @ 2002-12-07 11:38 UTC (permalink / raw)


Hi,

I posted

   http://www.pragma-ade.com/temp/titus.pdf

now, one thing with unicode (utf) is that support needs to have an 
associated font / language switch.

Traditionally, tex font mechanisms have been complicated by the fact that 
there are many shapes per font and math has to be dealt with.

If we're dealing with, say, sanskrit, is it then safe to assume that

(1) we can switch to the language (if not yet done) when we encounter a 
unicode character from the associated char/glyph range

(2) a relatively simple font mechanism is used (normal, bold, slanted)

(3) only a few (possibly derived from unicode) fonts are used, or at 
least one main type of font per language

(4) we can standardize on utf-8 [and assume some preprocessor if not]

[let's try to deal with the practical, so what's the practical usage]
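
To make point (1) concrete, here is a minimal sketch of range-triggered language switching, written in Python rather than TeX purely for illustration. The block ranges (Devanagari U+0900..U+097F, Thai U+0E00..U+0E7F) are the standard Unicode blocks; the language tags, the "latin" default, and the rule that whitespace stays with the current run are assumptions made for this example.

```python
# Hypothetical table: Unicode block range -> language tag (tags are made up).
SCRIPT_RANGES = [
    (0x0900, 0x097F, "sanskrit"),  # Devanagari block
    (0x0E00, 0x0E7F, "thai"),      # Thai block
]

def language_for(codepoint, default="latin"):
    """Return the tag of the first range containing codepoint, else default."""
    for first, last, tag in SCRIPT_RANGES:
        if first <= codepoint <= last:
            return tag
    return default

def split_by_language(text):
    """Group consecutive characters into (tag, run) pairs.

    Whitespace is treated as neutral and stays with the current run,
    so a space between two Devanagari words does not force a switch.
    """
    runs = []
    for ch in text:
        if ch.isspace() and runs:
            tag = runs[-1][0]
        else:
            tag = language_for(ord(ch))
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1] + ch)
        else:
            runs.append((tag, ch))
    return runs

print(split_by_language("abc \u0915\u093e x"))
```

Each run boundary is where a language (and font) switch would be issued. Punctuation shared between scripts would need the same neutral treatment as whitespace.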

Hans
-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------
                        information: http://www.pragma-ade.com/roadmap.pdf
                     documentation: http://www.pragma-ade.com/showcase.pdf
-------------------------------------------------------------------------


* Re: utf 8 / test file
  2002-12-07 11:38 utf 8 / test file Hans Hagen
@ 2002-12-08 20:38 ` Simon Pepping
  2002-12-08 23:26   ` Hans Hagen
  2002-12-09 20:44 ` Simon Pepping
  1 sibling, 1 reply; 9+ messages in thread
From: Simon Pepping @ 2002-12-08 20:38 UTC (permalink / raw)


Hans,

I have looked at how emacs and Unicode browser deal with unicode and
fonts. Unicode browser is an application on the CD-ROM that comes with
the Unicode 3.0 book. They both use font sets, i.e., collections of
fonts that are put together so as to cover a large part of the Unicode
range. The unicode browser scans the fonts in the order listed in its
configuration file. When it finds a font that provides the sought
character, it uses the glyph from that font. It is possible to refine
the configuration: One can indicate that a font only contributes a
certain range. One can exclude a range from a font. I believe this is
a strategy that could be used by other applications.

For Context this might be worked out as follows: Each font family must
be in a known encoding. When a font family is loaded, the encoding and
the associated font family are added to a table of loaded
encodings. When a unicode character is sought, the loaded encodings
are scanned in the order in which they appear in the table, until an
encoding is found that provides a glyph for that character.

It is possible that two font families are loaded that overlap in the
range covered. Then the glyphs in the overlap area are taken from the
font loaded first. This behaviour can be changed by configuring a font
to contribute only a certain range of characters, or to exclude a
certain range of characters from a font. This is a refinement that
might be added later on.
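
A rough sketch of this lookup, in Python for illustration only (it is not how Context or the Unicode browser is implemented). The names FontEntry, only, exclude and find_font are hypothetical, and each encoding is assumed to declare the set of code points it provides.

```python
from dataclasses import dataclass, field

@dataclass
class FontEntry:
    name: str
    covered: set                                   # code points the encoding declares
    only: list = field(default_factory=list)       # optional (first, last) ranges to contribute
    exclude: list = field(default_factory=list)    # optional (first, last) ranges to skip

    def provides(self, cp):
        """True if this entry should supply the glyph for code point cp."""
        if cp not in self.covered:
            return False
        if self.only and not any(a <= cp <= b for a, b in self.only):
            return False
        return not any(a <= cp <= b for a, b in self.exclude)

def find_font(font_set, cp):
    """Scan the entries in load order; the first one providing cp wins."""
    for entry in font_set:
        if entry.provides(cp):
            return entry.name
    return None

# Two families that overlap in the ASCII range.
font_set = [
    FontEntry("serif", set(range(0x20, 0x180))),
    FontEntry("greek", set(range(0x20, 0x7F)) | set(range(0x370, 0x400))),
]
print(find_font(font_set, ord("A")))   # overlap: taken from the font loaded first
print(find_font(font_set, 0x3B1))      # only the second family has Greek alpha
```

Setting exclude on the first entry, e.g. [(0x41, 0x5A)], would hand the capital letters to the second family without changing the load order.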

The NFSS in LaTeX provides a default encoding for a character (not to
be confused with Context's default encoding, which is a different
thing). When the character is not found in the current encoding, it is
taken from this default encoding. Such a strategy may be more
efficient than going through the list of loaded encodings.
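
One way to picture the optimization, as a sketch only and not a description of how NFSS works internally: instead of scanning the list for every character, a table from character to encoding is built once when the encodings are loaded. The function name and the data layout are assumptions.

```python
def build_default_table(loaded_encodings):
    """Map each code point to the first loaded encoding that provides it.

    loaded_encodings: list of (name, set_of_code_points) in load order.
    """
    table = {}
    for name, covered in loaded_encodings:
        for cp in covered:
            table.setdefault(cp, name)   # keep the earlier encoding on overlap
    return table

loaded = [
    ("ec", set(range(0x20, 0x100))),
    ("greek", set(range(0x41, 0x5B)) | {0x3B1}),
]
table = build_default_table(loaded)
print(table.get(0x41), table.get(0x3B1), table.get(0x4E2D))
```

A lookup is then a single table access, at the cost of memory proportional to the number of covered characters.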

The above strategy may be efficient for a text that mainly consists of
ASCII characters. For a text that mainly consists of non-ASCII
characters, e.g. a Chinese text, it requires much processing. Such a
situation may be dealt with like encodings: When you are writing in a
West European language, it is more efficient to use Latin-1 than
utf-8. Similarly, when one is writing in Chinese, a more efficient
setup with a more limited coverage of characters may be used.

I prefer to use font families rather than fonts. This makes it easy to
switch from one font family to another, while keeping constant the
other font parameters such as shape and weight. I like the way this is
done in LaTeX's NFSS. I do not (yet) know much about the way Context
organizes its fonts.

One should be aware of the difference between character and
glyph. Unicode is about characters, typesetters like TeX are about
glyphs. It is very well possible that one font provides several
variant glyphs for one and the same Unicode character. The user must
have some way to express preference for one or the other.

I think the user should load the appropriate input regime, since only
the user knows the encoding of the input file. For XML files it is
different; in DocbookInContext I will try to load the appropriate input
regime automatically from the encoding mentioned in the XML declaration.
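
As an illustration of the XML case (a sketch in Python, not the DocbookInContext code): read the encoding pseudo-attribute from the XML declaration and map it to a regime name. The REGIMES table and its regime names are hypothetical. An XML file without an encoding declaration is assumed to be UTF-8 here, which ignores the UTF-16 case.

```python
import re

# Hypothetical mapping from XML encoding names to input regime names.
REGIMES = {"utf-8": "utf", "iso-8859-1": "latin1", "iso-8859-2": "latin2"}

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def regime_for(xml_bytes, default="utf"):
    """Return a regime name, the default if nothing is declared, or None if unknown."""
    if xml_bytes.startswith(b"\xef\xbb\xbf"):   # skip a UTF-8 byte order mark
        xml_bytes = xml_bytes[3:]
    match = XML_DECL.match(xml_bytes)
    if match is None:
        return default
    name = match.group(1).decode("ascii").lower()
    return REGIMES.get(name)

print(regime_for(b'<?xml version="1.0" encoding="ISO-8859-1"?><book/>'))  # latin1
```

Returning None for an unrecognized encoding leaves it to the caller to warn the user instead of silently guessing.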

Configuring an appropriate font set is difficult. Perhaps font sets
should be preconfigured, and fonts should be loaded as available. Good
error messages when no font provides a glyph for a character in the
text document should alert the user to missing fonts.
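
A small sketch of what such a message could look like, in Python for illustration. The function name and the (name, code point set) layout of the font set are assumptions. unicodedata is the standard Python module for character names.

```python
import unicodedata

def missing_glyph_report(text, font_set):
    """List one message per distinct character that no font in the set provides.

    font_set: list of (font_name, set_of_code_points).
    """
    missing = sorted(
        {ord(ch) for ch in text
         if not any(ord(ch) in covered for _, covered in font_set)}
    )
    return [
        "no font provides U+%04X (%s)" % (cp, unicodedata.name(chr(cp), "unnamed"))
        for cp in missing
    ]

font_set = [("latin", set(range(0x20, 0x7F)))]
for line in missing_glyph_report("a\u0e5bb", font_set):
    print(line)
```

Naming the character, and not only its number, tells the user which script a missing font has to cover.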

These are my thoughts.

Simon

-- 
Simon Pepping
email: spepping@scaprea.hobby.nl


* Re: utf 8 / test file
  2002-12-08 20:38 ` Simon Pepping
@ 2002-12-08 23:26   ` Hans Hagen
  2002-12-09  9:40     ` Taco Hoekwater
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Hans Hagen @ 2002-12-08 23:26 UTC (permalink / raw)


At 09:38 PM 12/8/2002 +0100, you wrote:

>I have looked at how emacs and Unicode browser deal with unicode and
>fonts. Unicode browser is an application on the CD-ROM that comes with
>the Unicode 3.0 book. They both use font sets, i.e., collections of

so i have to buy that book -) what is the best place to get it?

>For Context this might be worked out as follows: Each font family must
>be in a known encoding. When a font family is loaded, the encoding and
>the associated font family are added to a table of loaded
>encodings. When a unicode character is sought, the loaded encodings
>are scanned in the order in which they appear in the table, until an
>encoding is found that provides a glyph for that character.

hm, must think this over, esp since tex has no way (except measuring) to 
determine if a slot is really taken

>It is possible that two font families are loaded that overlap in the
>range covered. Then the glyphs in the overlap area are taken from the
>font loaded first. This behaviour can be changed by configuring a font
>to contribute only a certain range of characters, or to exclude a
>certain range of characters from a font. This is a refinement that
>might be added later on.
>
>The NFSS in LaTeX provides a default encoding for a character (not to
>be confused with Context's default encoding, which is a different
>thing). When the character is not found in the current encoding, it is
>taken from this default encoding. Such a strategy may be more
>efficient than going through the list of loaded encodings.

eh ... context does have fallbacks (nearly always something default, often 
very plain); if something does not show up, it's probably not defined 
(yet); so, maybe i misunderstand you

>The above strategy may be efficient for a text that mainly consists of
>ascii characters. For a text that mainly consists of non-ascii
>characters, e.g. a chinese text, it requires much processing. Such a
>situation may be dealt with like encodings: When you are writing in a
>West European language, it is more efficient to use Latin-1 than
>utf-8. Similarly, when one is writing in chinese, a more efficient
>setup with a more limited coverage of characters may be used.

chinese is even more complicated: there can be mixed utf-like encodings, 
and chars need some kind of postprocessing (adding breakpoints and so on, 
rotation in vertical typesetting, and/or special numbering); this is 
already handled

>I prefer to use font families rather than fonts. This makes it easy to
>switch from one font family to another, while keeping constant the
>other font parameters such as shape and weight. I like the way this is
>done in LaTeX's NFSS. I do not (yet) know much about the way Context
>organizes its fonts.

the organization is roughly the same as in any tex (a few axes); for 
scripts like chinese, names like SomeNiceFont automatically expand into 
SomeNiceFontBold at a certain size; this is a byproduct of using symbolic 
filenames; it also means a pretty nice way of mixing latin, ideographic, 
and math scripts.

>One should be aware of the difference between character and
>glyph. Unicode is about characters, typesetters like TeX are about
>glyphs. It is very well possible that one font provides several
>variant glyphs for one and the same Unicode character. The user must
>have some way to express preference for one or the other.

i read somewhere that unicode is about scripts -)

you're right; somehow we need to deal with the OpenType language-dependent 
glyphs; pretty nasty

>I think the user should load the appropriate input regime, as he only
>knows the encoding of the input file. For XML files it is different;
>in DocbookInContext I will try to load the appropriate input regime
>automatically from the encoding mentioned in the xml declaration.
>
>Configuring an appropriate font set is difficult. Perhaps font sets
>should be preconfigured, and fonts should be loaded as available. Good
>error messages when no font provides a glyph for a character in the
>text document should alert the user to missing fonts.

Indeed i think that we should have some reasonable defaults, and it seems 
that there are no free complete unicode fonts, so we probably end up with 
something like

<range> => defaultfont

but maybe even with

<subrange> => defaultfont

this needs some research.
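
A sketch of such a two-level mapping, in Python for illustration only. The font names are placeholders and the choice of subrange (the Thai digits U+0E50..U+0E59 inside the Thai block) is just an example. The one rule assumed is that a subrange entry is consulted before the enclosing range.

```python
# (first, last, font) entries; every font name here is a placeholder.
RANGE_DEFAULTS = [
    (0x0000, 0x024F, "latin-default"),
    (0x0370, 0x03FF, "greek-default"),
    (0x0E00, 0x0E7F, "thai-default"),
]
SUBRANGE_DEFAULTS = [
    (0x0E50, 0x0E59, "thai-digits"),   # Thai digits, inside the Thai block
]

def default_font(cp):
    """Return the default font for cp, preferring subranges over ranges."""
    for table in (SUBRANGE_DEFAULTS, RANGE_DEFAULTS):
        for first, last, font in table:
            if first <= cp <= last:
                return font
    return None

print(default_font(0x0E51))   # thai-digits
print(default_font(0x0E5B))   # thai-default
```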

Thanks for your input.

Hans

* Re: utf 8 / test file
  2002-12-08 23:26   ` Hans Hagen
@ 2002-12-09  9:40     ` Taco Hoekwater
  2002-12-09 10:40     ` Re[2]: " Giuseppe Bilotta
  2002-12-09 20:32     ` Simon Pepping
  2 siblings, 0 replies; 9+ messages in thread
From: Taco Hoekwater @ 2002-12-09  9:40 UTC (permalink / raw)


On Mon, 09 Dec 2002 00:26:16 +0100, Hans wrote:

> At 09:38 PM 12/8/2002 +0100, you wrote:
> 
> >I have looked at how emacs and Unicode browser deal with unicode and
> >fonts. Unicode browser is an application on the CD-ROM that comes with
> >the Unicode 3.0 book. They both use font sets, i.e., collections of
> 
> so i have to buy that book -) what is the best place to get it?

www.unicode.org
 

-- 
groeten,

Taco


* Re[2]: utf 8 / test file
  2002-12-08 23:26   ` Hans Hagen
  2002-12-09  9:40     ` Taco Hoekwater
@ 2002-12-09 10:40     ` Giuseppe Bilotta
  2002-12-09 11:30       ` Hans Hagen
  2002-12-09 20:32     ` Simon Pepping
  2 siblings, 1 reply; 9+ messages in thread
From: Giuseppe Bilotta @ 2002-12-09 10:40 UTC (permalink / raw)


Monday, December 9, 2002 Hans Hagen wrote:

HH> hm, must think this over, esp since tex has no way (except measuring) to
HH> determine if a slot is really taken

e-TeX can, IIRC. And since UTF support requires e-TeX anyway ...

-- 
Giuseppe "Oblomov" Bilotta


* Re[2]: utf 8 / test file
  2002-12-09 10:40     ` Re[2]: " Giuseppe Bilotta
@ 2002-12-09 11:30       ` Hans Hagen
  0 siblings, 0 replies; 9+ messages in thread
From: Hans Hagen @ 2002-12-09 11:30 UTC (permalink / raw)


At 11:40 AM 12/9/2002 +0100, you wrote:
>Monday, December 9, 2002 Hans Hagen wrote:
>
>HH> hm, must think this over, esp since tex has no way (except measuring) to
>HH> determine if a slot is really taken
>
>e-TeX can, IIRC. And since UTF support requires e-TeX anyway ...

sure, but that still leaves the check-if-file-exists problem, although a way 
out is to add the tfm paths to the tex search paths

Hans


* Re: utf 8 / test file
  2002-12-08 23:26   ` Hans Hagen
  2002-12-09  9:40     ` Taco Hoekwater
  2002-12-09 10:40     ` Re[2]: " Giuseppe Bilotta
@ 2002-12-09 20:32     ` Simon Pepping
  2 siblings, 0 replies; 9+ messages in thread
From: Simon Pepping @ 2002-12-09 20:32 UTC (permalink / raw)


On Mon, Dec 09, 2002 at 12:26:16AM +0100, Hans Hagen wrote:
> At 09:38 PM 12/8/2002 +0100, you wrote:
> 
> >For Context this might be worked out as follows: Each font family must
> >be in a known encoding. When a font family is loaded, the encoding and
> >the associated font family are added to a table of loaded
> >encodings. When a unicode character is sought, the loaded encodings
> >are scanned in the order in which they appear in the table, until an
> >encoding is found that provides a glyph for that character.
> 
> hm, must think this over, esp since tex has no way (except measuring) to 
> determine if a slot is really taken

My idea was that the encoding should indicate which slots are
provided (if the font complies).
 
> >The NFSS in LaTeX provides a default encoding for a character (not to
> >be confused with Context's default encoding, which is a different
> >thing). When the character is not found in the current encoding, it is
> >taken from this default encoding. Such a strategy may be more
> >efficient than going through the list of loaded encodings.
> 
> eh ... context does have fall backs (nearly always something default, often 
> very plain); if something does not show up, it's probably not defined 
> (yet); so, maybe i misunderstand you

I do not see this as a fallback but as an optimization. It is an
effective means of knowing which encoding is on top for a certain
character.

> Indeed i think that we should have some reasonable defaults, and it seems 
> that there are no free complete unicode fonts, so we probably end up with 
> something

There are apps, e.g. XMLSpy, that rely on a single font to provide all
required characters. I find that a waste of resources; the user's
fonts are used much better if they can be combined into a set.

Simon


* Re: utf 8 / test file
  2002-12-07 11:38 utf 8 / test file Hans Hagen
  2002-12-08 20:38 ` Simon Pepping
@ 2002-12-09 20:44 ` Simon Pepping
  2002-12-10  9:54   ` Hans Hagen
  1 sibling, 1 reply; 9+ messages in thread
From: Simon Pepping @ 2002-12-09 20:44 UTC (permalink / raw)


On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
> Hi,
> 
> I posted
> 
>   http://www.pragma-ade.com/temp/titus.pdf

U+0E5B = \char14:91
\startunicodevector 34

Can these numbers also be given in hexadecimal, e.g., \char "E:"5B?
Unicode data sheets and font layout tables are usually given in
hexadecimal. I find myself converting from hex to decimal and back; it
would be easier to remain in hex.
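
For reference, the conversion I keep doing by hand, as a small Python sketch. It assumes that the two numbers in \char14:91 are the code point divided by 256 and the remainder, which matches U+0E5B. The hexadecimal form is only the notation I am asking about, not existing syntax.

```python
def split_codepoint(cp):
    """Split a code point into (cp // 256, cp % 256)."""
    return divmod(cp, 256)

def char_command(cp, hexadecimal=False):
    """Format a code point as \\char<high>:<low>, in decimal or hex."""
    high, low = split_codepoint(cp)
    if hexadecimal:
        return '\\char "%X:"%X' % (high, low)   # proposed notation, hypothetical
    return "\\char%d:%d" % (high, low)

print(char_command(0x0E5B))                    # \char14:91
print(char_command(0x0E5B, hexadecimal=True))  # \char "E:"5B
```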

Simon


* Re: utf 8 / test file
  2002-12-09 20:44 ` Simon Pepping
@ 2002-12-10  9:54   ` Hans Hagen
  0 siblings, 0 replies; 9+ messages in thread
From: Hans Hagen @ 2002-12-10  9:54 UTC (permalink / raw)


At 09:44 PM 12/9/2002 +0100, you wrote:
>On Sat, Dec 07, 2002 at 12:38:46PM +0100, Hans Hagen wrote:
> > Hi,
> >
> > I posted
> >
> >   http://www.pragma-ade.com/temp/titus.pdf
>
>U+0E5B = \char14:91
>\startunicodevector 34
>
>Can these numbers also be given in hexadecimal, e.g., \char "E:"5B?
>Unicode data sheets and font layout tables are usually given in
>hexadecimal. I find myself converting from hex to decimal and back; it
>would be easier to remain in hex.

in unic-ini (at the end) change:

\chardef\utfunicommandmode=0 % 0 = decimal, 1 = hex

\def\unicodecommandchar#1#2%
   {\string\char
    \ifcase\utfunicommandmode
      #1:#2\else\lchexnumbers#1:\lchexnumbers#2%
    \fi}

\def\utfunifontcommand#1%
   {\xdef\unidiv{\number\utfdiv{#1}}%
    \xdef\unimod{\number\utfmod{#1}}%
    \ifnum#1<\utf@i
      \unicodecommandchar\unidiv\unimod
    \else\ifcsname\@@univector\unidiv\endcsname
      \@EA\string\csname\doutfunihash{\unidiv}{#1}\endcsname
    \else
      \unicodecommandchar\unidiv\unimod
    \fi\fi}
