ntg-context - mailing list for ConTeXt users
* unicode and out-of-box usability
@ 2004-01-02 17:59 Adam Lindsay
  2004-01-03 22:38 ` Hans Hagen
  0 siblings, 1 reply; 10+ messages in thread
From: Adam Lindsay @ 2004-01-02 17:59 UTC (permalink / raw)


Hi, folks,

I've been struggling through, trying to learn Unicode in ConTeXt. It's
been instructive, at least. (Hope to make a MyWay about it...)
There are a few weird things that made it difficult to learn, and I was
wondering if someone could help explain why things are the way they are.

In unic-ini:
\chardef\utfunihashmode=0 % 1 = enabled 

Actually, if I understand things correctly, '1' means "disabled", which
is what I preferred, having not yet created any unicode vectors. So the
internal documentation there seems wrong, and I would argue the default
case (0) makes it harder for beginners.


More confusingly, in font-uni:
\def\enableunicodefont#1%
  {\definefontsynonym[\s!Unicode][\getvalue{\??uc#1\c!file}]%
   \def\unicodescale             {\getvalue{\??uc#1\c!schaal}}%
   \def\unicodeheight            {\getvalue{\??uc#1\c!hoogte}}%
   \def\unicodedepth             {\getvalue{\??uc#1\c!diepte}}%
   \def\unicodedigits            {\getvalue{\??uc#1\c!conversie}}%
   \def\handleunicodeglyph       {\getvalue{\??uc#1\c!commando}}%
%%%%%%%%%%% NEXT LINE
   \enableregime[unicode]% the following \relax's are realy needed
   \doifvalue{\??uc#1\c!interlinie}\v!ja\setupinterlinespace\relax
   \getvalue{\??uc#1\c!commandos}\relax}

The \enableregime[unicode] runs in direct opposition with the
\enableregime[utf] that normally goes at the start of (some of my)
documents. As it stands, with the regime hard-coded, users have to put an
\enableregime[utf] *after* the font declaration. That's awkward.


The last proposed change/complaint is back in unic-ini, and came from my
attempts to match the main body font with the unicode font. 

\def\utfunifontglyph#1%
  {\xdef\unidiv{\number\utfdiv{#1}}%
   \xdef\unimod{\number\utfmod{#1}}%
   \ifnum#1<\utf@i
%%%% \unicodeasciicharacter\unimod
     \char\unimod % \unicodeascii\unimod
   \else\ifcsname\@@univector\unidiv\endcsname
     \csname\doutfunihash{\unidiv}{#1}\endcsname
   \else % so, these can be different fonts !
     \unicodeglyph\unidiv\unimod % no \uchar (yet)
   \fi\fi}

Basically, I'd like to use the \unicodeasciicharacter hook with this
definition:

\def\unicodeasciicharacter{\uchar{0}}

(I'm not certain the above is release-quality code, but I've been testing
it with a stripped down \utfunifontglyph that should be functionally
equivalent.)
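
A small plain TeX sketch of the hook, for readers who want to trace it. It assumes that \utfdiv and \utfmod split a code point into a vector number (code point div 256) and a slot (code point mod 256), which this message does not spell out. \splitunicode, \unidivcount and \unimodcount are hypothetical names, and \uchar is redefined here as a printing stand-in, so this should be run in plain TeX rather than inside ConTeXt.

```tex
% Plain TeX sketch; \splitunicode, \unidivcount and \unimodcount are
% hypothetical names, and \uchar below is a tracing stand-in, not ConTeXt's.
\newcount\unidivcount
\newcount\unimodcount

% Assumed split: vector = code point div 256, slot = code point mod 256.
\def\splitunicode#1{%
  \unidivcount=#1\relax
  \divide\unidivcount by 256
  \unimodcount=\unidivcount
  \multiply\unimodcount by -256
  \advance\unimodcount by #1\relax}

% Stand-in that only prints its two arguments.
\def\uchar#1#2{[vector #1, slot #2]}
% The proposed hook: the vector is fixed to 0, the slot is supplied later.
\def\unicodeasciicharacter{\uchar{0}}

\splitunicode{937}% U+03A9: vector 3, slot 169
\uchar{\the\unidivcount}{\the\unimodcount}

\splitunicode{65}% U+0041: vector 0, slot 65
\edef\unimod{\the\unimodcount}
\unicodeasciicharacter\unimod % expands to \uchar{0}\unimod

\bye
```

The last call shows why the one-line definition is enough: \unicodeasciicharacter expands to \uchar{0}, and the following \unimod is picked up as the second argument.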

Working with the unicode code makes me appreciate that it's a really
powerful part of ConTeXt. Thanks, Hans!

gelukkig nieuwjaar,
adam
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Adam T. Lindsay                      atl@comp.lancs.ac.uk
 Computing Dept, Lancaster University   +44(0)1524/594.537
 Lancaster, LA1 4YR, UK             Fax:+44(0)1524/593.608
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


* Re: unicode and out-of-box usability
  2004-01-02 17:59 unicode and out-of-box usability Adam Lindsay
@ 2004-01-03 22:38 ` Hans Hagen
  2004-01-05 13:43   ` Adam Lindsay
  2004-01-05 14:23   ` My Way: Unicode Symbols Adam Lindsay
  0 siblings, 2 replies; 10+ messages in thread
From: Hans Hagen @ 2004-01-03 22:38 UTC (permalink / raw)


At 18:59 02/01/2004, you wrote:

>I've been struggling through, trying to learn Unicode in ConTeXt. It's
>been instructive, at least. (Hope to make a MyWay about it...)

good

>There are a few weird things that made it difficult to learn, and I was
>wondering if someone could help explain why things are the way they are.
>
>In unic-ini:
>\chardef\utfunihashmode=0 % 1 = enabled
>
>Actually, if I understand things correctly, '1' means "disabled", which
>is what I preferred, having not yet created any unicode vectors. So the
>internal documentation there seems wrong, and I would argue the default
>case (0) makes it harder for beginners.

hm, did you look at the unic-001 etc files? the trick is in fast and efficient
expansion without the need to define lots of named glyphs

>More confusingly, in font-uni:

forget about that one, although it's called unicode, it's actually a
mechanism for the many vectors derived from unicode / related to unicode
but not entirely, i.e. cjk fonts

>\def\enableunicodefont#1%
>   {\definefontsynonym[\s!Unicode][\getvalue{\??uc#1\c!file}]%
>    \def\unicodescale             {\getvalue{\??uc#1\c!schaal}}%
>    \def\unicodeheight            {\getvalue{\??uc#1\c!hoogte}}%
>    \def\unicodedepth             {\getvalue{\??uc#1\c!diepte}}%
>    \def\unicodedigits            {\getvalue{\??uc#1\c!conversie}}%
>    \def\handleunicodeglyph       {\getvalue{\??uc#1\c!commando}}%
>%%%%%%%%%%% NEXT LINE
>    \enableregime[unicode]% the following \relax's are realy needed
>    \doifvalue{\??uc#1\c!interlinie}\v!ja\setupinterlinespace\relax
>    \getvalue{\??uc#1\c!commandos}\relax}
>
>The \enableregime[unicode] runs in direct opposition with the
>\enableregime[utf] that normally goes at the start of (some of my)
>documents. As it stands, with the regime hard-coded, users have to put an
>\enableregime[utf] *after* the font declaration. That's awkward.

so, don't use that mechanism, stick to the utf mechanism

>The last proposed change/complaint is back in unic-ini, and came from my
>attempts to match the main body font with the unicode font.
>
>\def\utfunifontglyph#1%
>   {\xdef\unidiv{\number\utfdiv{#1}}%
>    \xdef\unimod{\number\utfmod{#1}}%
>    \ifnum#1<\utf@i
>%%%% \unicodeasciicharacter\unimod
>      \char\unimod % \unicodeascii\unimod
>    \else\ifcsname\@@univector\unidiv\endcsname
>      \csname\doutfunihash{\unidiv}{#1}\endcsname
>    \else % so, these can be different fonts !
>      \unicodeglyph\unidiv\unimod % no \uchar (yet)
>    \fi\fi}
>
>Basically, I'd like to use the \unicodeasciicharacter hook with this
>definition:
>
>\def\unicodeasciicharacter{\uchar{0}}
>
>(I'm not certain the above is release-quality code, but I've been testing
>it with a stripped down \utfunifontglyph that should be functionally
>equivalent.)

play with it and we'll see

>Working with the unicode code makes me appreciate that it's a really
>powerful part of ConTeXt. Thanks, Hans!

how about the following:

there are many font encodings around but none is really complete enough to
deal with basic unicode (0/1/2 range)

why not define a new font encoding with characters only, so that we can have
as many chars as needed in a 0-255 vector? all those special characters
(registered, and so on) are (1) used seldom, (2) not related to hyphenation
and kerning; it is also a way to get rid of some 'ligatures' like ---
becoming an emdash (in context and xml we can comfortably call symbols
directly, and these may come from a different instance of the font)

Hans   


* Re: unicode and out-of-box usability
  2004-01-03 22:38 ` Hans Hagen
@ 2004-01-05 13:43   ` Adam Lindsay
  2004-01-05 14:23   ` My Way: Unicode Symbols Adam Lindsay
  1 sibling, 0 replies; 10+ messages in thread
From: Adam Lindsay @ 2004-01-05 13:43 UTC (permalink / raw)


Hi Hans. Thanks for the reply.

Hans Hagen said this at Sat, 3 Jan 2004 23:38:02 +0100:

>>In unic-ini:
>>\chardef\utfunihashmode=0 % 1 = enabled
>>
>>Actually, if I understand things correctly, '1' means "disabled", which
>>is what I preferred, having not yet created any unicode vectors. So the
>>internal documentation there seems wrong, and I would argue the default
>>case (0) makes it harder for beginners.
>
>hm, did you look at the unic-001 etc files? the trick is in fast and
>efficient expansion without the need to define lots of named glyphs

I looked at them, but perhaps I didn't "get" it. What I saw was precisely
lots and lots of named glyphs in these hashes. I was setting some greek
and cyrillic text, so was dealing with glyphs that would be described in
a non-existent unic-003, unic-004 and other vectors.

Basically, with utfunihashmode set to zero, I saw lots and lots of black
rectangles, even though I had correctly defined and installed the unicode
fonts. When I disabled it by setting it to one, it worked. That seemed
confusing to a beginner for two reasons: the documentation, and the fact
that fonts are not consulted when characters are not in a hash.
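
The setup described above could look roughly like the following sketch. It is based only on what is reported in this message; putting the \chardef in the document preamble is an assumption, and the sample text requires the source file to be saved as UTF-8 with a suitable Unicode font already installed.

```tex
% Sketch of the reported workaround; placement in the preamble is an assumption.
\chardef\utfunihashmode=1 % reported here to bypass the unic-nnn hashes
\enableregime[utf]

\starttext
αβγ абв % Greek and Cyrillic, which have no unic-003/unic-004 vectors here
\stoptext
```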

>>More confusingly, in font-uni:
>
>forget about that one, although it's called unicode, it's actually a
>mechanism for the many vectors derived from unicode / related to unicode
>but not entirely, i.e. cjk fonts

As you know, I'm somewhat unhealthily obsessed with fonts, and the
\defineunicodefont mechanism seemed to be the only one that 1) allowed
multiple unicode fonts to be defined, and 2) made an effort to
synchronise with document styles. This becomes especially sensitive with
typesetting things like Vietnamese, where you're sliding between named
glyphs that fall into a "normal" encoding vector and glyphs that are in
unicode space. 

I might be wrong on those points, but it seemed as though the UTF-8
mechanism assumed only one font. Japanese can bold things for emphasis,
Greek can be in italic, and there are both Sans and Serif fonts for very
many of the world's languages. On a first pass, the \defineunicodefont
mechanism (aside from the hard-coded \enableregime[unicode]) worked fairly
well for those purposes.

>>\def\unicodeasciicharacter{\uchar{0}}
>>
>>(I'm not certain the above is release-quality code, but I've been testing
>>it with a stripped down \utfunifontglyph that should be functionally
>>equivalent.)
>
>play with it and we'll see

I've been playing with it. It works for me. Wouldn't publicly suggest it
otherwise. :)

>>Working with the unicode code makes me appreciate that it's a really
>>powerful part of ConTeXt. Thanks, Hans!
>
>how about the following:
>
>there are many font encodings around but none is really complete enough to 
>deal with basic unicode (0/1/2 range)
>
>why not define a new font encoding with characters only so that we can have 
>as many chars as needed in a 0-255 vector, all those
>special characters (registered, and so) are (1) used seldom, 

Okay, I think I get what you're saying, a hyper-dense letter-only
encoding. One of the reasons I got into your support for Unicode was that
"seldom"-used is a very relative term that requires a prioritisation of
languages, and that a vector of 256 glyphs just wasn't going to cut it. I
got tired of hand-rolling encodings, and wanted to use someone else's
hard work for a while. :)

Another reason is that Eddie Kohler's OpenType tools use /.notdefs in an
encoding as a place for extended glyphs like swashes and ligatures. I'm
rather fond of those features, and so such a dense encoding is not that
useful to me.

>(2) not related to hyphenation and kerning; it is also a way to get
>rid of some 'ligatures' like --- becoming an emdash (in context and xml we
>can comfortably call symbols directly, and these may come from a
>different instance of the font)

So you would make all punctuation active, essentially? Interesting...

Here's another question for folks working in non-latin languages: how
does hyphenation work?


* My Way: Unicode Symbols
  2004-01-03 22:38 ` Hans Hagen
  2004-01-05 13:43   ` Adam Lindsay
@ 2004-01-05 14:23   ` Adam Lindsay
  2004-01-05 19:20     ` Adam Lindsay
  2004-01-05 20:12     ` Hans Hagen
  1 sibling, 2 replies; 10+ messages in thread
From: Adam Lindsay @ 2004-01-05 14:23 UTC (permalink / raw)


Hi everyone.

There's a new My Way up on my site. It discusses Unicode Symbols and
their support in ConTeXt. Also available for download is a symb-uni file
that accompanies it. I hope to have a Mac-specific support package up on
the site soon, as well as one that discusses Unicode more generally.
<http://homepage.mac.com/atl/tex/>


* Re: My Way: Unicode Symbols
  2004-01-05 14:23   ` My Way: Unicode Symbols Adam Lindsay
@ 2004-01-05 19:20     ` Adam Lindsay
  2004-01-05 20:12     ` Hans Hagen
  1 sibling, 0 replies; 10+ messages in thread
From: Adam Lindsay @ 2004-01-05 19:20 UTC (permalink / raw)


For those who are curious, a demo doc is up on the site as well,
visualising as many of the symbols from symb-uni as I could find...
<http://homepage.mac.com/atl/tex/AppleSymbolsDemo.pdf>

Adam Lindsay said this at Mon, 5 Jan 2004 14:23:28 +0000:

>Hi everyone.
>
>There's a new My Way up on my site. It discusses Unicode Symbols and
>their support in ConTeXt. Also available for download is a symb-uni file
>that accompanies it. I hope to have a Mac-specific support package up on
>the site soon, as well as one that discusses Unicode more generally.
><http://homepage.mac.com/atl/tex/>



* Re: My Way: Unicode Symbols
  2004-01-05 14:23   ` My Way: Unicode Symbols Adam Lindsay
  2004-01-05 19:20     ` Adam Lindsay
@ 2004-01-05 20:12     ` Hans Hagen
  2004-01-06 12:22       ` Adam Lindsay
  2004-01-11 11:30       ` Adam Lindsay
  1 sibling, 2 replies; 10+ messages in thread
From: Hans Hagen @ 2004-01-05 20:12 UTC (permalink / raw)


Hi Adam,

When reading your My Way I noticed a few things:

(1) verbatim xml lines that are too wide can sometimes be handled better as follows:

\starttext

\showXMLwrd[oeps]

\startbuffer
<text>
   <example>
     test <oeps/> test
   </example>
</text>
\stopbuffer

\showXMLbuffer

\stoptext

There is also an inline variant:

test test test test test \showXMLtext {<text> <example> test <oeps/> test
</example> </text>} test test test test

and of course a \showXMLfile{name.xml}

The advantage of this method is that too wide lines are handled better,
i.e. they are properly indented

(2) When formatting an ascii text (i.e. the input source), instead of

  test \test{test[test] test{test}} test

one can say:

  test \test {test [test] test {test}} test

which wraps nicer. So, one may have a space after a \command, and between
{one} {two} arguments (with a few exceptions), etc.

(3) instead of \input symb-uni you can use \usesymbols[uni]

(4) I've just added \startxtyping and \startxxtyping to the style

\definetyping[xtyping] [style=\ttx]
\definetyping[xxtyping][style=\ttxx]
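
Since \definetyping creates a matching \start.../\stop... pair, the smaller variant from (4) could be used as in this sketch. The XSLT line is borrowed from later in this thread purely as sample content, and the definition is repeated only so the example stands alone.

```tex
% Sketch: a smaller verbatim environment for wide lines.
\definetyping[xtyping][style=\ttx]

\starttext

\startxtyping
<xsl:value-of select="map[@charValue=$char-value]/@charName"/>
\stopxtyping

\stoptext
```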

Hans


* Re: My Way: Unicode Symbols
  2004-01-05 20:12     ` Hans Hagen
@ 2004-01-06 12:22       ` Adam Lindsay
  2004-01-06 14:01         ` Hans Hagen
  2004-01-11 11:30       ` Adam Lindsay
  1 sibling, 1 reply; 10+ messages in thread
From: Adam Lindsay @ 2004-01-06 12:22 UTC (permalink / raw)


Hans Hagen said this at Mon, 5 Jan 2004 21:12:16 +0100:

>and of course a \showXMLfile{name.xml}
>
>The advantage of this method is that too wide lines are handled better i.e. 
>they are properly indented

Thanks, Hans.
I tried that a little following your suggestion. It has some trouble with
'$' in the input file, surrounding each one with \bgroup \egroup. It
turns this input:
<xsl:value-of select="map[@charValue=$char-value]/@charName"/> 
  ...into this output...
<xsl:value-of select="map[@charValue=\bgroup \$\egroup char-value]/@charName"/>

XSLT is also kind of special in that changes in whitespace really affect
the output, so people who copy & paste the script (or otherwise try to
learn from the typeset version) might not have very good results. (Less
of a problem because I distribute the script, this time.)

The $ mishandling probably is a bug (using Beta from 18 Dec).

The other suggestions are great. I'll see what I can do to refine it.

adam


* Re: My Way: Unicode Symbols
  2004-01-06 12:22       ` Adam Lindsay
@ 2004-01-06 14:01         ` Hans Hagen
  2004-01-07 15:26           ` Adam Lindsay
  0 siblings, 1 reply; 10+ messages in thread
From: Hans Hagen @ 2004-01-06 14:01 UTC (permalink / raw)


At 13:22 06/01/2004, you wrote:

>The $ mishandling probably is a bug (using Beta from 18 Dec).

can you try:

   \chardef\XMLtokensreduction\plustwo

and see if it's better then?

Hans  


* Re: My Way: Unicode Symbols
  2004-01-06 14:01         ` Hans Hagen
@ 2004-01-07 15:26           ` Adam Lindsay
  0 siblings, 0 replies; 10+ messages in thread
From: Adam Lindsay @ 2004-01-07 15:26 UTC (permalink / raw)


Hans Hagen said this at Tue, 6 Jan 2004 15:01:30 +0100:

>>The $ mishandling probably is a bug (using Beta from 18 Dec).
>
>can you try:
>
>   \chardef\XMLtokensreduction\plustwo
>
>and see if it's better then?

Nope. That one doesn't work. Still get:
select="map[@charValue = \bgroup \$\egroup char-value]"

And this is definitely outside of the ConTeXt bits that I understand.

adam


* Re: My Way: Unicode Symbols
  2004-01-05 20:12     ` Hans Hagen
  2004-01-06 12:22       ` Adam Lindsay
@ 2004-01-11 11:30       ` Adam Lindsay
  1 sibling, 0 replies; 10+ messages in thread
From: Adam Lindsay @ 2004-01-11 11:30 UTC (permalink / raw)


Hans Hagen said this at Mon, 5 Jan 2004 21:12:16 +0100:

>When reading your my way i noticed a few things:

Hi all,
FWIW, the Unicode Symbols My Way is now updated according to comments and
fixes from Hans. The accompanying symbols demo and symb-uni package also
have minor changes, breaking up the largest symbolsets into slightly more
manageable chunks.

<http://homepage.mac.com/atl/tex/>

adam
