ntg-context - mailing list for ConTeXt users
* (Con)TeX(t), Unicode and accented characters
@ 2004-12-20 20:02 Mojca Miklavec
  2004-12-20 20:52 ` Hans Hagen
  2004-12-21  7:56 ` r.ermers
  0 siblings, 2 replies; 6+ messages in thread
From: Mojca Miklavec @ 2004-12-20 20:02 UTC



Here's a short version of my question:

How do I get Unicode-encoded characters (just ordinary accented Latin 
characters) typeset in any font in ConTeXt, the way 
\usepackage[utf8]{inputenc} does in LaTeX?

And here is the long one:

************************************************************************

I don't really understand how accented characters are typeset in 
(Con)TeX(t). One of the main reasons someone gave me for switching to 
LaTeX (maybe 8 years ago) was: "You don't have to worry about accented 
characters. You can make any accented character and it will work all 
over the world." (We actually did have lots of problems with MS Word and 
web browsers at that time.) And it was true.

But when I switched to ConTeXt I ran into that problem again.

In LaTeX I used
     \v{c}\v{s}\v{z}
at first, later
     \usepackage{csz} ... "c"s"z
(which works pretty much the same as "a"o"u in German)
and finally (when someone told me about that possibility)
     \usepackage[utf8]{inputenc} ... čšž

As I didn't know how to use any other font, I always used CMR, the 
default, so I didn't have problems with exotic fonts either.

************************************************************************

But here we come to ConTeXt.
For the German "Umlaut", \"{a}\"{o}\"{u} (äöü), this was satisfactory:

     \useencoding[windows-1250]
     \mainlanguage[de]

For \v{c}\v{s}\v{z} (čšž) this wasn't the case, so a proposed solution 
from another ConTeXt user was:

     % output=pdf -translate-file=cp1250cs
     \setupbodyfont
         [csr,ams,rm]

What I don't really understand: why did the Czech TUG have to design 
*their own font*, csr (or make changes to cmr), if accented characters 
already worked perfectly in plain TeX?

The second problem: This works under Windows when typesetting in code 
page 1250. How can I use accented characters if text is typeset in 
Unicode (or latin2) in Linux?

The third problem: How do I typeset '\v{c}' in some other font? I do 
understand that it may not function in just any font since someone has 
to tell the computer how the accented characters are built, but as long 
as \v{c} works, there's no reason for
     \useencoding[utf8]
and then continuing with unicode encoded characters not to produce the 
desired result.

Thank you,
     Mojca


* Re: (Con)TeX(t), Unicode and accented characters
  2004-12-20 20:02 (Con)TeX(t), Unicode and accented characters Mojca Miklavec
@ 2004-12-20 20:52 ` Hans Hagen
  2004-12-20 21:35   ` Mojca Miklavec
  2004-12-20 22:16   ` VnPenguin
  2004-12-21  7:56 ` r.ermers
  1 sibling, 2 replies; 6+ messages in thread
From: Hans Hagen @ 2004-12-20 20:52 UTC


Mojca Miklavec wrote:

> But when I switched to ConTeXt I ran into that problem again.
> 
> In LaTeX I used
>     \v{c}\v{s}\v{z}

this also works in context

> at first, later
>     \usepackage{csz} ... "c"s"z

in this case, i assume that csz makes " active and such; if you really want that, 
we should make an enco-fcz, with definitions like:

\startlanguagespecifics[cz]

   \appendtoks \makecharacteractive " \to \everynormalcatcodes

   \installcompoundcharacter "c {\v{c}}
   \installcompoundcharacter "s {\v{s}}
   \installcompoundcharacter "z {\v{z}}

\stoplanguagespecifics

and the like; if you want utf, you should say (at the top of the file)

\enableregime[utf]
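
A minimal sketch of a complete test file, assuming the source file itself is saved as UTF-8. The \starttext ... \stoptext wrapper and the sample letters are illustrative additions, not part of the advice above:

```tex
% this file must be saved in UTF-8 by the editor
\enableregime[utf]   % input regime: read the bytes of this file as UTF-8

\starttext
čšž äöü              % typed directly instead of \v{c}\v{s}\v{z} \"{a}\"{o}\"{u}
\stoptext
```

The regime only declares how the input is encoded; which font and font encoding draw the glyphs is a separate matter.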

> As I didn't know how to use any other font, I always used CMR, the 
> default, so I didn't have problems with exotic fonts either.

this should work with all fonts, since there are fallback definitions

>     % output=pdf -translate-file=cp1250cs
>     \setupbodyfont
>         [csr,ams,rm]

try to avoid code pages

> What I don't really understand: why did the Czech TUG have to design 
> *their own font*, csr (or make changes to cmr), if accented characters 
> already worked perfectly in plain TeX?

in cmr \v{s} is actually two characters, while in csr it's one (composed) 
character (built of two characters but seen as one); therefore when you use csr 
fonts, you can get proper hyphenation (which is not the case in cmr, where the 
usage of the \accent primitive spoils the game).
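
A sketch of how to observe this in plain TeX, assuming the standard plain.tex definition of \v (the caron sits in slot 20 of cmr10). The sample word is made up for illustration. \showhyphens writes the hyphenation points it finds to the log, so the two spellings can be compared there:

```tex
% plain TeX; plain.tex defines:  \def\v#1{{\accent20 #1}}
% so \v{s} places the caron from slot 20 of the current font over a separate s

\showhyphens{masina}      % one uninterrupted run of letters
\showhyphens{ma\v{s}ina}  % the \accent construction sits in the middle of the word

\bye
```

With a font that has the accented letter in a single slot, the second word would again be one uninterrupted run of characters for the hyphenation routine.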

next year, when i can assume that the new latin modern fonts are available 
everywhere, i will drop cmr as default cum suis in favor of lsr (which has cmr, 
plr, csr, vnr, aer etc included)

> The second problem: This works under Windows when typesetting in code 
> page 1250. How can I use accented characters if text is typeset in 
> Unicode (or latin2) in Linux?

you probably need to configure your editor to use utf

> The third problem: How do I typeset '\v{c}' in some other font? I do 
> understand that it may not function in just any font since someone has 
> to tell the computer how the accented characters are built, but as long 
> as \v{c} works, there's no reason for
>     \useencoding[utf8]
> and then continuing with unicode encoded characters not to produce the 
> desired result.

don't worry, other fonts work ok; if an encoding does not support the chars you 
need, a composed char is constructed; [font encodings have nothing to do with 
input encoding but they do influence hyphenation]

if i'm right, ec, texnansi, and qx encoding all serve your purpose

Hans

-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
      tel: 038 477 53 69 | fax: 038 477 53 74 | www.pragma-ade.com
                                              | www.pragma-pod.nl
-----------------------------------------------------------------


* Re: (Con)TeX(t), Unicode and accented characters
  2004-12-20 20:52 ` Hans Hagen
@ 2004-12-20 21:35   ` Mojca Miklavec
  2004-12-20 22:16   ` VnPenguin
  1 sibling, 0 replies; 6+ messages in thread
From: Mojca Miklavec @ 2004-12-20 21:35 UTC




Hans Hagen wrote:

> and alike; if you want utf, you should say (at the top of the file)
> 
> \enableregime[utf]

Thanks for all the other advice too, but especially for this one: I had 
probably already tried this. Well, almost ;). Neither 
\enableregime[utf8] nor \enableregime[utf-8] produced the desired 
output. (I was used to writing '8' after utf, since utf-16 and some 
others exist as well.)

Thank you, Mojca


* Re: (Con)TeX(t), Unicode and accented characters
  2004-12-20 20:52 ` Hans Hagen
  2004-12-20 21:35   ` Mojca Miklavec
@ 2004-12-20 22:16   ` VnPenguin
  1 sibling, 0 replies; 6+ messages in thread
From: VnPenguin @ 2004-12-20 22:16 UTC


On Mon, 20 Dec 2004 21:52:17 +0100, Hans Hagen <pragma@wxs.nl> wrote:
> Mojca Miklavec wrote:
[...]
> > The second problem: This works under Windows when typesetting in code
> > page 1250. How can I use accented characters if text is typeset in
> > Unicode (or latin2) in Linux?
> 
> you probably need to configure your editor to use utf

Under Linux I use vim/gvim, gedit, gtk2edit for editing Vietnamese
text in UTF-8 without any problem :)


* Re: (Con)TeX(t), Unicode and accented characters
  2004-12-20 20:02 (Con)TeX(t), Unicode and accented characters Mojca Miklavec
  2004-12-20 20:52 ` Hans Hagen
@ 2004-12-21  7:56 ` r.ermers
  2004-12-21 10:01   ` Adam Lindsay
  1 sibling, 1 reply; 6+ messages in thread
From: r.ermers @ 2004-12-21  7:56 UTC


Mojca,

In reply to your question:

> I don't really understand how accented characters are typeset in
> (Con)TeX(t). One of the main reasons for switching to LaTeX (maybe 8
> years ago) someone mentioned was: "You don't have to worry about accented
> characters. You can make any accented character and it will work all over
> the world." (We actually did have lots of problems with MS Word and web
> browsers at that time.) And it was true.

You know that all characters in a font have a number. If you type a, the font mechanism makes sure that you see an a. In reality the font shows you the character stored at the numerical position of a. In a dingbats font, for example, the character at that position is not an a but a symbol.

In LaTeX the combination \"{a} can mean two things:
1. in most fonts: show the character at a given numerical position, which means that there is one character ä.

2. in some other fonts \"{a} means: combine " with a and make an ä. This means that " is combined with the character at the numerical position of a. TeX does this very well and thus constructs very acceptable accented characters like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

If you have a font which contains \"{q}, \d{o} or some other special characters, you may instruct TeX not to create the character, but rather to show the contents of a given numerical position in that font. That's what the .enc and .fd files under LaTeX are for.

That's also the reason there are, or used to be, special fonts for Polish and Czech and other languages: they contain predefined characters in one single numerical position, e.g. \v{s} and \v{c}, that TeX does not have to create anew from two signs.
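
The two mechanisms (addressing a slot by number, and building an accented letter from two slots) can be sketched in plain TeX. This assumes the default font cmr10, where slot 97 holds the letter a and slot 127 holds the dieresis accent; the slot numbers apply to that font encoding only:

```tex
% plain TeX with its default font cmr10
\char97          % prints whatever glyph sits in slot 97 of the current font: here an a
\char"61         % the same slot, written in hexadecimal

{\accent127 a}   % place the glyph from slot 127 (the dieresis in cmr10) over an a;
                 % this is what plain TeX's \"{a} expands to
\bye
```

In a font that has ä in a slot of its own, the encoding files would instead map \"{a} to a single slot number, and no \accent construction is needed.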

Kind regards,

Robert


* Re: (Con)TeX(t), Unicode and accented characters
  2004-12-21  7:56 ` r.ermers
@ 2004-12-21 10:01   ` Adam Lindsay
  0 siblings, 0 replies; 6+ messages in thread
From: Adam Lindsay @ 2004-12-21 10:01 UTC


r.ermers@hccnet.nl said this at Tue, 21 Dec 2004 08:56:40 +0100:

>In LaTeX the combination \"{a} can mean two things:
>1. in most fonts: show the character at a given numerical position,
>which means that there is one character ä.
>
>2. in some other fonts \"{a} means: combine " with a and make an ä. This
>means that " is combined with the character at the numerical position of
>a. TeX does this very well and thus constructs very acceptable accented
>characters like \"{q}, \d{o}, \v{o}, which do not exist in regular fonts.

Robert,

That's a helpful explanation. I'll try to expand on that in the ConTeXt
case, just in case people are curious or are led into thinking it's just
the same:

In ConTeXt, the combination \"{a} means one thing: \adiaeresis (see
enco-acc). This \adiaeresis can mean one of two things, depending on the encoding:
1. Numerical position, or
2. The fallback case (defined in enco-def), where a diaeresis/umlaut is
placed atop an 'a' glyph. Hyphenation implications as Hans described.

The interesting/helpful thing about ConTeXt is that internally, that
glyph is given a consistent name, no matter how it is input or output.
So, if you type ä in your given input regime, and that encoding is
properly set, that numerical ä (e.g., character #228 in the windows
regime) is mapped to \adiaeresis.
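
As a small illustration (a sketch, assuming the file is saved as UTF-8 and the utf regime is enabled as Hans suggested), the three inputs below should all end up as the same named glyph:

```tex
\enableregime[utf]

\starttext
\"{a}          % accent command
\adiaeresis{}  % the named glyph itself
ä              % raw UTF-8 input, mapped to \adiaeresis by the regime
\stoptext
```

Whether the result is a single font slot or the fallback construction then depends only on the font encoding in use, not on which of the three forms was typed.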

Wanna know what happens in UTF-8? Here's my 'simplified' explanation:
In a UTF-8 bytestream, that character "ä" is signified by two bytes:
0xC3, 0xA4. That first byte triggers a conversion of both bytes into two
different bytes, the actual Unicode number, 0x00 0xE4 (or: 0, 228).
ConTeXt then looks into internal hashes set up (in this case, the
unic-000 vector), looks at the 228th element, and sees that it's \adiaeresis.
Things then proceed as normal. :)
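
The byte arithmetic can be checked inside TeX itself. This is a sketch that assumes an e-TeX-capable engine (for the \numexpr primitive) and covers only the two-byte case: the lead byte 0xC3 (195) contributes its low five bits, 195 - 192 = 3, and the continuation byte 0xA4 (164) contributes its low six bits, 164 - 128 = 36:

```tex
\starttext
% (lead byte - 192) * 64 + (continuation byte - 128)
\the\numexpr (195 - 192) * 64 + (164 - 128) \relax   % typesets 228, i.e. 0xE4
\stoptext
```

This is only the arithmetic, not ConTeXt's actual implementation of the utf regime.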

(It's also interesting to note that for PostScript and TrueType fonts,
that number > name > number (glyph) mapping happens yet again in the
driver. But all that is outside of TeX proper, so to say any more would
be confusing.)
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Adam T. Lindsay, Computing Dept.     atl@comp.lancs.ac.uk
 Lancaster University, InfoLab21        +44(0)1524/510.514
 Lancaster, LA1 4WA, UK             Fax:+44(0)1524/510.492
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
