UTF-8

ntg-context - mailing list for ConTeXt users

* UTF-8
@ 2001-08-07  5:42 Marco Kuhlmann
  2001-08-08  9:08 ` UTF-8 Hans Hagen
  0 siblings, 1 reply; 9+ messages in thread
From: Marco Kuhlmann @ 2001-08-07  5:42 UTC (permalink / raw)


Is there an input regime for UTF-8? I had a look in the
distribution but only found regi-uni.tex, which seems to
fulfill some other purpose (which?).

    Regards,
    Marco    

-- 
Marco Kuhlmann                             marco.kuhlmann@gmx.net



* Re: UTF-8
  2001-08-07  5:42 UTF-8 Marco Kuhlmann
@ 2001-08-08  9:08 ` Hans Hagen
  2001-08-08 10:06   ` UTF-8 Marco Kuhlmann
  0 siblings, 1 reply; 9+ messages in thread
From: Hans Hagen @ 2001-08-08  9:08 UTC (permalink / raw)
  Cc: ConTeXt ML

At 07:42 AM 8/7/2001 +0200, Marco Kuhlmann wrote:
>Is there an input regime for UTF-8? I had a look in the
>distribution but only found regi-uni.tex, which seems to
>fulfill some other purpose (which?).

the unicode support built into context is used for chinese (input) and pdf 
(output).

in enco-uc you can see a mapping from named glyphs to raw unicode numbers 
[this happens in pdf vectors for searching purposes]

the other way around is supported [there's a lot of low level mapping 
available], but simply not yet defined, so if you give me specs, i can look 
into it.

Hans
-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------



* Re: UTF-8
  2001-08-08  9:08 ` UTF-8 Hans Hagen
@ 2001-08-08 10:06   ` Marco Kuhlmann
  2001-08-08 12:03     ` UTF-8 Taco Hoekwater
  2001-08-08 12:42     ` UTF-8 Hans Hagen
  0 siblings, 2 replies; 9+ messages in thread
From: Marco Kuhlmann @ 2001-08-08 10:06 UTC (permalink / raw)


Hans Hagen wrote (2001-08-08 (11:08)):

> the other way around is supported [there's a lot of low level
> mapping available], but simply not yet defined, so if you give
> me specs, i can look into it.

There is a package ucs for LaTeX (on CTAN) which enables UTF-8
as input encoding (with \usepackage[UTF-8]{inputenc}). It comes
with a quite extensive mapping from UTF-8 to LaTeX commands,
even incorporating add-on packages (like for phonetic script,
for example). It would be nice to have something like this for
ConTeXt as well.

I could try to convert the ucs package to ConTeXt (I need it
for a documentation project) if you pointed me at the right
direction how to do it.

    Marco

-- 
Marco Kuhlmann                             marco.kuhlmann@gmx.net



* Re: UTF-8
  2001-08-08 10:06   ` UTF-8 Marco Kuhlmann
@ 2001-08-08 12:03     ` Taco Hoekwater
  2001-08-08 15:14       ` UTF-8 Hans Hagen
  2001-08-08 12:42     ` UTF-8 Hans Hagen
  1 sibling, 1 reply; 9+ messages in thread
From: Taco Hoekwater @ 2001-08-08 12:03 UTC (permalink / raw)


Marco Kuhlmann wrote:
> I could try to convert the ucs package to ConTeXt (I need it
> for a documentation project) if you pointed me at the right
> direction how to do it.

The mapping to commands is the problem. Parsing UTF-8 is not hard
(see below). 

But there are a *lot* of commands that need to be mapped and keying 
those in is a boring exercise at best. 

My personal approach is based on converting to a 16-bit UCS number 
(yes, that should be changed now for Unicode 3.1), where quite large 
chunks are later flagged as 'unsupported'. 

Most of the characters need special attention in the remapping
to commands, which is a big problem area. 

Anyway, here is my UTF-8 parser. I'm only showing it to illustrate
how the calculations should be done. This code will *not* run.

\newcount\UCS

%D \defchar is just a temporary macro. Explanation is below

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1{%
    \UCS=#1 \advance\UCS-192 \multiply \UCS64
        \scratchcounter=`##1\advance\scratchcounter-128
        \advance\UCS\scratchcounter
        \message{\the\UCS}}}

\processcommalist[194,195,196,197,198,199,200,201,202,203,204,%
                  205,206,207,208,209,210,211,212,213,214,215,%
                  216,217,218,219,220,221,222,223]\defchar 

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2{%
    \UCS=#1 % e.g. 225
    \advance\UCS-224   % start of sequence is 224, so now \UCS = 1
    \multiply \UCS4096 % shift left by (number of arguments*6 bits);
                       % in this case that gives 64*64.
    \scratchcounter=`##1        % first arg is always higher than 127
    \advance\scratchcounter-128 % so go back to real counter value
    \multiply\scratchcounter64  % which occupies 6 bits
    \advance\UCS\scratchcounter % add to total
    \scratchcounter=`##2        % process repeats itself
    \advance\scratchcounter-128 
    \advance\UCS\scratchcounter
    \message{\the\UCS}}}

\processcommalist[224,225,226,227,228,229,230,231,232,233,234,%
                  235,236,237,238,239]\defchar

%D Here is where I stop trying to be reasonable but the general
%D approach is the same as for the two blocks above.
%D 

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3{}}
\processcommalist[240,241,242,243,244,245,246,247]\defchar
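
%D A hedged sketch, not part of the parser above: the four-byte case
%D worked out with the same arithmetic, which is what code points beyond
%D 16 bits (Unicode 3.1) would need. \decodefour is a hypothetical name;
%D it takes the four byte values as numbers, not as characters, and
%D assumes \UCS and \scratchcounter as used above.

\def\decodefour#1#2#3#4{%
    \UCS=#1 \advance\UCS-240 \multiply\UCS262144 % lead byte: 64*64*64
    \scratchcounter=#2 \advance\scratchcounter-128
    \multiply\scratchcounter4096 \advance\UCS\scratchcounter
    \scratchcounter=#3 \advance\scratchcounter-128
    \multiply\scratchcounter64 \advance\UCS\scratchcounter
    \scratchcounter=#4 \advance\scratchcounter-128
    \advance\UCS\scratchcounter
    \message{\the\UCS}}

%D Bytes 240,157,144,128 (hex F0 9D 90 80) give
%D 0 + 29*4096 + 16*64 + 0 = 119808, which is U+1D400.

\decodefour{240}{157}{144}{128}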

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3##4{}}
\processcommalist[248,249,250,251]\defchar

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3##4##5{}}
\processcommalist[252,253]\defchar

%D And these are outright invalid input.

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname{%
    \message{Illegal character}}}

\processcommalist[128,129,130,131,132,133,134,135,136,137,138,139,140,
    141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,
    157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,
    173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,
    189,190,191]\defchar  

%D Missing are 254 and 255: these bytes never occur in UTF-8 (the pair
%D FE FF is the UTF-16 byte order mark).
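
%D As a worked check of the arithmetic (a sketch, not part of the parser:
%D \decodetwo and \decodethree are hypothetical helpers that take byte
%D values as numbers and reuse \UCS and \scratchcounter from above):

\def\decodetwo#1#2{%
    \UCS=#1 \advance\UCS-192 \multiply\UCS64
    \scratchcounter=#2 \advance\scratchcounter-128
    \advance\UCS\scratchcounter
    \message{\the\UCS}}

\def\decodethree#1#2#3{%
    \UCS=#1 \advance\UCS-224 \multiply\UCS4096
    \scratchcounter=#2 \advance\scratchcounter-128
    \multiply\scratchcounter64 \advance\UCS\scratchcounter
    \scratchcounter=#3 \advance\scratchcounter-128
    \advance\UCS\scratchcounter
    \message{\the\UCS}}

%D 195,169 (hex C3 A9) gives 3*64 + 41 = 233, which is U+00E9.
%D 226,130,172 (hex E2 82 AC) gives 2*4096 + 2*64 + 44 = 8364, which
%D is U+20AC.

\decodetwo{195}{169}
\decodethree{226}{130}{172}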

-- 
groeten,

Taco



* Re: UTF-8
  2001-08-08 10:06   ` UTF-8 Marco Kuhlmann
  2001-08-08 12:03     ` UTF-8 Taco Hoekwater
@ 2001-08-08 12:42     ` Hans Hagen
  1 sibling, 0 replies; 9+ messages in thread
From: Hans Hagen @ 2001-08-08 12:42 UTC (permalink / raw)
  Cc: ConTeXt ML

At 12:06 PM 8/8/2001 +0200, Marco Kuhlmann wrote:
>Hans Hagen wrote (2001-08-08 (11:08)):
>
> > the other way around is supported [there's a lot of low level
> > mapping available], but simply not yet defined, so if you give
> > me specs, i can look into it.
>
>There is a package ucs for LaTeX (on CTAN) which enables UTF-8
>as input encoding (with \usepackage[UTF-8]{inputenc}). It comes
>with a quite extensive mapping from UTF-8 to LaTeX commands,
>even incorporating add-on packages (like for phonetic script,
>for example). It would be nice to have something like this for
>ConTeXt as well.
>
>I could try to convert the ucs package to ConTeXt (I need it
>for a documentation project) if you pointed me at the right
>direction how to do it.

the main thing is to make the table <num><num>

i'll send you a rough starting point

Hans
-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------



* Re: UTF-8
  2001-08-08 12:03     ` UTF-8 Taco Hoekwater
@ 2001-08-08 15:14       ` Hans Hagen
  0 siblings, 0 replies; 9+ messages in thread
From: Hans Hagen @ 2001-08-08 15:14 UTC (permalink / raw)
  Cc: ntg-context

At 02:03 PM 8/8/2001 +0200, Taco Hoekwater wrote:
>Marco Kuhlmann wrote:
> > I could try to convert the ucs package to ConTeXt (I need it
> > for a documentation project) if you pointed me at the right
> > direction how to do it.
>
>The mapping to commands is the problem. Parse-ing UTF-8 is not hard
>(see below).

this kind of code can be popped into the low level uchar translator / 
mapper, since it is a bit like what happens with chinese: remapping ranges. 
i still must document it [sigh, i promised wang lei this already many times],

Hans
-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------



* Re: utf-8
  2002-12-07 13:31 utf-8 Simon Pepping
  2002-12-08 13:46 ` utf-8 Simon Pepping
@ 2002-12-08 20:28 ` Hans Hagen
  1 sibling, 0 replies; 9+ messages in thread
From: Hans Hagen @ 2002-12-08 20:28 UTC (permalink / raw)


At 02:31 PM 12/7/2002 +0100, you wrote:
>The new beta works fine.
>
>\unknownchar must be enclosed in braces, to prevent error messages
>when the unknown char is the only char in a superior or inferior.
>
>Many math symbols (unicode range 0x2200) are still missing. Would they
>be added like this: \definecharacter textSum {\mathematics{\Sum}} ? I
>will experiment somewhat with this later.

since the accented chars and math are kind of special, those vectors need 
to be filled with named glyphs; you can best look into the math-* files, 
since they are symbols, and need to adapt themselves to the math fonts 
(ams,px,tx,lbr,mt,..)

Hans


-------------------------------------------------------------------------
                                   Hans Hagen | PRAGMA ADE | pragma@wxs.nl
                       Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
  tel: +31 (0)38 477 53 69 | fax: +31 (0)38 477 53 74 | www.pragma-ade.com
-------------------------------------------------------------------------
                        information: http://www.pragma-ade.com/roadmap.pdf
                     documentation: http://www.pragma-ade.com/showcase.pdf
-------------------------------------------------------------------------


* Re: utf-8
  2002-12-07 13:31 utf-8 Simon Pepping
@ 2002-12-08 13:46 ` Simon Pepping
  2002-12-08 20:28 ` utf-8 Hans Hagen
  1 sibling, 0 replies; 9+ messages in thread
From: Simon Pepping @ 2002-12-08 13:46 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 474 bytes --]

On Sat, Dec 07, 2002 at 02:31:03PM +0100, Simon Pepping wrote:
> Many math symbols (unicode range 0x2200) are still missing. Would they
> be added like this: \definecharacter textSum {\mathematics{\Sum}} ? I
> will experiment somewhat with this later.

Attached my first efforts. I will work it out later, if there is a
need. How do I define combinations with \not? Can I use the AMS
character names straight away?

Simon

-- 
Simon Pepping
email: spepping@scaprea.hobby.nl

[-- Attachment #2: unic-034.tex --]
[-- Type: application/x-tex, Size: 1687 bytes --]

[-- Attachment #3: uni34test.tex --]
[-- Type: application/x-tex, Size: 534 bytes --]


* utf-8
@ 2002-12-07 13:31 Simon Pepping
  2002-12-08 13:46 ` utf-8 Simon Pepping
  2002-12-08 20:28 ` utf-8 Hans Hagen
  0 siblings, 2 replies; 9+ messages in thread
From: Simon Pepping @ 2002-12-07 13:31 UTC (permalink / raw)


The new beta works fine.

\unknownchar must be enclosed in braces, to prevent error messages
when the unknown char is the only char in a superior or inferior.

Many math symbols (unicode range 0x2200) are still missing. Would they
be added like this: \definecharacter textSum {\mathematics{\Sum}} ? I
will experiment somewhat with this later.

Simon

-- 
Simon Pepping
email: spepping@scaprea.hobby.nl


end of thread, other threads:[~2002-12-08 20:28 UTC | newest]

Thread overview: 9+ messages
2001-08-07  5:42 UTF-8 Marco Kuhlmann
2001-08-08  9:08 ` UTF-8 Hans Hagen
2001-08-08 10:06   ` UTF-8 Marco Kuhlmann
2001-08-08 12:03     ` UTF-8 Taco Hoekwater
2001-08-08 15:14       ` UTF-8 Hans Hagen
2001-08-08 12:42     ` UTF-8 Hans Hagen
2002-12-07 13:31 utf-8 Simon Pepping
2002-12-08 13:46 ` utf-8 Simon Pepping
2002-12-08 20:28 ` utf-8 Hans Hagen
