ntg-context - mailing list for ConTeXt users
From: Taco Hoekwater <taco@elvenkind.com>
Subject: Re: UTF-8
Date: Wed, 08 Aug 2001 14:03:50 +0200
Message-ID: <3B712AA6.7335CC12@elvenkind.com>
In-Reply-To: <20010808120630.A526@localhost>

Marco Kuhlmann wrote:
> I could try to convert the ucs package to ConTeXt (I need it
> for a documentation project) if you pointed me at the right
> direction how to do it.

The mapping to commands is the problem. Parsing UTF-8 is not hard
(see below).

But there are a *lot* of commands that need to be mapped and keying 
those in is a boring exercise at best. 

My personal approach is based on converting to a 16-bit UCS number 
(yes, that should be changed now for Unicode 3.1), where quite large 
chunks are later flagged as 'unsupported'. 

Most of the characters need special attention in the remapping
to commands, which is a big problem area. 

Anyway, here is my UTF-8 parser. I'm only showing it so you can
see how the calculations should be done. This code will *not* run.

\newcount\UCS

%D \defchar is just a temporary macro. Explanation is below

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1{%
    \UCS=#1 \advance\UCS-192 \multiply \UCS64
        \scratchcounter=`##1\advance\scratchcounter-128
        \advance\UCS\scratchcounter
        \message{\the\UCS}}}

\processcommalist[194,195,196,197,198,199,200,201,202,203,204,%
                  205,206,207,208,209,210,211,212,213,214,215,%
                  216,217,218,219,220,221,222,223]\defchar 
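
As a worked check of the two-byte arithmetic above, here is a minimal
plain TeX sketch. It is not part of the macros in this mail: the
counter names \codepoint and \trail are made up for the illustration,
and the byte values 195 169 (the UTF-8 encoding of U+00E9) are typed
in as numbers instead of being read from the input.

```tex
% Plain TeX sketch (hypothetical names): decode the byte pair 195 169 by hand.
\newcount\codepoint
\newcount\trail
\codepoint=195 \advance\codepoint by -192 % strip the 110xxxxx marker: 3
\multiply\codepoint by 64                 % make room for 6 more bits: 192
\trail=169 \advance\trail by -128         % strip the 10xxxxxx marker: 41
\advance\codepoint by \trail              % 192 + 41
\message{\the\codepoint}% writes 233, i.e. U+00E9
\bye
```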

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2{%
    \UCS=#1 % e.g. 225
    \advance\UCS-224   % lead bytes start at 224, so now \UCS = 1
    \multiply \UCS4096 % shift left by 6 bits per argument; with two
                       % arguments that gives 64*64.
   \scratchcounter=`##1        % first arg is always higher than 127
   \advance\scratchcounter-128 % so go back to real counter value
   \multiply\scratchcounter64  % which occupies 6 bits
   \advance\UCS\scratchcounter % add to total
   \scratchcounter=`##2        % process repeats itself
   \advance\scratchcounter-128 
   \advance\UCS\scratchcounter
   \message{\the\UCS}}}

\processcommalist[224,225,226,227,228,229,230,231,232,233,234,%
                  235,236,237,238,239]\defchar
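
The three-byte calculation above can be checked by hand in the same
way. This is again only a plain TeX sketch with made-up counter names
(\codepoint, \trail), using the byte values 226 130 172, which encode
U+20AC.

```tex
% Plain TeX sketch (hypothetical names): decode the bytes 226 130 172 by hand.
\newcount\codepoint
\newcount\trail
\codepoint=226 \advance\codepoint by -224 % strip the 1110xxxx marker: 2
\multiply\codepoint by 4096               % two trail bytes follow: 8192
\trail=130 \advance\trail by -128         % first trail byte: 2
\multiply\trail by 64                     % it sits 6 bits up: 128
\advance\codepoint by \trail
\trail=172 \advance\trail by -128         % second trail byte: 44
\advance\codepoint by \trail
\message{\the\codepoint}% writes 8364, i.e. U+20AC
\bye
```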

%D Here is where I stop trying to be reasonable but the general
%D approach is the same as for the two blocks above.
%D 

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3{}}
\processcommalist[240,241,242,243,244,245,246,247]\defchar

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3##4{}}
\processcommalist[248,249,250,251]\defchar

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname##1##2##3##4##5{}}
\processcommalist[252,253]\defchar
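
The empty macros above could be filled in along the same lines. As a
rough sketch for the four-byte case only, assuming plain TeX and
numeric byte values as arguments: \decodefour, \codepoint and \trail
are hypothetical names, not part of the code in this mail. Note that
the result no longer fits in a 16-bit UCS number.

```tex
% Plain TeX sketch (hypothetical names) of the four-byte calculation.
\newcount\codepoint
\newcount\trail
% #1 = lead byte value (240..247), #2..#4 = trail byte values (128..191)
\def\decodefour#1#2#3#4{%
  \codepoint=#1 \advance\codepoint by -240 \multiply\codepoint by 262144
  \trail=#2 \advance\trail by -128 \multiply\trail by 4096
  \advance\codepoint by \trail
  \trail=#3 \advance\trail by -128 \multiply\trail by 64
  \advance\codepoint by \trail
  \trail=#4 \advance\trail by -128
  \advance\codepoint by \trail
  \message{\the\codepoint}}
\decodefour{240}{144}{128}{128}% writes 65536, i.e. U+10000
\bye
```

The multiplier 262144 is 64*64*64, one factor of 64 for each of the
three trail bytes.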

%D And these are outright invalid input.

\def\defchar#1{\expandafter\def\csname UTF8-#1\endcsname
    {\message{Illegal character}}}

\processcommalist[128,129,130,131,132,133,134,135,136,137,138,139,140,
    141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,
    157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,
    173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,
    189,190,191]\defchar  

%D Missing are 254 and 255: these bytes make up the UTF-16 byte order
%D mark and are not part of the UTF-8 encoding.

-- 
groeten,

Taco
