From: Taco Hoekwater <taco@elvenkind.com>
Cc: pragma@wxs.nl
Subject: Re: DocBookInContext & multi-languages (newbie) / utf
Date: Mon, 2 Dec 2002 17:36:01 +0100
Message-ID: <20021202173601.4c5ef34c.taco@elvenkind.com>
In-Reply-To: <5.1.0.14.1.20021202153537.03d21258@server-1>
UTF8 encoding is rather simple, really:
byte number:
  b1           b2           b3           b4
  0 -- 127                                            = unicode 0x00 - 0x7F
  192 -- 223   128 -- 191                             = unicode 0x80 - 0x7FF
  224 -- 239   128 -- 191   128 -- 191                = unicode 0x800 - 0xFFFF
  240 -- 247   128 -- 191   128 -- 191   128 -- 191   = unicode 0x10000 - 0x1FFFFF
(Unicode itself stops at 0x10FFFF, so in practice b1 never exceeds 244.)
There are also sequences for 5 and 6 bytes, but these are illegal for Unicode
representations at the moment:
  248 -- 251   128 -- 191   128 -- 191   128 -- 191   128 -- 191
  252 -- 253   128 -- 191   128 -- 191   128 -- 191   128 -- 191   128 -- 191
128 -- 191 are illegal as first chars in UTF8 (that is handy for error-recovery).
254 and 255 are completely illegal and should not appear at all (if you see them,
it's a safe bet that the document is encoded as UTF16, not UTF8).
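The byte ranges above are enough to classify any byte on sight. What follows is only a minimal sketch of that idea in Python, not code from ConTeXt; the function names `utf8_sequence_length` and `resync` are made up for illustration. It treats the 5- and 6-byte lead bytes as invalid, in line with the remark that they are illegal for Unicode.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a sequence starting with first_byte should have.

    Returns 0 when the byte cannot start a sequence at all.
    """
    if 0 <= first_byte <= 127:
        return 1
    if 192 <= first_byte <= 223:
        return 2
    if 224 <= first_byte <= 239:
        return 3
    if 240 <= first_byte <= 247:
        return 4
    # 128-191: continuation byte, never a first byte
    # 248-253: old 5/6-byte forms, not legal for Unicode
    # 254-255: never valid in UTF8
    return 0


def resync(data: bytes, pos: int) -> int:
    """Error recovery: skip continuation bytes (128-191) starting at pos.

    Returns the index of the next byte that could start a sequence,
    or len(data) if there is none.
    """
    while pos < len(data) and 128 <= data[pos] <= 191:
        pos += 1
    return pos
```

Because 128 -- 191 can never start a sequence, a reader that lands in the middle of a character only has to skip forward until `resync` stops.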
The unicode number for a UTF8 sequence can be calculated as:
  byte1                                                    if byte1 <= 127
  (byte1-192)*64 + (byte2-128)                             if 192 <= byte1 <= 223
  (byte1-224)*4096 + (byte2-128)*64 + (byte3-128)          if 224 <= byte1 <= 239
  (byte1-240)*262144 + (byte2-128)*4096 + (byte3-128)*64 + (byte4-128)
                                                           if 240 <= byte1 <= 247
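The same arithmetic, written out as a small Python sketch. The function name `decode_utf8_sequence` is hypothetical, and the sketch assumes it is handed exactly one complete sequence; it checks lengths and continuation bytes but does not reject overlong encodings or values above 0x10FFFF.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Turn one complete UTF8 byte sequence into its unicode number."""
    if not seq:
        raise ValueError("empty sequence")
    b = list(seq)
    # Every byte after the first must be a continuation byte (128-191).
    if any(not 128 <= c <= 191 for c in b[1:]):
        raise ValueError("bad continuation byte")
    b1 = b[0]
    if b1 <= 127 and len(b) == 1:
        return b1
    if 192 <= b1 <= 223 and len(b) == 2:
        return (b1 - 192) * 64 + (b[1] - 128)
    if 224 <= b1 <= 239 and len(b) == 3:
        return (b1 - 224) * 4096 + (b[1] - 128) * 64 + (b[2] - 128)
    if 240 <= b1 <= 247 and len(b) == 4:
        return ((b1 - 240) * 262144 + (b[1] - 128) * 4096
                + (b[2] - 128) * 64 + (b[3] - 128))
    raise ValueError("illegal first byte or wrong sequence length")


# The two bytes 195 169 are e-acute: (195-192)*64 + (169-128) = 233 = 0xE9
print(hex(decode_utf8_sequence(bytes([195, 169]))))
```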
Simple, eh?
--
groeten,
Taco