From: Taco Hoekwater
Newsgroups: gmane.comp.tex.context
Subject: Re: DocBookInContext & multi-languages (newbie) / utf
Date: Mon, 2 Dec 2002 17:36:01 +0100
Organization: Elvenkind
Message-ID: <20021202173601.4c5ef34c.taco@elvenkind.com>

UTF8 encoding is rather simple, really:

  byte number:
  b1          b2          b3          b4
  0 -- 127                                        = Unicode 0x00 - 0x7F
  192 -- 223  128 -- 191                          = Unicode 0x80 - 0x7FF
  224 -- 239  128 -- 191  128 -- 191              = Unicode 0x800 - 0xFFFF
  240 -- 247  128 -- 191  128 -- 191  128 -- 191  = Unicode 0x10000 - 0x1FFFFF

(The four-byte form can hold values up to 0x1FFFFF, but Unicode itself
ends at 0x10FFFF.)

There are also sequences for 5 and 6 bytes, but these are illegal
for Unicode representations at the moment:

  248 -- 251  128 -- 191  128 -- 191  128 -- 191  128 -- 191
  252 -- 253  128 -- 191  128 -- 191  128 -- 191  128 -- 191  128 -- 191

128 -- 191 are illegal as first chars in UTF8 (that is handy for
error-recovery).

254 and 255 are completely illegal and should not appear at all
(if you see them, it's a safe bet that the document is encoded as
UTF16, not UTF8).

The Unicode number for a UTF8 sequence can be calculated as:

  byte1                                            if byte1 <= 127
  (byte1-192)*64 + (byte2-128)                     if 192 <= byte1 <= 223
  (byte1-224)*4096 + (byte2-128)*64 + (byte3-128)  if 224 <= byte1 <= 239
  (byte1-240)*262144 + (byte2-128)*4096 + (byte3-128)*64 + (byte4-128)
                                                   if 240 <= byte1 <= 247

Simple, eh?

--
groeten, Taco
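
For anyone who wants to try the calculation above, here is a minimal Python sketch of it. The function name decode_utf8_sequence is made up for this illustration, and the code assumes it is handed exactly one complete sequence of 1 to 4 bytes. It only checks the byte ranges from the table; it does not reject overlong encodings, surrogates, or values above 0x10FFFF, so treat it as a demonstration of the arithmetic and not as a validating decoder.

```python
def decode_utf8_sequence(seq):
    """Return the Unicode number for ONE UTF-8 sequence (1 to 4 bytes).

    Hypothetical helper for illustration: it applies the range table and
    the multiply-and-add formula, nothing more.
    """
    data = list(seq)
    if not data:
        raise ValueError("empty sequence")
    byte1 = data[0]

    # The first byte tells us the sequence length and what to subtract.
    if byte1 <= 127:
        length, value = 1, byte1
    elif 192 <= byte1 <= 223:
        length, value = 2, byte1 - 192
    elif 224 <= byte1 <= 239:
        length, value = 3, byte1 - 224
    elif 240 <= byte1 <= 247:
        length, value = 4, byte1 - 240
    else:
        # 128..191 is a continuation byte, 248..255 is not legal here.
        raise ValueError("illegal first byte: %d" % byte1)

    if len(data) != length:
        raise ValueError("expected %d bytes, got %d" % (length, len(data)))

    # Each following byte contributes 6 bits: multiplying the running
    # value by 64 per step yields the 64 / 4096 / 262144 factors.
    for cont in data[1:]:
        if not 128 <= cont <= 191:
            raise ValueError("illegal continuation byte: %d" % cont)
        value = value * 64 + (cont - 128)
    return value


if __name__ == "__main__":
    for text in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        encoded = text.encode("utf-8")
        print(list(encoded), "->", hex(decode_utf8_sequence(encoded)))
```

The loop is just the formula written in nested form: for a three-byte sequence, ((byte1-224)*64 + (byte2-128))*64 + (byte3-128) expands to (byte1-224)*4096 + (byte2-128)*64 + (byte3-128).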