caml-list - the Caml user's mailing list
From: Yamagata Yoriyuki <yoriyuki@mbg.sphere.ne.jp>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 20:57:10 +0900 (JST)
Message-ID: <20020901.205710.118791924.yoriyuki@mbg.sphere.ne.jp>
In-Reply-To: <3D71D544.4010509@ozemail.com.au>

From: John Max Skaller <skaller@ozemail.com.au>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

The data at Unicode.org for East Asian encodings is buggy.  Don't use
it.  (Moreover, the Unicode Consortium has declared that it does not
want to fix these bugs and has made the East Asian mapping tables
obsolete; see
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use the mapping tables from glibc for my Camomile, which seem
better debugged.

> My functions are in Python, and take the form:
> 
> 	decode: string -> (int * string)
> 	encode: int -> string
> 
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here.  (ISO-2022-JP, also called the JIS encoding, is
stateful and is the standard encoding for email.)  In addition, it is
inefficient.
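
To illustrate the statefulness problem, here is a minimal sketch using the incremental decoder from Python's standard `codecs` module. It is not the quoted `decode` function; the variable names are only for illustration. The same two bytes decode differently depending on whether an escape sequence has been seen before them, so a decoder that takes only a byte string cannot get this right.

```python
import codecs

# "日本" in ISO-2022-JP: ESC $ B selects JIS X 0208, ESC ( B returns to ASCII.
encoded = "日本".encode("iso2022_jp")

# The two bytes that stand for the first character, decoded on their own.
# With no shift state carried over, they are read as plain ASCII.
stateless = encoded[3:5].decode("iso2022_jp")

# An incremental decoder keeps the shift state between calls, so the
# input can arrive in arbitrary pieces.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
chunks = [encoded[0:3], encoded[3:5], encoded[5:7], encoded[7:]]
stateful = "".join(decoder.decode(chunk) for chunk in chunks)

print(repr(stateless), repr(stateful))
```

A stateless interface could be rescued by threading the shift state through every call as an extra argument and result, which is essentially what the decoder object does internally.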

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
> 
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl 6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard.
(Even the naming of encodings is a problem.  There is no
cross-platform way to name them.)  Introducing a generic channel type
(for char and Unicode characters) and letting third-party libraries do
the conversion is a better solution, IMO.
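
A rough sketch of what I mean by a generic channel, written in Python only because the functions under discussion are in Python. `InChannel` and `decode_channel` are hypothetical names, not an existing API; the decoder is whatever a conversion library supplies (here the standard `codecs` module stands in for one).

```python
import codecs
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class InChannel(Generic[T]):
    """Hypothetical generic input channel: a stream of items of type T."""

    def __init__(self, items: Iterable[T]) -> None:
        self._items = iter(items)

    def __iter__(self) -> Iterator[T]:
        return self._items


def decode_channel(
    source: InChannel[bytes],
    make_decoder: Callable[[], codecs.IncrementalDecoder],
) -> InChannel[int]:
    """Wrap a byte channel as a channel of Unicode code points.

    The core only knows the channel type; the decoder comes from
    whichever library implements the encoding.
    """

    def code_points() -> Iterator[int]:
        decoder = make_decoder()
        for chunk in source:
            for ch in decoder.decode(chunk):
                yield ord(ch)
        # Flush anything the decoder was still holding at end of input.
        for ch in decoder.decode(b"", final=True):
            yield ord(ch)

    return InChannel(code_points())


# The chunk boundary falls right after the first two-byte character.
raw = InChannel([b"\x1b$BF|", b"K\\\x1b(B"])
points = list(decode_channel(raw, codecs.getincrementaldecoder("iso2022_jp")))
print([hex(p) for p in points])
```

The point of the design is that the standard library would not need to ship any mapping tables or agree on encoding names; it only fixes the channel type that converters plug into.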
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF
