caml-list - the Caml user's mailing list
From: John Max Skaller <skaller@ozemail.com.au>
To: Yamagata Yoriyuki <yoriyuki@mbg.sphere.ne.jp>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 23:54:29 +1000
Message-ID: <3D721C15.5000308@ozemail.com.au>
In-Reply-To: <20020901.205710.118791924.yoriyuki@mbg.sphere.ne.jp>

Yamagata Yoriyuki wrote:


> Data at Unicode.org for East Asian encodings are buggy.  Don't use
> them.


Noted.

>>My functions are in Python, and take the form:
>>
>>	decode: string -> (int * string)
>>	encode: int -> string
>>
>>where string is an 8 bit byte stream,
>>and int is a unicode (or other) code point.
>>
> 
> This interface has a problem with stateful encodings, which are quite
> important here.  (ISO-2022-JP or JIS encoding is stateful, and
> standard encoding for email.)  In addition, it is inefficient.


Agreed on both counts, though none of the encodings I handle
are stateful (I handle Shift-JIS, which isn't stateful AFAIK).
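To make the stateful point concrete, here is a minimal sketch using
the ISO-2022-JP codec from Python's standard library (the variable
names are only for illustration). The same two bytes decode
differently depending on the escape sequence that preceded them, so a
decoder needs somewhere to keep the current shift state:

```python
import codecs

# The same two payload bytes, 0x24 0x22, mean different things
# depending on which escape sequence came before them.
ascii_mode = b'$"'               # no escape: plain ASCII
jis_mode = b'\x1b$B$"\x1b(B'     # ESC $ B switches to JIS X 0208

plain = ascii_mode.decode("iso2022_jp")   # the two ASCII characters $ and "
kana = jis_mode.decode("iso2022_jp")      # one hiragana character, U+3042

# An incremental decoder carries the shift state between calls, which
# a pure function of type string -> (int * string) has nowhere to keep.
dec = codecs.getincrementaldecoder("iso2022_jp")()
first = dec.decode(b"\x1b$B")    # only an escape sequence: no output yet
second = dec.decode(b'$"')       # decoded in JIS X 0208 mode
```

A stateless interface could be rescued by threading the state through
explicitly, e.g. decode: state * string -> (int * state * string), but
that is a different signature from the one quoted above.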

The functions I give are canonical, and they're fast
enough in Python (if you want fast, you'd use C anyhow).
There is an issue for Ocaml: what is a Unicode string like?
My answer would be 'array of int'. But another answer
is 'string with UTF-8 encoding'.
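As an illustration only (this is not my actual code), a pair of
canonical functions with the shapes given above might look like this
for UTF-8. The names encode, decode and to_code_points are chosen for
the sketch, and it deliberately skips checks for overlong forms and
surrogates:

```python
from typing import List, Tuple


def encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode(data: bytes) -> Tuple[int, bytes]:
    """Decode the first code point; return it with the remaining bytes."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        n, cp = 1, lead
    elif lead >> 5 == 0b110:
        n, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        n, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        n, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    for b in data[1:n]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, data[n:]


def to_code_points(data: bytes) -> List[int]:
    """The 'array of int' view of a UTF-8 encoded string."""
    out = []
    while data:
        cp, data = decode(data)
        out.append(cp)
    return out
```

With these, to_code_points(b"caf\xc3\xa9") gives the 'array of int'
representation, and joining encode(cp) over that list gives back the
'string with UTF-8 encoding' representation. Note that decode copies
the tail of the input on every call, which is part of why this
interface is simple but not fast.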


In theory, mappings and codecs are orthogonal.

UTF-8 is not tied to Unicode: as a codec it works
just fine for the code points of any national character set.
In practice, many character sets are defined
by two byte encodings.

So you might want a function:

	Shift-Jis -> Unicode as UTF-8

modelled by

	string -> string (8 bit clean strings)

That can be made from the canonical functions,
but it isn't efficient to do the conversion
via an integer intermediate form.
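A rough sketch of the two routes, assuming Python's standard
"shift_jis" codec supplies the mapping table (the function names here
are hypothetical, not from any library):

```python
def decode_sjis(data: bytes):
    """Canonical-style decode: first Shift-JIS character -> (code point, rest)."""
    lead = data[0]
    # Lead bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character.
    n = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC) else 1
    ch = data[:n].decode("shift_jis")   # the stdlib table does the mapping
    return ord(ch), data[n:]


def encode_utf8(cp: int) -> bytes:
    """Canonical-style encode: one code point -> UTF-8 bytes."""
    return chr(cp).encode("utf-8")


def sjis_to_utf8_via_ints(data: bytes) -> bytes:
    """string -> string, built from the canonical functions one integer at a time."""
    out = bytearray()
    while data:
        cp, data = decode_sjis(data)
        out += encode_utf8(cp)
    return bytes(out)


def sjis_to_utf8_direct(data: bytes) -> bytes:
    """string -> string in one pass, with no per-character Python-level step."""
    return data.decode("shift_jis").encode("utf-8")
```

Both functions should agree on valid input; the first pays for a
function call, a slice and a small allocation per character. The
"direct" version here still passes through Python's own string type
internally, so it only stands in for a real fused table-driven
converter.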


> I read somewhere that Perl6 delegates code conversion to add-on
> programs, since making standard mapping tables is really hard.
> (Even naming of encodings is a problem.  There is no cross-platform
> way of doing this.)  Introducing a generic channel type (for char and
> unicode character) and letting 3rd party libraries do conversion is a
> better solution, IMO.


Well, you also want in-core conversions. And then a third
party library is an arbitrary function. The problem
is that people are rewriting these functions for each
application that needs some i18n support. Reuse would
be better, but that requires some form of
standardisation. It's both hard to get the conversions
right, and also to make them efficient. I spent ages
converting the unicode.org data (I also found a bug
in the UNICODE tables).

The problem is: 'third party libraries' might be a reasonable
answer for a C program. It's not so reasonable for Ocaml:
where are they? We're short of useful libraries, and indeed
of a mechanism to install and access them.

-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850


