caml-list - the Caml user's mailing list
* [Caml-list] Announcement: PXP 1.1.92 (development version)
@ 2002-09-01  1:45 Gerd Stolpmann
  2002-09-01  8:52 ` John Max Skaller
  0 siblings, 1 reply; 4+ messages in thread
From: Gerd Stolpmann @ 2002-09-01  1:45 UTC
  To: caml-list

Hi list,

there is a new development version of PXP: 1.1.92. This version
focuses on cleaning up the way lexers are generated. There is a new tool, 
lexpp, that generates the lexers from only five files. Furthermore, many 
more 8-bit character sets are now supported as internal encodings. In 
previous versions of PXP, the internal representation of the XML trees was 
restricted to either UTF-8 or ISO-8859-1. Now, a number of additional 
encodings are supported, including the whole ISO-8859 series. 

Bugfix: If the processing instruction <?xml...?> occurs in the middle of the
XML document, version 1.1.91 will immediately stop parsing, and ignore the 
rest of the file. This is now fixed.

The new version can be found at the usual place:

http://www.ocaml-programming.de/packages/pxp-1.1.92.tar.gz

Gerd

------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
  2002-09-01  1:45 [Caml-list] Announcement: PXP 1.1.92 (development version) Gerd Stolpmann
@ 2002-09-01  8:52 ` John Max Skaller
  2002-09-01 11:57   ` Yamagata Yoriyuki
  0 siblings, 1 reply; 4+ messages in thread
From: John Max Skaller @ 2002-09-01  8:52 UTC
  To: Gerd Stolpmann; +Cc: caml-list

Gerd Stolpmann wrote:


> previous versions of PXP, the internal representation of the XML trees was 
> restricted to either UTF-8 or ISO-8859-1. Now, a number of additional 
> encodings are supported, including the whole ISO-8859 series. 


I have ALL the code sets specified at Unicode.org in
programmatic form. Easy to generate Ocaml versions
of the tables.

However, how about developing a standard I18n library
with an eye to future inclusion in the standard
distribution?

The questions are mainly: what form should the
encode/decode functions take?

My functions are in Python, and take the form:

	decode: string -> (int * string)
	encode: int -> string

where string is an 8 bit byte stream,
and int is a unicode (or other) code point.
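
As a rough illustration only, the two signatures above could be written
down in OCaml as follows. The module type CODEC, the module Latin1 and
the helper decode_all are hypothetical names invented for this sketch,
not part of PXP or of the Python code being described, and ISO-8859-1
is used because it needs no mapping tables.

```ocaml
(* Hypothetical module type mirroring the two signatures above. *)
module type CODEC = sig
  (* Decode one code point from the front of the byte string and
     return it together with the remaining bytes. *)
  val decode : string -> int * string

  (* Encode one code point as a byte string. *)
  val encode : int -> string
end

(* ISO-8859-1: every byte maps to the code point of the same value. *)
module Latin1 : CODEC = struct
  let decode s =
    if s = "" then invalid_arg "Latin1.decode: empty input";
    (Char.code s.[0], String.sub s 1 (String.length s - 1))

  let encode cp =
    if cp < 0 || cp > 0xFF then invalid_arg "Latin1.encode: out of range";
    String.make 1 (Char.chr cp)
end

(* Decode a whole string by calling [decode] until nothing is left. *)
let decode_all (module C : CODEC) s =
  let rec loop acc rest =
    if rest = "" then List.rev acc
    else
      let cp, rest' = C.decode rest in
      loop (cp :: acc) rest'
  in
  loop [] s
```

For example, decode_all (module Latin1 : CODEC) "caf\xE9" yields the
list [99; 97; 102; 233]. Note that returning the remainder as a fresh
string copies the tail of the input on every call.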

The actual python functions use dynamically loaded
data tables, but each character set has a fixed
format for the tables that knows about the raw
structure of the character set (eg what ranges of
hi and low bytes are allowed in two byte encodings
of Shift-Jis, KSC, etc). For Ocaml, we'd probably
want to bind the encodings at compile time
(since there is no well defined way to find
the data tables at run time :(

The tables are very compact, but there are quite
a few encodings -- some overhead if they're all
in the one module ..


-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850



* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
  2002-09-01  8:52 ` John Max Skaller
@ 2002-09-01 11:57   ` Yamagata Yoriyuki
  2002-09-01 13:54     ` John Max Skaller
  0 siblings, 1 reply; 4+ messages in thread
From: Yamagata Yoriyuki @ 2002-09-01 11:57 UTC
  To: caml-list

From: John Max Skaller <skaller@ozemail.com.au>
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

Data at Unicode.org for East Asian encodings are buggy.  Don't use
them.  (Moreover, the Unicode Consortium has declared that it does not
want to fix these bugs, and has made the East Asian mapping tables
obsolete.  See
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use mapping tables from glibc for my Camomile library; they seem
better debugged.

> My functions are in Python, and take the form:
> 
> 	decode: string -> (int * string)
> 	encode: int -> string
> 
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here.  (ISO-2022-JP, or JIS encoding, is stateful, and is the
standard encoding for email.)  In addition, it is inefficient.
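
To make the statefulness concrete, here is a minimal sketch of a
decoder that has to remember which character set is active.  It is an
assumption-laden toy: the names charset, token and tokens are invented
for illustration, only the escape sequences ESC ( B (ASCII) and
ESC $ B (JIS X 0208) are handled, and no Unicode mapping tables are
included, so it returns raw codes rather than code points.

```ocaml
(* Which character set the decoder is currently switched to. *)
type charset = Ascii | Jisx0208

(* One decoded unit: the active charset plus the raw code within it.
   Mapping these to Unicode would need tables that are omitted here. *)
type token = charset * int

(* Split a byte string into tokens, tracking the shift state
   introduced by escape sequences. *)
let tokens (s : string) : token list =
  let n = String.length s in
  let rec loop state i acc =
    if i >= n then List.rev acc
    else if s.[i] = '\027' then begin
      (* ESC plus two more bytes selects a character set. *)
      if i + 2 >= n then failwith "truncated escape sequence";
      match s.[i + 1], s.[i + 2] with
      | '(', 'B' -> loop Ascii (i + 3) acc
      | '$', 'B' -> loop Jisx0208 (i + 3) acc
      | _ -> failwith "unsupported escape sequence"
    end else
      match state with
      | Ascii -> loop state (i + 1) ((Ascii, Char.code s.[i]) :: acc)
      | Jisx0208 ->
        (* Two bytes form one character in this state. *)
        if i + 1 >= n then failwith "truncated two-byte character";
        let code = (Char.code s.[i] lsl 8) lor Char.code s.[i + 1] in
        loop state (i + 2) ((Jisx0208, code) :: acc)
  in
  loop Ascii 0 []
```

With the input "A\027$BAA\027(BA" the byte 0x41 is read once as the
ASCII letter A, then the pair 0x41 0x41 is read as a single two-byte
code, then 0x41 is read as A again.  A function of type
string -> (int * string) has nowhere to keep that state between calls.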

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
> 
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard.
(Even the naming of encodings is a problem.  There is no cross-platform
way of doing this.)  Introducing a generic channel type (for char and
Unicode characters) and letting third-party libraries do the conversion
is a better solution, IMO.
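
One possible shape for such a generic channel type, given purely as a
sketch: the record type source and the functions of_string, latin1 and
to_list are hypothetical names made up here, not an existing OCaml or
Camomile API.  The converter shown handles only ISO-8859-1, where the
byte value equals the code point.

```ocaml
(* A hypothetical generic input channel: a source of items of type 'a
   that returns None at end of input. *)
type 'a source = { read : unit -> 'a option }

(* A byte channel backed by an in-memory string. *)
let of_string (s : string) : char source =
  let pos = ref 0 in
  { read = (fun () ->
      if !pos >= String.length s then None
      else begin
        let c = s.[!pos] in
        incr pos;
        Some c
      end) }

(* A converter is a function from a byte channel to a code point
   channel, so a separate library could supply one per encoding. *)
let latin1 (bytes : char source) : int source =
  { read = (fun () ->
      match bytes.read () with
      | None -> None
      | Some c -> Some (Char.code c)) }

(* Drain a channel into a list. *)
let to_list (src : 'a source) : 'a list =
  let rec loop acc =
    match src.read () with
    | None -> List.rev acc
    | Some x -> loop (x :: acc)
  in
  loop []
```

Here to_list (latin1 (of_string "caf\xE9")) gives [99; 97; 102; 233].
Because each converter is a closure, it can keep private state between
reads, which is what a stateful encoding needs.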
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF


* Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
  2002-09-01 11:57   ` Yamagata Yoriyuki
@ 2002-09-01 13:54     ` John Max Skaller
  0 siblings, 0 replies; 4+ messages in thread
From: John Max Skaller @ 2002-09-01 13:54 UTC
  To: Yamagata Yoriyuki; +Cc: caml-list

Yamagata Yoriyuki wrote:


> Data at Unicode.org for East Asian encodings are buggy.  Don't use
> them.


Noted.

>>My functions are in Python, and take the form:
>>
>>	decode: string -> (int * string)
>>	encode: int -> string
>>
>>where string is an 8 bit byte stream,
>>and int is a unicode (or other) code point.
>>
> 
> This interface has a problem with stateful encodings, which are quite
> important here.  (ISO-2022-JP or JIS encoding is stateful, and
> standard encoding for email.)  In addition, it is inefficient.


Agree on both counts, though none of the encodings I handle
are stateful (I handle Shift-JIS, which isn't stateful AFAIK).

The functions I give are canonical, and they're fast
enough in Python (if you want fast, you'd use C anyhow).
There is an issue for Ocaml: what is a Unicode string like?
My answer would be 'array of int'. But another answer
is 'string with UTF-8 encoding'.


In theory, mappings and codecs are orthogonal.

UTF-8 has nothing to do with Unicode, it works
just fine for any national character set.
In practice, many character sets are defined
by two byte encodings.

So you might want a function:

	Shift-Jis -> Unicode as UTF-8

modelled by

	string -> string (8 bit clean strings)

That can be made from the canonical functions,

but it isn't efficient to do the conversion
via an integer intermediate form.
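
A sketch of that composition, under stated assumptions: ISO-8859-1
stands in for Shift-JIS because its mapping needs no tables, and the
names utf8_encode, latin1_decode, transcode and latin1_to_utf8 are
invented for illustration.  The UTF-8 encoder follows the standard bit
layout and is limited here to values up to 0x10FFFF.

```ocaml
(* Canonical encoder: one code point to its UTF-8 byte sequence. *)
let utf8_encode (cp : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF then invalid_arg "utf8_encode"
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b

(* Canonical decoder for the stand-in source encoding. *)
let latin1_decode (s : string) : int * string =
  (Char.code s.[0], String.sub s 1 (String.length s - 1))

(* string -> string conversion built from the two canonical
   functions: decode one code point, re-encode it, repeat. *)
let transcode decode encode (s : string) : string =
  let out = Buffer.create (String.length s) in
  let rec loop rest =
    if rest <> "" then begin
      let cp, rest' = decode rest in
      Buffer.add_string out (encode cp);
      loop rest'
    end
  in
  loop s;
  Buffer.contents out

let latin1_to_utf8 (s : string) : string =
  transcode latin1_decode utf8_encode s
```

For example, latin1_to_utf8 "caf\xE9" returns "caf\xC3\xA9".  Every
character passes through an int and two freshly allocated strings,
which is the cost referred to above.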


> I read somewhere that Perl6 delegates code conversion to add-on
> programs, since making standard mapping tables is really hard.
> (Even naming of encodings is a problem.  There is no cross-platform
> way of this.)  Introducing generic channel type (for char and unicode
> character) and letting 3rd party libraries do conversion is better
> solution, IMO.


Well, you also want in-core conversions. And then a third
party library is an arbitrary function. The problem
is that people are rewriting these functions for each
application that needs some i18n support. Reuse would
be better, but that requires some form of
standardisation. It's hard both to get the conversions
right and to make them efficient. I spent ages
converting the unicode.org data (I also found a bug
in the UNICODE tables).

The problem is: 'third party libraries' might be a reasonable
answer for a C program. It's not so reasonable for Ocaml:
where are they? We're short of useful libraries, and indeed
of a mechanism to install and access them.

-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850



