Storing UTF-8 in plain strings

caml-list - the Caml user's mailing list

* Storing UTF-8 in plain strings
From: Dario Teixeira @ 2009-08-12 17:36 UTC
  To: caml-list

Hi,

I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
on plain strings for processing and storing data.  I *think* I can get away
with using only the String module to handle this variable-length encoding
as long as I am careful with the way I treat these strings.  Here are the
assumptions I am making:

- If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
  I can therefore assume in subsequent steps that the source is compliant.

- It is forbidden to use String.get, String.sub, String.length, or other
  functions where awareness of variable-length encoding is required.

- String concatenation is allowed.

- Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
  because in a multi-byte sequence all bytes have a value > 127.  There is
  therefore no chance of splitting a multi-byte sequence down the middle.


So, can someone find any problems with this reasoning?  (Thanks in advance!)

Best regards,
Dario Teixeira

P.S. And yes, I am aware that there are excellent libraries for handling
     UTF-8 (like the Rope module in Batteries).






* Re: Storing UTF-8 in plain strings
From: Michael Ekstrand @ 2009-08-12 18:24 UTC
  To: caml-list

Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
> - String concatenation is allowed.
> 
> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
> 
> 
> So, can someone find any problems with this reasoning?  (Thanks in advance!)

It looks good to me.  Further, much of the functionality you're
forbidding yourself in String is provided by Extlib's UTF8 module
without additional dependencies if you do need it at some place in your
program.

You can also go ahead and use buffers, sprintf, etc., so long as
everything is valid UTF-8.  They won't care; they'll only see a sequence
of bytes.
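
As a minimal sketch of the point above, the snippet below builds the same
UTF-8 text once with Printf.sprintf and once with Buffer.  The sample
strings and the names greeting, name, via_sprintf and via_buffer are
illustrative assumptions; the non-ASCII characters are written as byte
escapes so the source file itself stays plain ASCII.

```ocaml
(* "h\xc3\xa9llo" is "héllo" encoded as UTF-8. *)
let greeting = "h\xc3\xa9llo"

(* "\xe6\x97\xa5\xe6\x9c\xac" is "日本" encoded as UTF-8. *)
let name = "\xe6\x97\xa5\xe6\x9c\xac"

(* %s copies the argument's bytes verbatim, so valid UTF-8 stays valid. *)
let via_sprintf = Printf.sprintf "%s, %s!" greeting name

(* Buffer only appends bytes; it never inspects or re-encodes them. *)
let via_buffer =
  let b = Buffer.create 16 in
  Buffer.add_string b greeting;
  Buffer.add_string b ", ";
  Buffer.add_string b name;
  Buffer.add_char b '!';
  Buffer.contents b

let () = assert (via_sprintf = via_buffer)
```

Both routes only ever append whole, already-valid strings (or ASCII
characters), which is why the result is valid UTF-8 as well.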

- Michael


* Re: [Caml-list] Storing UTF-8 in plain strings
From: Edgar Friendly @ 2009-08-12 20:51 UTC
  To: Dario Teixeira; +Cc: caml-list

Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
This is the weakest assumption of the four - Ulex could parse and only
raise MalFormed on some errors.  I'm no expert on Ulex, though...

> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
Yes, those functions work on bytes, not on characters.
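
To illustrate the difference, here is a small sketch that compares
String.length with a code point count.  The helper utf8_length is a
hypothetical name, not part of the standard library, and it assumes its
input is already valid UTF-8.

```ocaml
(* Count code points by counting every byte that is NOT a continuation
   byte (0b10xxxxxx).  Only meaningful on valid UTF-8 input. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if (Char.code c) land 0xC0 <> 0x80 then incr n) s;
  !n

(* "na\xc3\xafve" is "naïve" encoded as UTF-8: 5 code points, 6 bytes. *)
let s = "na\xc3\xafve"

let () =
  assert (String.length s = 6);   (* byte count *)
  assert (utf8_length s = 5)      (* code point count *)
```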

> - String concatenation is allowed.
> 
Yes, two valid UTF-8 strings concatenate into another valid UTF-8 string.

> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
>
Yes, you can split on low bytes: multibyte characters start with
0b11xx xxxx and continue with 0b10xx xxxx, so none of their bytes can be
mistaken for an ASCII separator.
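
A short sketch of the byte classes just described, together with a split
on newline.  It uses the standard library's String.split_on_char as a
stand-in for Extlib's String.nsplit (an assumption made to keep the
example dependency-free); byte_kind and classify are hypothetical names.

```ocaml
type byte_kind = Ascii | Lead | Continuation

(* Classify a byte by its high bits.  Bytes 0xF8..0xFF never occur in
   valid UTF-8 but are lumped in with Lead here for simplicity. *)
let classify (c : char) : byte_kind =
  let b = Char.code c in
  if b < 0x80 then Ascii
  else if b land 0xC0 = 0x80 then Continuation
  else Lead

(* "café", newline, "日本", newline, "ok", all as UTF-8 bytes. *)
let text = "caf\xc3\xa9\n\xe6\x97\xa5\xe6\x9c\xac\nok"

(* '\n' is 0x0A, an Ascii byte, so the split can never land inside a
   multi-byte sequence. *)
let lines = String.split_on_char '\n' text

let () =
  assert (lines = ["caf\xc3\xa9"; "\xe6\x97\xa5\xe6\x9c\xac"; "ok"]);
  assert (String.concat "\n" lines = text)
```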

E


* Re: [Caml-list] Storing UTF-8 in plain strings
From: Jake Donham @ 2009-08-12 21:33 UTC
  To: Edgar Friendly; +Cc: Dario Teixeira, caml-list

On Wed, Aug 12, 2009 at 1:51 PM, Edgar Friendly<thelema314@gmail.com> wrote:
>> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>>   I can therefore assume in subsequent steps that the source is compliant.
>>
> This is the weakest assumption of the four - Ulex could parse and only
> raise MalFormed on some errors.  I'm no expert on Ulex, though...

The original poster might be interested in the Netconversion module
from Ocamlnet, which is designed to work with UTF-8 stored in OCaml
strings. In particular it has a function

  Netconversion.verify `Enc_utf8 s

which checks that s is a valid UTF-8 string. It also has equivalents
for String.{get,sub,length}.

Jake


* Re: [Caml-list] Storing UTF-8 in plain strings
From: Florian Hars @ 2009-08-13  6:10 UTC
  To: Dario Teixeira; +Cc: caml-list

Dario Teixeira wrote:
> So, can someone find any problems with this reasoning?

No, the kind of compatibility with legacy code you described is
one of the original design goals of UTF-8, see
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

- Florian.


* Re: [Caml-list] Storing UTF-8 in plain strings
From: Dario Teixeira @ 2009-08-13  9:37 UTC
  To: caml-list

Hi,

> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:

Thank you all for your comments.  Ulex has caught all the intentionally
malformed code points I've inserted in the stream, so I'm fairly confident
it's up to the task.  But if I find a problem I'll keep Netconversion's
and Extlib's validation functions in mind...

Best regards,
Dario Teixeira






* Re: [Caml-list] Storing UTF-8 in plain strings
From: Richard Jones @ 2009-08-13 11:00 UTC
  To: Dario Teixeira; +Cc: caml-list

On Wed, Aug 12, 2009 at 10:36:56AM -0700, Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.

Needless to say, don't use String.uppercase, String.lowercase,
String.capitalize, String.uncapitalize, Char.uppercase or
Char.lowercase.  These all assume ISO-8859-1.
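
A sketch of why the Latin-1 case functions are dangerous on UTF-8 data.
The function latin1_lowercase_char below is a hypothetical emulation of
ISO-8859-1 lowercasing, written out by hand so the example does not
depend on the old functions being available; String.lowercase_ascii and
String.uppercase_ascii are the ASCII-only variants found in OCaml 4.03
and later, which leave all bytes above 127 alone.

```ocaml
(* Hand-written emulation of ISO-8859-1 lowercasing: A-Z and the
   accented capitals 0xC0..0xDE (except 0xD7) are shifted by 32. *)
let latin1_lowercase_char (c : char) : char =
  let b = Char.code c in
  if (b >= 0x41 && b <= 0x5A)
     || (b >= 0xC0 && b <= 0xD6)
     || (b >= 0xD8 && b <= 0xDE)
  then Char.chr (b + 32)
  else c

(* "\xc3\x89" is "É" (U+00C9) encoded as UTF-8. *)
let e_acute_upper = "\xc3\x89"

(* The lead byte 0xC3 is rewritten to 0xE3, which announces a 3-byte
   sequence that is not there: the result is no longer valid UTF-8. *)
let mangled = String.map latin1_lowercase_char e_acute_upper
let () = assert (mangled = "\xe3\x89")

(* The ASCII-only functions change just the bytes for a-z / A-Z. *)
let () =
  assert (String.lowercase_ascii e_acute_upper = e_acute_upper);
  assert (String.uppercase_ascii "caf\xc3\xa9" = "CAF\xc3\xa9")
```

Note that the ASCII-only functions keep the string valid but do not
change the case of accented letters; that still needs a Unicode-aware
library such as Camomile.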

I've written a number of applications which used UTF-8 extensively,
including one which worked entirely in Japanese, and I've never had a
problem.  Just avoid the bad String/Char functions.  Use either a
database or a module like Ulex/Camomile.  You'll be fine.

Rich.

-- 
Richard Jones
Red Hat


* Re: [Caml-list] Storing UTF-8 in plain strings
From: Dario Teixeira @ 2009-08-13 13:14 UTC
  To: caml-list

Hi,

> Thank you all for your comments.  Ulex has caught all the intentionally
> malformed code points I've inserted in the stream, so I'm fairly confident
> it's up to the task.  But if I find a problem I'll keep Netconversion's
> and Extlib's validation functions in mind...

By the way, I just noticed that the 'validate' function in Extlib's UTF8
module accepts 5-byte and 6-byte sequences.  Though these were part of
UTF-8's original specification, RFC 3629 removed them by limiting UTF-8
to sequences of at most four bytes.
Perhaps adding a 'Deprecated_code' exception for these cases is in order?
(Or just raise the existing 'Malformed_code' exception).  Note that Ulex
correctly raises an exception if any of these deprecated sequences are
found.
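
For reference, a minimal sketch of a strict validator that follows the
well-formed byte ranges of RFC 3629.  It is not the code of Ulex, Extlib
or Netconversion; is_valid_utf8 is a hypothetical name.  Besides 5-byte
and 6-byte sequences it also rejects overlong encodings, UTF-16
surrogates and code points above U+10FFFF.

```ocaml
let is_valid_utf8 (s : string) : bool =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* True if byte i exists and lies in [lo, hi]. *)
  let in_range i lo hi = i < n && byte i >= lo && byte i <= hi in
  (* True if byte i exists and is a continuation byte. *)
  let cont i = in_range i 0x80 0xBF in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)
      else if b >= 0xC2 && b <= 0xDF then
        cont (i + 1) && go (i + 2)
      else if b = 0xE0 then            (* excludes overlong 3-byte forms *)
        in_range (i + 1) 0xA0 0xBF && cont (i + 2) && go (i + 3)
      else if b = 0xED then            (* excludes surrogates D800..DFFF *)
        in_range (i + 1) 0x80 0x9F && cont (i + 2) && go (i + 3)
      else if b >= 0xE1 && b <= 0xEF then
        cont (i + 1) && cont (i + 2) && go (i + 3)
      else if b = 0xF0 then            (* excludes overlong 4-byte forms *)
        in_range (i + 1) 0x90 0xBF && cont (i + 2) && cont (i + 3)
        && go (i + 4)
      else if b >= 0xF1 && b <= 0xF3 then
        cont (i + 1) && cont (i + 2) && cont (i + 3) && go (i + 4)
      else if b = 0xF4 then            (* caps code points at U+10FFFF *)
        in_range (i + 1) 0x80 0x8F && cont (i + 2) && cont (i + 3)
        && go (i + 4)
      else false  (* 0x80..0xC1 and 0xF5..0xFF, incl. 5/6-byte leads *)
  in
  go 0

let () =
  assert (is_valid_utf8 "caf\xc3\xa9");
  (* 5-byte and 6-byte forms from the original UTF-8 specification. *)
  assert (not (is_valid_utf8 "\xf8\x88\x80\x80\x80"));
  assert (not (is_valid_utf8 "\xfc\x84\x80\x80\x80\x80"))
```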

Cheers,
Dario Teixeira
