caml-list - the Caml user's mailing list
* Storing UTF-8 in plain strings
@ 2009-08-12 17:36 Dario Teixeira
  2009-08-12 18:24 ` Michael Ekstrand
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Dario Teixeira @ 2009-08-12 17:36 UTC
  To: caml-list

Hi,

I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
on plain strings for processing and storing data.  I *think* I can get away
with using only the String module to handle this variable-length encoding
as long as I am careful with the way I treat these strings.  Here are the
assumptions I am making:

- If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
  I can therefore assume in subsequent steps that the source is compliant.

- I must not use String.get, String.sub, String.length, or any other
  function that works on byte offsets or byte counts, since these know
  nothing about the variable-length encoding.

- String concatenation is allowed.

- Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
  because in a multi-byte sequence all bytes have a value > 127.  There is
  therefore no chance of splitting a multi-byte sequence down the middle.
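
The last two assumptions can be checked with a small sketch.  It is only an
illustration, not code from my parser: it uses the standard library's
String.split_on_char (OCaml 4.04 or later) as a stand-in for Extlib's
String.nsplit, and the names lines, utf8_length and sample are made up for
this example.  The sample text is spelled out as raw bytes so that the
encoding is explicit.

```ocaml
(* Split on the newline byte (0x0a).  String.split_on_char is the stdlib
   stand-in assumed here for Extlib's String.nsplit. *)
let lines s = String.split_on_char '\n' s

(* Count code points by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes [s] has already been validated as UTF-8. *)
let utf8_length s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* "été\n日本" as explicit UTF-8 bytes:
   é = C3 A9, 日 = E6 97 A5, 本 = E6 9C AC. *)
let sample = "\xC3\xA9t\xC3\xA9\n\xE6\x97\xA5\xE6\x9C\xAC"

let () =
  let parts = lines sample in
  (* No multi-byte sequence was cut: both pieces are intact. *)
  assert (parts = ["\xC3\xA9t\xC3\xA9"; "\xE6\x97\xA5\xE6\x9C\xAC"]);
  (* Concatenation restores the original bytes exactly. *)
  assert (String.concat "\n" parts = sample);
  (* String.length counts bytes, not characters. *)
  assert (String.length sample = 12);
  assert (utf8_length sample = 6)
```

The gap between String.length (12 bytes) and utf8_length (6 code points) is
the reason byte-indexed functions are off limits, while splitting on an
ASCII separator and concatenating whole strings leave every sequence intact.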


So, can someone find any problems with this reasoning?  (Thanks in advance!)

Best regards,
Dario Teixeira

P.S. And yes, I am aware that there are excellent libraries for handling
     UTF-8 (like the Rope module in Batteries).








Thread overview: 8+ messages
2009-08-12 17:36 Storing UTF-8 in plain strings Dario Teixeira
2009-08-12 18:24 ` Michael Ekstrand
2009-08-12 20:51 ` [Caml-list] " Edgar Friendly
2009-08-12 21:33   ` Jake Donham
2009-08-13  6:10 ` Florian Hars
2009-08-13  9:37 ` Dario Teixeira
2009-08-13 13:14   ` Dario Teixeira
2009-08-13 11:00 ` Richard Jones
