caml-list - the Caml user's mailing list
From: Michael Ekstrand <michael+ocaml@elehack.net>
To: caml-list@inria.fr
Subject: Re: Storing UTF-8 in plain strings
Date: Wed, 12 Aug 2009 13:24:10 -0500
Message-ID: <h5v1bv$otn$1@ger.gmane.org>
In-Reply-To: <51021.92432.qm@web111506.mail.gq1.yahoo.com>

Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
> - String concatenation is allowed.
> 
> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
> 
> 
> So, can someone find any problems with this reasoning?  (Thanks in advance!)

It looks good to me.  Further, if you do need the functionality you're
forbidding yourself in String somewhere in your program, Extlib's UTF8
module provides much of it without additional dependencies.
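
To illustrate why the byte-oriented String functions are off limits, here
is a minimal sketch using only the standard library.  The helper name
utf8_length is hypothetical (it is not Extlib's API), and it assumes the
input has already been validated as UTF-8, as Ulex would guarantee:

```ocaml
(* Count code points in a string assumed to be valid UTF-8.
   Every code point starts with exactly one byte that is NOT a
   continuation byte (continuation bytes look like 10xxxxxx), so
   counting the non-continuation bytes gives the code-point count. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* "héllo": the é is encoded as the two bytes C3 A9 *)
  let s = "h\xC3\xA9llo" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (utf8_length s)
```

String.length reports 6 here while the text has 5 characters, which is
the kind of mismatch that makes String.get and String.sub unsafe on
multi-byte data.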

You can also go ahead and use buffers, sprintf, etc., so long as
everything is valid UTF-8.  They won't care; they'll only see a sequence
of bytes.
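
As a rough sketch of that point, the following assembles UTF-8 text with
Buffer and sprintf and then splits it on newlines.  It uses the standard
library's String.split_on_char in place of Extlib's String.nsplit, purely
to keep the example free of dependencies; the sample strings are made up
for illustration:

```ocaml
(* U+20AC EURO SIGN, three bytes in UTF-8 *)
let euro = "\xE2\x82\xAC"

(* Build a two-line UTF-8 text; Buffer and sprintf just copy bytes. *)
let text =
  let b = Buffer.create 16 in
  Buffer.add_string b (Printf.sprintf "price: 5 %s\n" euro);
  Buffer.add_string b "caf\xC3\xA9";   (* "café" *)
  Buffer.contents b

(* Splitting on '\n' (0x0a) cannot cut a multi-byte sequence, because
   every byte of such a sequence has a value above 127. *)
let lines = String.split_on_char '\n' text

let () = List.iter print_endline lines
```

Joining the pieces back with "\n" reproduces the original text byte for
byte, so nothing is lost or corrupted along the way.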

- Michael


