caml-list - the Caml user's mailing list
From: Julien Moutinho <julien.moutinho@gmail.com>
To: caml-list@yquem.inria.fr
Subject: Warning on home-made functions dealing with UTF-8.
Date: Mon, 15 Oct 2007 22:35:10 +0200
Message-ID: <20071015203509.GA5212@localhost>
In-Reply-To: <1192114097.6184.7.camel@rosella.wigram>

On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote:
> On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote:
> > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> > > You can't: Camomile is massive for a reason.. the problem it
> > > aims to solve is complex and hard to do efficiently without
> > > a large set of specialised functions.
> > 
> > You are assuming that i want efficiency where i want to print few
> > unicode string in an ui here and there. I *DON'T* want to be exposed to
> > full unicode, i need something like 1/100 of camomile library.
> 
> In that case, you can use an int Array.t for Unicode provided 
> it is only 31 bit OR you have a 64 bit machine. These routines 
> should help converting to and from UTF-8:
> [...]

Just in case someone wants to use this parse_utf8:
be aware that, depending on how much you trust your input,
doing so may be strongly discouraged.
Indeed, this code does not comprehensively check for invalid sequences.

For example, it accepts characters in an overlong form [1]:

# let mk = List.fold_left
    (fun acc c -> acc ^ String.make 1 (Char.chr c)) "";;
val mk : int list -> string = <fun>
# let p l = parse_utf8 (mk l) 0;;
val p : int list -> int * int = <fun>

(* Unicode 0 encoded in an overlong UTF-8 form *)
# p [0b11_000000; 0b10_000000];;
- : int * int = (0, 2)

Nor does it check for invalid trailing bytes:

(* Unicode 64 (@) with an invalid trailing byte,
 * which happens to be a zero *)
# p [0b11_000001; 0b00_000000];;
- : int * int = (64, 2)
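
For what it's worth, here is a rough sketch of what stricter checking
could look like. It is not the code quoted above: decode_strict is a
name made up for this example, and I am only assuming the same
(code point, next index) result shape that the transcripts show,
wrapped in an option so that malformed input yields None.
It handles sequences of at most 4 bytes and code points up to 0x10FFFF.

```ocaml
(* Same helper as in the transcript: build a string from byte values. *)
let mk = List.fold_left
    (fun acc c -> acc ^ String.make 1 (Char.chr c)) ""

(* Hypothetical strict decoder (illustration only):
   [decode_strict s i] is [Some (code_point, next_index)],
   or [None] if the bytes starting at [i] are not well-formed UTF-8. *)
let decode_strict (s : string) (i : int) : (int * int) option =
  let len = String.length s in
  if i < 0 || i >= len then None
  else
    let b0 = Char.code s.[i] in
    (* From the lead byte: number of trailing bytes, payload bits of the
       lead byte, and the smallest code point allowed for that length. *)
    let shape =
      if b0 < 0x80 then Some (0, b0, 0)
      else if b0 land 0xE0 = 0xC0 then Some (1, b0 land 0x1F, 0x80)
      else if b0 land 0xF0 = 0xE0 then Some (2, b0 land 0x0F, 0x800)
      else if b0 land 0xF8 = 0xF0 then Some (3, b0 land 0x07, 0x10000)
      else None (* stray trailing byte, or a 5/6-byte lead byte *)
    in
    match shape with
    | None -> None
    | Some (n, lead_bits, min_cp) ->
      if i + n >= len then None (* truncated sequence *)
      else
        let rec go k acc =
          if k > n then Some acc
          else
            let b = Char.code s.[i + k] in
            if b land 0xC0 <> 0x80 then None (* not of the form 10xxxxxx *)
            else go (k + 1) ((acc lsl 6) lor (b land 0x3F))
        in
        match go 1 lead_bits with
        | None -> None
        | Some cp ->
          if cp < min_cp then None (* overlong form *)
          else if cp > 0x10FFFF then None (* beyond the Unicode range *)
          else if cp >= 0xD800 && cp <= 0xDFFF then None (* surrogate *)
          else Some (cp, i + n + 1)

(* The two inputs from this message are now rejected:
     decode_strict (mk [0b11_000000; 0b10_000000]) 0  is  None
     decode_strict (mk [0b11_000001; 0b00_000000]) 0  is  None
   while a well-formed sequence still decodes:
     decode_strict (mk [0xC3; 0xA9]) 0                is  Some (0xE9, 2) *)
```

Comparing the decoded value against the minimum for its length is only
one way to catch overlong forms; rejecting the lead bytes 0xC0 and 0xC1
outright, plus range checks on the second byte, would work as well.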

Besides, a Unicode code point "now" needs only 21 bits,
and "therefore" a UTF-8 encoded character fits in at most 4 bytes,
not the 6 that the code handles.
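
The arithmetic behind those figures, as a small sketch (payload_bits
and utf8_length are names invented for this illustration, and the
code point is assumed to be non-negative and within the Unicode range):

```ocaml
(* Payload bits carried by an n-byte UTF-8 sequence, 1 <= n <= 6:
   a single byte carries 7 bits; otherwise the lead byte keeps
   (7 - n) bits and each of the (n - 1) trailing bytes adds 6. *)
let payload_bits n =
  if n = 1 then 7 else (7 - n) + 6 * (n - 1)

(* Smallest sequence length able to hold code point [cp]
   (assumes 0 <= cp <= 0x10FFFF). *)
let utf8_length cp =
  let rec find n = if cp < 1 lsl payload_bits n then n else find (n + 1) in
  find 1

(* payload_bits 4 = 21 and payload_bits 6 = 31;
   utf8_length 0x10FFFF = 4, the largest Unicode code point. *)
```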

[1] http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations

