From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: * X-Spam-Status: No, score=1.0 required=5.0 tests=AWL,SPF_NEUTRAL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id F189CBC69 for ; Tue, 16 Oct 2007 20:45:09 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAEyhFEfAXQImh2dsb2JhbACOTgEBAQgKKQ X-IronPort-AV: E=Sophos;i="4.21,284,1188770400"; d="scan'208";a="3116613" Received: from discorde.inria.fr ([192.93.2.38]) by mail2-smtp-roc.national.inria.fr with ESMTP; 16 Oct 2007 20:45:09 +0200 Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l9GIj9VW028208 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=OK) for ; Tue, 16 Oct 2007 20:45:09 +0200 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAKSjFEdA6ba8mGdsb2JhbACOTgEBAQEHBAQTGA X-IronPort-AV: E=Sophos;i="4.21,284,1188770400"; d="scan'208";a="4610035" Received: from nf-out-0910.google.com ([64.233.182.188]) by mail3-smtp-sop.national.inria.fr with ESMTP; 16 Oct 2007 20:45:08 +0200 Received: by nf-out-0910.google.com with SMTP id g13so1800940nfb for ; Tue, 16 Oct 2007 11:45:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:received:date:to:subject:message-id:mail-followup-to:references:mime-version:content-type:content-disposition:in-reply-to:user-agent:from; bh=goZkJCPc4tN7nGtrZKbnCzuH/IQYW3pY8tMuN1OG6d8=; b=YXl/PIV1i9GVgMNSZuoIV6EMHjCGBObPehCSBQOAAqUApMV+koDQoX792LaSUM8kma38Shm1NBZvjbLQsEtu2d0P4tIV3kCCIlAuUT6T05CHtOwelVJPShm2w1jHFvd4IdON4b3hoT0iuCbrI8VQF023F82fFSkuqNm7HdVAc4A= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:date:to:subject:message-id:mail-followup-to:references:mime-version:content-type:content-disposition:in-reply-to:user-agent:from; b=jpx+dyMBhSrHEkG8Pz5uLUfMZWrKIcCQUOPD60163pSFpgV34iR0h1swIeguIjlfxu0goFt76E7ZT3gxXFsWxZghj3ma4cnE7cSd2TDlvrgMT+Vs7ydZidrzPl6zrVPwtlKEUsELRLUCJGJnOwMDaPBL+5aJ1TUbvkETcVpe07M= Received: by 10.78.168.1 with SMTP id q1mr5252875hue.1192560306906; Tue, 16 Oct 2007 11:45:06 -0700 (PDT) Received: from localhost ( [83.201.28.205]) by mx.google.com with ESMTPS id 2sm267498nfv.2007.10.16.11.45.00 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 16 Oct 2007 11:45:04 -0700 (PDT) Received: by localhost (sSMTP sendmail emulation); Tue, 16 Oct 2007 20:46:21 +0200 Date: Tue, 16 Oct 2007 20:46:21 +0200 To: caml-list@inria.fr Subject: Re: Warning on home-made functions dealing with UTF-8. Message-ID: <20071016184621.GA12628@localhost> Mail-Followup-To: Julien Moutinho , caml-list@inria.fr References: <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com> <20071009165520.GA27507@snarc.org> <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com> <20071009195119.GA29263@snarc.org> <875c7e070710091504j203ff78y8703fc5c22086671@mail.gmail.com> <20071011130354.GA7016@snarc.org> <1192110864.6045.12.camel@rosella.wigram> <20071011142141.GA8001@snarc.org> <1192114097.6184.7.camel@rosella.wigram> <20071015203509.GA5212@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20071015203509.GA5212@localhost> User-Agent: Mutt/1.5.16 (2007-06-11) From: Julien Moutinho X-Miltered: at discorde with ID 471506B5.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.01; home-made:01 byte:01 decr:01 wikipedia:01 wiki:01 byte:01 boo:98 boo:98 11111:98 111111:98 111111:98 ffff:98 ffff:98 char:01 char:01 Here, I have reused some old code of mine to secure and extend J.Skaller's: unicode_of_utf8 ~ parse_utf8 utf8_of_unicode ~ utf8_of_int May it help, and may it not be too bugged. exception Bad_utf8 of string * (string * int * int * int) (* raised with an error description and its location: * bytes * start (0 < start <= String.length bytes) * size (0 < size <= String.length bytes) * position (0 <= position <= size) *) exception Insufficient of int (* raised when more bytes are needed. * The absolute value of the integer is the minimal amount of bytes needed. * A positive sign means that they have to be appended. * A negative sign means that they have to be prepended. *) let in_bounds ~(size: int) ~(pos: int) = if size <> 0 then begin if pos < 0 then begin let i = size - ((- pos) mod size) in if i = size then 0 else i end else (pos mod size) end else 0 let position__char_size__offset (bytes: string) ?(start = 0) ?(size = String.length bytes) ~(pos: int) : int * int * int = if size <= 0 then (0, 0, 0) else begin let pos = in_bounds ~size ~pos in let char_pos = start + pos in let char_start = ref char_pos in let on_tail = ref true in let loc = (bytes, start, size, pos) in (* go backward to find a head *) while !on_tail do if char_pos - !char_start > 3 then raise (Bad_utf8 ("cannot find a head nearby", loc)) else if !char_start < start then raise (Insufficient (-1)) else begin let cod = Char.code bytes.[!char_start] in if (cod land 0b1100_0000) = 0b1000_0000 (* on a trailing byte *) then decr char_start else on_tail := false end done; let char_start = !char_start in (* decode the head *) let head = Char.code bytes.[char_start] in let overlong boo = (* check for overlong forms (when a character uses more trailing bytes than needed), * see http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations *) if boo then raise (Bad_utf8 ("overlong form", loc)) in let may_be_overlong = ref false in let char_size = (* get the size of the character *) (* 0zzzzzzz -> 0zzzzzzz = 7 bits *) if (head land 0b1_0000000) = 0b0_0000000 then 1 (* 110YYYYy 10zzzzzz -> 00000yyy yyzzzzzz = 11 bits *) else if (head land 0b111_00000) = 0b110_00000 then (overlong ((head land 0b000_11110) = 0); 2) (* 1110XXXX 10Yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz = 16 bits *) else if (head land 0b1111_0000) = 0b1110_0000 then (may_be_overlong := ((head land 0b0000_1111) = 0); 3) (* 11110WWW 10XXxxxx 10yyyyyy 10zzzzzz -> 000wwwxx xxxxyyyy yyzzzzzz = 21 bits *) else if (head land 0b1111_1000) = 0b1111_0000 then (may_be_overlong := ((head land 0b00000_111) = 0); 4) (* 4 bytes is the maximun size of an UTF-8 character by now *) else raise (Bad_utf8 ("invalid head", loc)) in (* decode the tail *) let off = ref (char_start + 1) in let t_end = start + size in let char_end = char_start + char_size in let max_off = min char_end t_end in (* check whether the trailing bytes of a character * are of the form 0b10_xxxxxx *) while !off < max_off do let cod = (Char.code bytes.[!off]) in if (cod land 0b11_000000) <> 0b10_000000 then raise (Bad_utf8 ("invalid tail", loc)); incr off done; (* complete the overlong check *) if max_off >= char_start + 1 (* if there is a second byte *) && !may_be_overlong then overlong ( (char_size = 3 && ((Char.code bytes.[char_start + 1]) land 0b00_100000) = 0) || (char_size = 4 && ((Char.code bytes.[char_start + 1]) land 0b00_110000) = 0) ); (* check the tail length *) if char_end > t_end then raise (Insufficient (char_end - (char_pos + 1))); (pos, char_size, char_pos - char_start) end let unicode_of_utf8 (bytes: string) ?(start = 0) ?(size = String.length bytes) (pos: int) : int * int = let pos, char_size, offset = position__char_size__offset bytes ~start ~size ~pos in let char_start = pos - offset in let unicode = match char_size with | 1 -> (* 0zzzzzzz -> 0zzzzzzz *) Char.code bytes.[char_start] | 2 -> (* 110yyyyy 10zzzzzz -> 00000yyy yyzzzzzz *) let cod0 = Char.code bytes.[char_start] in let cod1 = Char.code bytes.[char_start + 1] in ((cod0 land 0b000_11111) lsl 6) lor (cod1 land 0b00_111111) | 3 -> (* 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz *) let cod0 = Char.code bytes.[char_start] in let cod1 = Char.code bytes.[char_start + 1] in let cod2 = Char.code bytes.[char_start + 2] in ((cod0 land 0b0000_1111) lsl 12) lor ((cod1 land 0b00_111111) lsl 6) lor (cod2 land 0b00_111111) | 4 -> (* 11110www 10xxxxxx 10yyyyyy 10zzzzzz -> 000wwwxx xxxxyyyy yyzzzzzz *) let cod0 = Char.code bytes.[char_start] in let cod1 = Char.code bytes.[char_start + 1] in let cod2 = Char.code bytes.[char_start + 2] in let cod3 = Char.code bytes.[char_start + 3] in ((cod0 land 0b00000_111) lsl 18) lor ((cod1 land 0b00_111111) lsl 12) lor ((cod2 land 0b00_111111) lsl 6) lor (cod3 land 0b00_111111) | _ -> assert false in match unicode with | cod when cod >= 0xD800 && cod <= 0xDFFF -> (* The definition of UTF-8 prohibits encoding character numbers between * U+D800 and U+DFFF, which are reserved for use with the UTF-16 * encoding form (as surrogate pairs) and do not directly represent characters. *) raise (Bad_utf8 ("prohibited code point", (bytes, start, size, pos))) | cod when cod > 0x10FFFF -> raise (Bad_utf8 ("invalid code point", (bytes, start, size, pos))) | _ -> (unicode, (char_size - offset)) exception Bad_unicode of string * int (* raised with an error description and an integer * which is either a prohibited or an invalid unicode code point *) let utf8_of_unicode : int -> string = function | cod when cod >= 0x00 && cod <= 0x7F -> (* 0zzzzzzz -> 0zzzzzzz *) String.make 1 (Char.chr cod) | cod when cod <= 0x07FF -> (* 00000yyy yyzzzzzz -> 110yyyyy 10zzzzzz *) let str = String.create 2 in str.[0] <- Char.chr (0b110_00000 lor (cod lsr 6)); str.[1] <- Char.chr (0b10_000000 lor (cod land 0b00_111111)); str | cod when cod >= 0xD800 && cod <= 0xDFFF -> (* The definition of UTF-8 prohibits encoding character numbers between * U+D800 and U+DFFF, which are reserved for use with the UTF-16 * encoding form (as surrogate pairs) and do not directly represent characters. *) raise (Bad_unicode ("prohibited code point", cod)) | cod when cod <= 0xFFFF -> (* xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz *) let str = String.create 3 in str.[0] <- Char.chr (0b1110_0000 lor (cod lsr 12)); str.[1] <- Char.chr (0b10_000000 lor ((cod lsr 6) land 0b00_111111)); str.[2] <- Char.chr (0b10_000000 lor ( cod land 0b00_111111)); str | cod when cod <= 0x10FFFF -> (* 000wwwxx xxxxyyyy yyzzzzzz -> 11110www 10xxxxxx 10yyyyyy 10zzzzzz *) let str = String.create 4 in str.[0] <- Char.chr (0b11110_000 lor ( cod lsr 18)); str.[1] <- Char.chr (0b10_000000 lor ((cod lsr 12) land 0b00_111111)); str.[2] <- Char.chr (0b10_000000 lor ((cod lsr 6) land 0b00_111111)); str.[3] <- Char.chr (0b10_000000 lor ( cod land 0b00_111111)); str | cod -> raise (Bad_unicode ("invalid code point", cod))