From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.2 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by yquem.inria.fr (Postfix) with ESMTP id 06642BC69 for ; Tue, 16 Oct 2007 02:00:29 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAMiaE0fLENaMn2dsb2JhbACOSAEBAQEHBAYJCBiBJw X-IronPort-AV: E=Sophos;i="4.21,279,1188770400"; d="scan'208";a="18048616" Received: from ipmail01.adl2.internode.on.net ([203.16.214.140]) by mail4-smtp-sop.national.inria.fr with ESMTP; 16 Oct 2007 02:00:27 +0200 X-IronPort-AV: E=Sophos;i="4.21,279,1188743400"; d="scan'208";a="211114956" Received: from ppp121-44-36-108.lns10.syd7.internode.on.net (HELO [192.168.1.201]) ([121.44.36.108]) by ipmail01.adl2.internode.on.net with ESMTP; 16 Oct 2007 09:21:23 +0930 Subject: Re: [Caml-list] Warning on home-made functions dealing with UTF-8. From: skaller To: Julien Moutinho Cc: caml-list@yquem.inria.fr In-Reply-To: <20071015203509.GA5212@localhost> References: <20071009155702.GB26282@snarc.org> <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com> <20071009165520.GA27507@snarc.org> <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com> <20071009195119.GA29263@snarc.org> <875c7e070710091504j203ff78y8703fc5c22086671@mail.gmail.com> <20071011130354.GA7016@snarc.org> <1192110864.6045.12.camel@rosella.wigram> <20071011142141.GA8001@snarc.org> <1192114097.6184.7.camel@rosella.wigram> <20071015203509.GA5212@localhost> Content-Type: text/plain Date: Tue, 16 Oct 2007 09:51:16 +1000 Message-Id: <1192492276.22001.27.camel@rosella.wigram> Mime-Version: 1.0 X-Mailer: Evolution 2.12.0 Content-Transfer-Encoding: 7bit X-Spam: no; 0.00; home-made:01 0200,:01 0200,:01 camomile:01 camomile:01 sourceforge:01 wrote:01 wrote:01 readable:01 caml-list:01 functions:01 functions:01 int:01 parse:02 string:02 On Mon, 2007-10-15 at 22:35 +0200, Julien Moutinho wrote: > On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote: > > On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote: > > > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote: > > > > You can't: Camomile is massive for a reason.. the problem it > > > > aims to solve is complex and hard to do efficiently without > > > > a large set of specialised functions. > > > > > > You are assuming that i want efficiency where i want to print few > > > unicode string in an ui here and there. I *DON'T* want to be exposed to > > > full unicode, i need something like 1/100 of camomile library. > > > > In that case, you can use an int Array.t for Unicode provided > > it is only 31 bit OR you have a 64 bit machine. These routines > > should help converting to and from UTF-8: > > [...] > > Just in case someone would want to use this parse_utf8, > be aware that depending on the trust you have in your input, > it may be sorely discouraged to do so. > Indeed, this code does not check comprehensively for invalid characters. That is correct. It is specifically designed NOT to do so. The last thing you want in 99% of codec use is to abort due to an error. Try switching codecs on Firefox.. do you really want to abort if you have bad input or the wrong codec? UTF-8 is primarily used for Unicode which is human readable text. Errors and faults in the text aren't important most of the time. It has nothing to do with a 'trusted' source. It has to do with the fact that the human text is an approximation in the first place. -- John Skaller Felix, successor to C++: http://felix.sf.net