From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.2 required=5.0 tests=AWL,DNS_FROM_RFC_ABUSE autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by yquem.inria.fr (Postfix) with ESMTP id 921EBBC69 for ; Wed, 17 Oct 2007 04:56:18 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAAHwWFUfAXQImh2dsb2JhbACOTgIBCAopgSc X-IronPort-AV: E=Sophos;i="4.21,286,1188770400"; d="scan'208";a="3133060" Received: from discorde.inria.fr ([192.93.2.38]) by mail2-smtp-roc.national.inria.fr with ESMTP; 17 Oct 2007 04:56:18 +0200 Received: from mail4-relais-sop.national.inria.fr (mail4-relais-sop.national.inria.fr [192.134.164.105]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l9H2uHl4012261 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=OK) for ; Wed, 17 Oct 2007 04:56:18 +0200 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AgAAANoWFUfLENaMn2dsb2JhbACOTgIBAQcEBgkIGIEn X-IronPort-AV: E=Sophos;i="4.21,286,1188770400"; d="scan'208";a="18119464" Received: from ipmail01.adl2.internode.on.net ([203.16.214.140]) by mail4-smtp-sop.national.inria.fr with ESMTP; 17 Oct 2007 04:56:21 +0200 X-IronPort-AV: E=Sophos;i="4.21,286,1188743400"; d="scan'208";a="212142793" Received: from ppp121-44-36-108.lns10.syd7.internode.on.net (HELO [192.168.1.201]) ([121.44.36.108]) by ipmail01.adl2.internode.on.net with ESMTP; 17 Oct 2007 11:53:34 +0930 Subject: Re: [Caml-list] Re: Warning on home-made functions dealing with UTF-8. From: skaller To: Julien Moutinho Cc: caml-list@inria.fr In-Reply-To: <20071016184621.GA12628@localhost> References: <6f9f8f4a0710090942u2afe6c5erc2d5b11ecfff2253@mail.gmail.com> <20071009165520.GA27507@snarc.org> <6f9f8f4a0710091032r3dace80bi4f9b584ae5056675@mail.gmail.com> <20071009195119.GA29263@snarc.org> <875c7e070710091504j203ff78y8703fc5c22086671@mail.gmail.com> <20071011130354.GA7016@snarc.org> <1192110864.6045.12.camel@rosella.wigram> <20071011142141.GA8001@snarc.org> <1192114097.6184.7.camel@rosella.wigram> <20071015203509.GA5212@localhost> <20071016184621.GA12628@localhost> Content-Type: text/plain Date: Wed, 17 Oct 2007 12:23:33 +1000 Message-Id: <1192587813.31367.77.camel@rosella.wigram> Mime-Version: 1.0 X-Mailer: Evolution 2.12.0 Content-Transfer-Encoding: 7bit X-Miltered: at discorde with ID 471579D1.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; home-made:01 0200,:01 inserting:01 camomile:01 translating:01 'home:98 sourceforge:01 wrote:01 encode:01 encode:01 exception:01 caml-list:01 functions:01 exceptions:01 exceptions:01 On Tue, 2007-10-16 at 20:46 +0200, Julien Moutinho wrote: > Here, I have reused some old code of mine to secure and extend J.Skaller's: > unicode_of_utf8 ~ parse_utf8 > utf8_of_unicode ~ utf8_of_int > May it help, and may it not be too bugged. The UTF-8 to UCS4 string conversion probably needs an option NOT to throw exceptions. Almost all uses are not serviced well by throwing exceptions. * Exceptions are a very bad idea in the first place :) * An alternative strategy using a replacement code may be worth considering, possibly using an argument, and possibly by another technique such as a wrapper function which continues on after inserting the replacement. I think you will find that covering all the real uses isn't so easy .. one reason for a whole framework like Camomile. In particular, most Standards will say things like "behaviour is undefined if such and such" for example an invalid code. This does NOT mean using such a code is an error, it means the Standard leaves the behaviour open in such cases. In particular, open to vendor extensions. It's perfectly legitimate, for example, for an application to encode colour and font information in the 31-21=10 remaining bits**, but your codec will throw an exception here, instead of just translating the codes. What the application is doing isn't portable -- but that doesn't make it incorrect if the application has control over the context. In particular, the application can even define an extension to the Standard and send such codes over file systems and networks. Other applications may decide to support the extensions. In fact this is how Standards are made: the ISO mantra is that standards should encode existing practice, and that specifically implies ALLOWING practice outside the existing Standards. So my routine wasn't quite so 'home grown' .. I've actually been a participant in ISO Standardisation processes with a special National Body interest in I18n issues (since Australia is highly multi-cultural and has people speaking many languages). I18n is a real quagmire of complexity... for example there are no known commercial text rendering routines that actually comply with the Standard -- bidirectional rendering is extremely difficult to do efficiently and it isn't clear that the Standard requirements are all that useful if you happen to be mixing English, Arabic, and Chinese in the same document. ** there is a real use of the extra bits by some Egyptologists encoding hieroglyphics. -- John Skaller Felix, successor to C++: http://felix.sf.net