From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 145447FB38 for ; Tue, 23 Dec 2014 19:58:21 +0100 (CET) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of daniel.buenzli@erratique.ch) identity=pra; client-ip=74.55.86.74; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="daniel.buenzli@erratique.ch"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of daniel.buenzli@erratique.ch) identity=mailfrom; client-ip=74.55.86.74; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="daniel.buenzli@erratique.ch"; x-conformance=sidf_compatible Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@smtp.webfaction.com) identity=helo; client-ip=74.55.86.74; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="postmaster@smtp.webfaction.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AiQGAMC6mVRKN1ZKm2dsb2JhbABbg1hYAYFhgSLDPocmAQEBAQERAQEBAQEGCwsJFC6ENm8cAiYCSRYbAYgjBAmad40kj0SWHYEhjW0CEQGDPy6BEwWRU4ZBMIQ4D4U7glaDOYFzgh5uAYECCReBIAEBAQ X-IPAS-Result: AiQGAMC6mVRKN1ZKm2dsb2JhbABbg1hYAYFhgSLDPocmAQEBAQERAQEBAQEGCwsJFC6ENm8cAiYCSRYbAYgjBAmad40kj0SWHYEhjW0CEQGDPy6BEwWRU4ZBMIQ4D4U7glaDOYFzgh5uAYECCReBIAEBAQ X-IronPort-AV: E=Sophos;i="5.07,633,1413237600"; d="scan'208";a="114693756" Received: from mail6.webfaction.com (HELO smtp.webfaction.com) ([74.55.86.74]) by mail2-smtp-roc.national.inria.fr with ESMTP; 23 Dec 2014 19:58:19 +0100 Received: from [85.218.48.179] (85-218-48-179.dclient.lsne.ch [85.218.48.179]) by smtp.webfaction.com (Postfix) with ESMTP id BEFA220AA19F for ; Tue, 23 Dec 2014 18:58:16 +0000 (UTC) Date: Tue, 23 Dec 2014 19:58:14 +0100 From: =?utf-8?Q?Daniel_B=C3=BCnzli?= To: Caml-list Message-ID: X-Mailer: sparrow 1.6.4 (build 1178) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Subject: [Caml-list] [ANN] Uuseg 0.8.0 Unicode Text Segmentation Hello, I'd like to announce the first release of Uuseg which should be available shortly in OPAM. Here's the blurb: Uuseg is an OCaml library for segmenting Unicode text. It implements the locale independent Unicode text segmentation algorithms [1] to detect grapheme cluster, word and sentence boundaries and the Unicode line breaking algorithm [2] to detect line break opportunities. The library is independent from any IO mechanism or Unicode text data structure and it can process text without a complete in-memory representation. Uuseg depends on Uucp and optionally on Uutf for support on OCaml UTF-X encoded strings. It is distributed under the BSD3 license. [1]: http://www.unicode.org/reports/tr29/ [2]: http://www.unicode.org/reports/tr14/ Home page: http://erratique.ch/software/uuseg API Docs: http://erratique.ch/software/uuseg/doc This library is useful if you need to find in Unicode data the user-perceived characters -- grapheme clusters in Unicode terminology -- e.g. for breaking strings, cursor movement, text selection, backspace deletion, or fixed-width layouts of Unicode data (see the end of this email for applications with Format). It can also be used to break text into words and sentences or to detect line break=20=20 opportunities, see again the end of this email. Note that these algorithms are locale-independent and may not work well on all the scripts defined in Unicode or on the actual language you are dealing with. Do not take these outputs as a silver bullet and refer to the standards above for further information about their limitations and how to alleviate them. For that reason the API offers a very crude and low level API --- basically a state machine --- to define your own segmenters so that the same high-level API can be reused with customizations. Because of time constraints the current implementation of the algorithms is not particulary clever (and sometimes ugly). They=20=20 were done manually in an ad-hoc manner. There's room for improvement,=20=20 e.g. to devise more mechanical procedures to get from the rules to implementation -- this would also allow to tap into the Unicode's Common Locale Data Repository that does include locale dependent tailoring for segmentation in a (supposedly) machine readable formats [3]. If you are interested on working on (or in funding) this, get in touch. I do not expect the API to change much, but as usual things may change=20=20 before a 1.0.0. Happy grapheme clustering, Daniel [3] http://www.unicode.org/reports/tr35/tr35-general.html#Segmentations # Uuseg and Format Uuseg can be used to improve OCaml's Format's capabilities on Unicode's alphabetic scripts and symbols. There are a few utility functions in the optional `Uuseg_string` library (you'll need to depend on `Uutf`). The most basic function `Uuseg_string.pp_utf_8` simply instructs the Format engine to treat grapheme clusters as a single character (see below) rather than laying out their byte or decomposed representation like a "%s" format would do: # #require "uuseg.string";; (* Pre-composed =C3=A9 (U+00E9) *) # (* broken *) for i =3D 0 to 76 do Format.printf "%s@," "\xC3\xA9" done;; =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9 =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9 =C3=A9 # for i =3D 0 to 76 do Format.printf "%a@," Uuseg_string.pp_utf_8 "=C3=A9" = done;; =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9 (* Decomposed =C3=A9: e + =C2=B4 (U+0065, U+0301) ) *) # (* broken *) for i =3D 0 to 76 do Format.printf "%s@," "\x65\xCC\x81" don= e;; =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9 =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9 =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9 =C3=A9=C3=A9 # for i =3D 0 to 76 do Format.printf "%a@," Uuseg_string.pp_utf_8 "\x65\xCC= \x81" done;; =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3= =A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9=C3=A9= =C3=A9=C3=A9 - : unit =3D () Then we have `Uuseg_string.pp_utf_8_text` which instructs Format about break opportunities and mandatory line break according to the Unicode line breaking algorithm. For example layout some Georgian, the first paragraph of OCaml's wikipedia Georgian page (which btw could benefit of some update if someone on the list has the ability) : http://ka.wikipedia.org/wiki/%E1%83%9D%E1%83%91%E1%83%98%E1%83%94%E1%83%A5%= E1%83%A2%E1%83%A3%E1%83%A0%E1%83%98_%E1%83%99%E1%83%90%E1%83%9B%E1%83%9A%E1= %83%98 # let g =3D "=E1=83=9D=E1=83=91=E1=83=98=E1=83=94=E1=83=A5=E1=83=A2=E1=83=A3=E1=83= =A0=E1=83=98 =E1=83=99=E1=83=90=E1=83=9B=E1=83=9A, =E1=83=9D=E1=83=99=E1=83= =90=E1=83=9B=E1=83=9A (Objective Caml, Ocaml) =E1=83=AC=E1=83=90=E1=83=A0= =E1=83=9B=E1=83=9D=E1=83=90=E1=83=93=E1=83=92=E1=83=94=E1=83=9C=E1=83=A1 = =E1=83=A4=E1=83=A3=E1=83=9C=E1=83=A5=E1=83=AA=E1=83=98=E1=83=A3=E1=83=A0 \ =E1=83=9E=E1=83=A0=E1=83=9D=E1=83=92=E1=83=A0=E1=83=90=E1=83=9B=E1=83= =98=E1=83=A0=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 =E1=83=94=E1=83=9C=E1=83= =90=E1=83=A1, =E1=83=A0=E1=83=9D=E1=83=9B=E1=83=94=E1=83=9A=E1=83=98=E1=83= =AA =E1=83=90=E1=83=A0=E1=83=98=E1=83=A1 =E1=83=A8=E1=83=94=E1=83=A5=E1=83= =9B=E1=83=9C=E1=83=98=E1=83=9A=E1=83=98 =E1=83=9D=E1=83=91=E1=83=98=E1=83= =94=E1=83=A5=E1=83=A2=E1=83=96=E1=83=94 =E1=83=9D=E1=83=A0=E1=83=98=E1=83= =94=E1=83=9C=E1=83=A2=E1=83=98=E1=83=A0=E1=83=94=E1=83=91=E1=83=A3=E1=83=9A= =E1=83=98 \ =E1=83=9E=E1=83=A0=E1=83=9D=E1=83=92=E1=83=A0=E1=83=90=E1=83=9B=E1=83= =98=E1=83=A0=E1=83=94=E1=83=91=E1=83=98=E1=83=A1=E1=83=97=E1=83=95=E1=83=98= =E1=83=A1. =E1=83=98=E1=83=A1 =E1=83=98=E1=83=A7=E1=83=9D =E1=83=93=E1=83= =90=E1=83=AC=E1=83=94=E1=83=A0=E1=83=98=E1=83=9A=E1=83=98 =E1=83=AE=E1=83= =90=E1=83=95=E1=83=98=E1=83=94=E1=83=A0 =E1=83=9A=E1=83=94=E1=83=A0=E1=83= =98=E1=83=9D, =E1=83=AF=E1=83=94=E1=83=A0=E1=83=9D=E1=83=9B =E1=83=95=E1=83= =9D=E1=83=98=E1=83=9A=E1=83=9D=E1=83=9C, \ =E1=83=93=E1=83=90=E1=83=9B=E1=83=98=E1=83=94=E1=83=9C =E1=83=93=E1=83= =9D=E1=83=9A=E1=83=98=E1=83=92=E1=83=94=E1=83=96 =E1=83=93=E1=83=90 =E1=83= =A1=E1=83=AE=E1=83=95=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 =E1=83=9B=E1=83= =98=E1=83=94=E1=83=A0 1996 =E1=83=AC=E1=83=94=E1=83=9A=E1=83=A1. =E1=83=94= =E1=83=9C=E1=83=98=E1=83=A1 =E1=83=A8=E1=83=94=E1=83=A5=E1=83=9B=E1=83=9C= =E1=83=98=E1=83=A1, =E1=83=93=E1=83=90=E1=83=9B=E1=83=A3=E1=83=A8=E1=83=90= =E1=83=95=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 \ =E1=83=93=E1=83=90 =E1=83=92=E1=83=90=E1=83=9C=E1=83=90=E1=83=AE=E1=83= =9A=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 =E1=83=9E=E1=83=A0=E1=83=9D=E1=83= =94=E1=83=A5=E1=83=A2=E1=83=98 =E1=83=A3=E1=83=9B=E1=83=94=E1=83=A2=E1=83= =94=E1=83=A1=E1=83=90=E1=83=93 =E1=83=98=E1=83=9C=E1=83=A0=E1=83=98=E1=83= =90=E1=83=A8=E1=83=98 =E1=83=AE=E1=83=9D=E1=83=A0=E1=83=AA=E1=83=98=E1=83= =94=E1=83=9A=E1=83=93=E1=83=94=E1=83=91=E1=83=90.";; # Format.set_margin 25;; # Format.printf "@[%a@]" Uuseg_string.pp_utf_8_text g;; =E1=83=9D=E1=83=91=E1=83=98=E1=83=94=E1=83=A5=E1=83=A2=E1=83=A3=E1=83=A0=E1= =83=98 =E1=83=99=E1=83=90=E1=83=9B=E1=83=9A, =E1=83=9D=E1=83=99=E1=83=90=E1= =83=9B=E1=83=9A (Objective Caml, Ocaml) =E1=83=AC=E1=83=90=E1=83=A0=E1=83=9B=E1=83=9D=E1=83=90=E1=83=93=E1=83=92=E1= =83=94=E1=83=9C=E1=83=A1 =E1=83=A4=E1=83=A3=E1=83=9C=E1=83=A5=E1=83=AA=E1= =83=98=E1=83=A3=E1=83=A0 =E1=83=9E=E1=83=A0=E1=83=9D=E1=83=92=E1=83=A0=E1=83=90=E1=83=9B=E1=83=98=E1= =83=A0=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 =E1=83=94=E1=83=9C=E1=83=90=E1= =83=A1, =E1=83=A0=E1=83=9D=E1=83=9B=E1=83=94=E1=83=9A=E1=83=98=E1=83=AA =E1=83=90= =E1=83=A0=E1=83=98=E1=83=A1 =E1=83=A8=E1=83=94=E1=83=A5=E1=83=9B=E1=83=9C= =E1=83=98=E1=83=9A=E1=83=98 =E1=83=9D=E1=83=91=E1=83=98=E1=83=94=E1=83=A5=E1=83=A2=E1=83=96=E1=83=94 = =E1=83=9D=E1=83=A0=E1=83=98=E1=83=94=E1=83=9C=E1=83=A2=E1=83=98=E1=83=A0=E1= =83=94=E1=83=91=E1=83=A3=E1=83=9A=E1=83=98 =E1=83=9E=E1=83=A0=E1=83=9D=E1=83=92=E1=83=A0=E1=83=90=E1=83=9B=E1=83=98=E1= =83=A0=E1=83=94=E1=83=91=E1=83=98=E1=83=A1=E1=83=97=E1=83=95=E1=83=98=E1=83= =A1. =E1=83=98=E1=83=A1 =E1=83=98=E1=83=A7=E1=83=9D =E1=83=93=E1=83=90=E1=83=AC=E1=83=94=E1=83=A0= =E1=83=98=E1=83=9A=E1=83=98 =E1=83=AE=E1=83=90=E1=83=95=E1=83=98=E1=83=94= =E1=83=A0 =E1=83=9A=E1=83=94=E1=83=A0=E1=83=98=E1=83=9D, =E1=83=AF=E1=83=94=E1=83=A0= =E1=83=9D=E1=83=9B =E1=83=95=E1=83=9D=E1=83=98=E1=83=9A=E1=83=9D=E1=83=9C, =E1=83=93=E1=83=90=E1=83=9B=E1=83=98=E1=83=94=E1=83=9C =E1=83=93=E1=83=9D= =E1=83=9A=E1=83=98=E1=83=92=E1=83=94=E1=83=96 =E1=83=93=E1=83=90 =E1=83=A1=E1=83=AE=E1=83=95=E1=83=94=E1=83=91=E1=83=98=E1=83=A1 =E1=83=9B= =E1=83=98=E1=83=94=E1=83=A0 1996 =E1=83=AC=E1=83=94=E1=83=9A=E1=83=A1. =E1=83=94=E1=83=9C=E1=83=98=E1=83=A1 =E1=83=A8=E1=83=94=E1=83=A5=E1=83=9B= =E1=83=9C=E1=83=98=E1=83=A1, =E1=83=93=E1=83=90=E1=83=9B=E1=83=A3=E1=83=A8=E1=83=90=E1=83=95=E1=83=94=E1= =83=91=E1=83=98=E1=83=A1 =E1=83=93=E1=83=90 =E1=83=92=E1=83=90=E1=83=9C=E1=83=90=E1=83=AE=E1=83=9A=E1=83=94=E1=83=91=E1= =83=98=E1=83=A1 =E1=83=9E=E1=83=A0=E1=83=9D=E1=83=94=E1=83=A5=E1=83=A2=E1= =83=98 =E1=83=A3=E1=83=9B=E1=83=94=E1=83=A2=E1=83=94=E1=83=A1=E1=83=90=E1=83=93 = =E1=83=98=E1=83=9C=E1=83=A0=E1=83=98=E1=83=90=E1=83=A8=E1=83=98 =E1=83=AE=E1=83=9D=E1=83=A0=E1=83=AA=E1=83=98=E1=83=94=E1=83=9A=E1=83=93=E1= =83=94=E1=83=91=E1=83=90. Soft hyphens are handled by the line breaking algorithm however: 1. Your rendering software may sadly print them which defeats the purpose (e.g. macosx's terminal). 2. It's not possible to replace the soft hyphen by a hard one if the line gets broken at that point, as there's no provision for this in Form= at (which would also be useful to pp outputs that have line continuation characters). Though I guess `Format.pp_set_formatter_output_functions` c= ould be investigated for that. # Format.set_margin 10;; # let h =3D "hy\xC2\xADphen\xC2\xADat\xC2\xADed";; # Format.printf "@[%a@]" Uuseg_string.pp_utf_8_text h;; hy=C2=ADphen=C2=AD at=C2=ADed