Subject: Re: [Caml-list] migrate from ocamllex to ulex
From: Gerd Stolpmann
To: Ruslan Kosolapov
Cc: caml-list@yquem.inria.fr
In-Reply-To: <87u08ribb9.fsf@kosolapov.plesk.ru>
Message-Id: <1145391612.15442.115.camel@localhost.localdomain>
Date: Tue, 18 Apr 2006 22:20:11 +0200

On Tuesday, 18 April 2006, at 14:37 +0700, Ruslan Kosolapov wrote:
> I want to use Polygen (http://polygen.org/web/), but this tool is not
> work with UTF-8 (if I try to use UTF-8 symbols in template, error
> "illegal character" appear).
>
> As far as I understand problem is ocamllex - if I use UTF-8 symbols in
> lexer.mll, ocamllex say to me "illegal character", so, I can't just
> modify lexer.mll.

Well, ocamllex just processes bytes. In order to scan UTF-8, you must
create a regular expression that matches the byte representation. I did
this with great success for PXP - but it is absolutely non-trivial.
Better go with ulex.

> So, I think I should modify Polygen to ulex using.
>
> I have no any OCaml expirience, so such task is hard for me.

Probably.

> I look for code examples or any detailed documentation which show me
> how I can migrate from ocamllex to ulex.

It is not that complicated. The main difference is not that ulex is
Unicode-based, but that ulex is a different kind of preprocessor. That
has consequences for how the preprocessor is invoked, and for the syntax
of the scanner.

ocamllex is a classical preprocessor that produces an intermediate file
which is then compiled. In contrast, ulex modifies the grammar of the
O'Caml language such that new constructs can be used. These constructs
are immediately mapped to the built-in elements of the language, so it
is still a preprocessor, but a much better integrated one.

In order to run ulex, I strongly recommend first installing findlib
(http://ocaml-programming.de/packages).
Then, do

  mv lexer.mll lexer.ml

- as ulex does not create intermediate files, there is no need for the
.mll extension. Compile with

  ocamlfind ocamlc -package ulex -syntax camlp4o <args>

or

  ocamlfind ocamlopt -package ulex -syntax camlp4o <args>

for the native-code compiler. <args> are the same arguments as for plain
ocamlc/ocamlopt. When linking the executable, also add the flag -linkpkg
to the compiler invocations. You can simply use these compiler commands
for all .ml and .mli files.

Of course, you must also modify lexer.ml. In principle, transform

  { <header> }
  rule <name> ... = parse
      <regexp> { <action> }
    | <regexp> { <action> }
    ...
  { <trailer> }

to:

  <header>
  let <name> ... = lexer
      <regexp> -> <action>
    | <regexp> -> <action>
    ...
  ;;
  <trailer>

This is the purely syntactic part of the transformation. Furthermore,
typing is a bit different. ocamllex uses the helper module Lexing. For
example, to get the phrase that was just scanned, you can use the
function call

  Lexing.lexeme lexbuf

within one of the <action>s. lexbuf is the buffer the lexer operates on.
ulex needs another type of buffer, suitable for Unicode, which the
module Ulexing provides. The corresponding call

  Ulexing.lexeme lexbuf

returns the phrase not as a string (O'Caml strings are simply sequences
of 8 bit characters), but as an array of integers. Use

  Ulexing.utf8_lexeme lexbuf

to get a string of UTF-8 bytes.

You will also see the different typing when you call the generated
lexers. For ocamllex, this is something like:

  let lexbuf = Lexing.from_string "Example string" in
  <name> lexbuf

(where <name> is the name of a lexer). For ulex, this is

  let lexbuf = Ulexing.from_utf8_string "Example string" in
  <name> lexbuf

Look into ulexing.mli; you can also read from other sources.

> Please help :)
>
>
> PS: I tryed to modify file lexer.ml (such file produced by ocamllex),
> but I don't know what exactly I should modify - lexer.ml is not
> human-readable.

Well, this is a finite automaton expressed as a lookup table. After the
NFA to DFA transformation step, it is practically impossible to
understand it.

Gerd

P.S. Maybe this is also interesting for you:
http://www.gerd-stolpmann.de/buero/service_ocaml.html.en

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------