caml-list - the Caml user's mailing list
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Ruslan Kosolapov <rkosolapov@swsoft.com>
Cc: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] migrate from ocamllex to ulex
Date: Tue, 18 Apr 2006 22:20:11 +0200
Message-ID: <1145391612.15442.115.camel@localhost.localdomain>
In-Reply-To: <87u08ribb9.fsf@kosolapov.plesk.ru>

Am Dienstag, den 18.04.2006, 14:37 +0700 schrieb Ruslan Kosolapov:
> I want to use Polygen (http://polygen.org/web/), but this tool is not
> work with UTF-8 (if I try to use UTF-8 symbols in template, error
> "illegal character" appear).
> 
> As far as I understand problem is ocamllex - if I use UTF-8 symbols in
> lexer.mll, ocamllex say to me "illegal character", so, I can't just
> modify lexer.mll.

Well, ocamllex just processes bytes. In order to scan UTF-8, you must
create a regular expression that matches the byte representation. I did
this with great success for PXP - but it is absolutely non-trivial.
Better go with ulex.
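
To give an idea of what "matching the byte representation" involves, here
is a rough ocamllex sketch. It is not taken from PXP; the file name and all
identifiers are made up for illustration, and the patterns are simplified
(they still accept some overlong and surrogate encodings that a strict
UTF-8 scanner would reject).

```ocaml
(* utf8.mll: hypothetical ocamllex input, for illustration only *)

let ascii  = ['\000'-'\127']            (* 0xxxxxxx *)
let cont   = ['\128'-'\191']            (* continuation byte 10xxxxxx *)
let utf8_2 = ['\194'-'\223'] cont       (* two-byte sequences *)
let utf8_3 = ['\224'-'\239'] cont cont  (* three-byte sequences *)
let utf8_4 = ['\240'-'\244'] cont cont cont
let utf8_char = ascii | utf8_2 | utf8_3 | utf8_4

(* Returns the bytes of the next UTF-8 encoded character, or None at
   the end of the input. *)
rule next_char = parse
  | utf8_char { Some (Lexing.lexeme lexbuf) }
  | eof       { None }
```

Every character class of the real grammar (letters, digits, and so on)
would have to be spelled out as byte ranges in this style, which is where
the effort goes.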

> So, I think I should modify Polygen to ulex using.
> 
> I have no any OCaml expirience, so such task is hard for me.

Probably.

> I look for code examples or any detailed documentation which show me
> how I can migrate from ocamllex to ulex.

It is not that complicated. The main difference is not that ulex is
Unicode-based, but that ulex is a different kind of preprocessor. That
has consequences for how the preprocessor is invoked, and for the syntax
of the scanner.

ocamllex is a classical preprocessor that produces an intermediate file
which is then compiled. In contrast, ulex extends the grammar of the
O'Caml language so that new constructs can be used. These constructs are
immediately mapped to the built-in elements of the language, so ulex is
still a preprocessor, but a much better integrated one.

In order to run ulex, I strongly recommend first installing findlib
(http://ocaml-programming.de/packages). Then rename the lexer source:

mv lexer.mll lexer.ml

As ulex does not create intermediate files, there is no need for the
.mll extension. Compile with

ocamlfind ocamlc -package ulex -syntax camlp4o <args>

or

ocamlfind ocamlopt -package ulex -syntax camlp4o <args>

for the native-code compiler. <args> are the same arguments as for plain
ocamlc/ocamlopt. When linking the executable, also add the flag -linkpkg
to the compiler invocations.

You can simply use these compiler commands for all .ml and .mli files.

Of course, you must also modify lexer.ml. In principle, transform

{ <header> }

rule <name1> <arg1> <arg2> ... = 
  parse <regexp> { <action> }
      | <regexp> { <action> } ...

{ <trailer> }

to:

<header>

let <name1> <arg1> <arg2> ... =
  lexer <regexp> -> <action>
      | <regexp> -> <action> ...
;;

<trailer>

This is the purely syntactic part of the transformation. Furthermore,
typing is a bit different.
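
As a concrete instance of the syntactic part, a small scanner written in
the ulex style might look as follows. This is only a sketch: the token
type and the rule names are hypothetical, and xml_letter is one of the
character classes that ulex predefines.

```ocaml
(* lexer.ml: hypothetical example; compile with
   ocamlfind ocamlc -package ulex -syntax camlp4o -c lexer.ml *)

(* Token type invented for this example. *)
type token =
  | Word of string
  | Number of int
  | Eof

(* Named regular expressions are introduced with "let regexp". *)
let regexp digit = ['0'-'9']
let regexp space = [' ' '\t' '\n']

(* "lexer" produces a function whose last argument is the buffer.
   Inside the actions the buffer is available as "lexbuf". *)
let rec token = lexer
  | space+      -> token lexbuf  (* skip whitespace and scan again *)
  | digit+      -> Number (int_of_string (Ulexing.utf8_lexeme lexbuf))
  | xml_letter+ -> Word (Ulexing.utf8_lexeme lexbuf)
  | eof         -> Eof
```

Note that there is no "parse" keyword and no explicit lexbuf parameter.
Input that matches none of the patterns makes the generated function
raise Ulexing.Error.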

ocamllex uses the helper module Lexing. For example, to get the phrase
that was just scanned, you can use the function call

Lexing.lexeme lexbuf

within one of the <action>s. lexbuf is the buffer the lexer operates on.
ulex needs another type of buffer, suitable for Unicode. The module
Ulexing provides such a buffer, but its functions are typed differently.
The corresponding call

Ulexing.lexeme lexbuf

returns the phrase not as a string (O'Caml strings are simply sequences
of 8 bit characters) but as an array of integers, one per code point. Use

Ulexing.utf8_lexeme lexbuf

to get a string of UTF-8 bytes.
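
To make the difference between the two representations tangible, the
following plain O'Caml function converts an array of code points (the
shape of value that Ulexing.lexeme returns) into a string of UTF-8 bytes.
It is an illustrative sketch, not the code Ulexing uses internally; the
function name is made up, and the input is assumed to contain only valid
code points (no range checking is done).

```ocaml
(* Encode an array of Unicode code points as a UTF-8 byte string.
   Assumes every element is in the range 0 .. 0x10FFFF. *)
let utf8_of_code_points (cps : int array) : string =
  let buf = Buffer.create (Array.length cps) in
  let add i = Buffer.add_char buf (Char.chr i) in
  Array.iter
    (fun cp ->
      if cp < 0x80 then
        add cp                                  (* one byte *)
      else if cp < 0x800 then begin             (* two bytes *)
        add (0xC0 lor (cp lsr 6));
        add (0x80 lor (cp land 0x3F))
      end else if cp < 0x10000 then begin       (* three bytes *)
        add (0xE0 lor (cp lsr 12));
        add (0x80 lor ((cp lsr 6) land 0x3F));
        add (0x80 lor (cp land 0x3F))
      end else begin                            (* four bytes *)
        add (0xF0 lor (cp lsr 18));
        add (0x80 lor ((cp lsr 12) land 0x3F));
        add (0x80 lor ((cp lsr 6) land 0x3F));
        add (0x80 lor (cp land 0x3F))
      end)
    cps;
  Buffer.contents buf
```

For example, the three code points 0x48, 0xE9 and 0x20AC become a string
of six bytes, because the second needs two bytes and the third needs
three.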

You will also see the different typing when you call the generated
lexers. For ocamllex, this is something like:

let lexbuf = Lexing.from_string "Example string" in
<name> lexbuf

(where <name> is the name of a lexer). For ulex, this is

let lexbuf = Ulexing.from_utf8_string "Example string" in
<name> lexbuf

Look into ulexing.mli: lexbufs can also be created from other sources.
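
As one example of another source, the sketch below reads UTF-8 text from
standard input through Ulexing.from_utf8_channel. The file name and the
rule name "count" are hypothetical; the program simply counts runs of
letters.

```ocaml
(* count.ml: hypothetical example; build with
   ocamlfind ocamlopt -package ulex -syntax camlp4o -linkpkg count.ml -o count *)

(* Count runs of letters; any other code point is skipped. *)
let rec count n = lexer
  | xml_letter+ -> count (n + 1) lexbuf
  | eof         -> n
  | _           -> count n lexbuf

let () =
  (* The buffer decodes UTF-8 from the channel as the lexer consumes it. *)
  let lexbuf = Ulexing.from_utf8_channel stdin in
  Printf.printf "%d words\n" (count 0 lexbuf)
```

The extra argument n comes before the implicit buffer argument, in the
same way as <arg1> <arg2> ... in the transformation scheme.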

> Please help :)
> 
> 
> PS: I tryed to modify file lexer.ml (such file produced by ocamllex),
> but I don't know what exactly I should modify - lexer.ml is not
> human-readable.

Well, this is a finite automaton expressed as a lookup table. After the
NFA to DFA transformation step, it is practically impossible to
understand it.

Gerd

P.S. Maybe this is also interesting for you:
http://www.gerd-stolpmann.de/buero/service_ocaml.html.en

-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

