Re: [Caml-list] Case-insensitive lexing

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: skaller <skaller@users.sourceforge.net>
To: Joel Reymont <joelr1@gmail.com>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Case-insensitive lexing
Date: Sat, 24 Feb 2007 13:25:29 +1100	[thread overview]
Message-ID: <1172283929.23720.36.camel@rosella.wigram> (raw)
In-Reply-To: <23E977DC-77FF-44A2-8675-5EAA8F61505F@gmail.com>

On Fri, 2007-02-23 at 15:32 +0000, Joel Reymont wrote:
> Folks,
> 
> Is there a way to make a case-insensitive lexer with ocamllex?
> 
> The only answer I was able to find is the following from John Skaller:
> 
> | identifier {
>    let x = toupper (lexeme lexbuf) in
>    try Hashtbl.find keywords x
>    With Not_found -> x
> }
> 

You should note there is another reason for this technique,
which isn't really related to case insensitivity.

A lexer in which you spell out all the keywords like:

| "let"  { LET }
| "in"   { IN }
| "[Cc][Aa][Ss][Ee]" -> { CASE }
....

is likely to blow ocamllex out of the water if you have
a lot of keywords. This is a good thing, because the 
generated code would have a transition matrix so large 
it would blow your program out of the water. 

(** but it would be good if it produced a warning instead 
of looping forever .. can I suggest a command line argument
which limits the number of states? 
eg ocamllex -maxtstates 1000 ...)

Using a hash table here retains roughly O(1) performance,
and may actually be *faster* than using a transition matrix
because the large matrix would probably cause disk paging
for every character.

As an example, I once tried to decode UTF-8, which caused
ocamllex to explode. The encoding is finite and therefore
regular .. but trying to capture the value in the lexer
state isn't the best way to do this because there are
too many states.

If you have a nasty lexicology it's a good idea to factor
it if possible, and implement 'sublexers' using different
techniques. Note you can actually tail-call ocamllex lexers
in their action codes, and, ocamllex lexers allow arguments,
so you can actually pass complex state around. You might also
note recent Ocaml's provide substring extraction using 
Ville Laurikari's tagged automata technique.

If you want an EXTREME example of this -- the Felix lexer
can actually call the Felix parser and wrap a whole AST
up inside a single token. This is used to open the grammar
up to support user defined syntax extensions: the extensions
are actually parsed with an executable recursive descent
parser .. but the core constructions are still parsed
by ocamlyacc LALR(1) parsing, so the core language retains
good error diagnostics and high performance. The whole
of my code is entirely re-entrant (no global state).

So actually, although it isn't advertised very heavily,
ocamllex has quite advanced combinatorial properties,
whereas traditional 'lex' lexers are considerably more 
monolithic.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

next prev parent reply	other threads:[~2007-02-24  2:30 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-02-23 15:32 Joel Reymont
2007-02-23 15:41 ` [Caml-list] " Denis Bueno
     [not found] ` <011EB42A-05E3-4686-BED7-2DB8B2663221@cs.uni-sb.de>
2007-02-23 16:02   ` Joel Reymont
2007-02-23 19:04     ` Martin Jambon
     [not found]     ` <42EFE4CF-F2F8-4AD2-9909-D0CF027A17AD@cs.uni-sb.de>
2007-02-23 21:03       ` Joel Reymont
2007-02-23 22:51         ` Erik de Castro Lopo
2007-02-23 23:00           ` Jon Harrop
2007-02-23 23:39             ` Tom
2007-02-24  0:32               ` Jon Harrop
2007-03-17 19:55                 ` Oliver Bandel
2007-02-23 20:21 ` Francois Rouaix
2007-02-24  2:25 ` skaller [this message]
2007-02-24  2:46 ` Jon Harrop

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1172283929.23720.36.camel@rosella.wigram \
    --to=skaller@users.sourceforge.net \
    --cc=caml-list@inria.fr \
    --cc=joelr1@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).