From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from discorde.inria.fr (discorde.inria.fr [192.93.2.38]) by yquem.inria.fr (Postfix) with ESMTP id 2368ABC69 for ; Sat, 24 Feb 2007 03:30:43 +0100 (CET) Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net [203.16.214.140]) by discorde.inria.fr (8.13.6/8.13.6) with ESMTP id l1O2UeJl030076 for ; Sat, 24 Feb 2007 03:30:42 +0100 Received: from ppp41-119.lns2.syd6.internode.on.net (HELO rosella) ([59.167.41.119]) by ipmail01.adl2.internode.on.net with ESMTP; 24 Feb 2007 12:55:32 +1030 X-IronPort-AV: i="4.14,213,1170595800"; d="scan'208"; a="92714781:sNHT42642117" Subject: Re: [Caml-list] Case-insensitive lexing From: skaller To: Joel Reymont Cc: caml-list@inria.fr In-Reply-To: <23E977DC-77FF-44A2-8675-5EAA8F61505F@gmail.com> References: <23E977DC-77FF-44A2-8675-5EAA8F61505F@gmail.com> Content-Type: text/plain Date: Sat, 24 Feb 2007 13:25:29 +1100 Message-Id: <1172283929.23720.36.camel@rosella.wigram> Mime-Version: 1.0 X-Mailer: Evolution 2.6.1 Content-Transfer-Encoding: 7bit X-Miltered: at discorde with ID 45DFA350.000 by Joe's j-chkmail (http://j-chkmail . ensmp . fr)! X-Spam: no; 0.00; lexing:01 lexer:01 ocamllex:01 lexeme:01 lexbuf:01 hashtbl:01 lexer:01 ocamllex:01 hash:01 tail-call:01 lexers:01 lexers:01 ocaml's:01 parser:01 syntax:01 On Fri, 2007-02-23 at 15:32 +0000, Joel Reymont wrote: > Folks, > > Is there a way to make a case-insensitive lexer with ocamllex? > > The only answer I was able to find is the following from John Skaller: > > | identifier { > let x = toupper (lexeme lexbuf) in > try Hashtbl.find keywords x > With Not_found -> x > } > You should note there is another reason for this technique, which isn't really related to case insensitivity. A lexer in which you spell out all the keywords like: | "let" { LET } | "in" { IN } | "[Cc][Aa][Ss][Ee]" -> { CASE } .... is likely to blow ocamllex out of the water if you have a lot of keywords. This is a good thing, because the generated code would have a transition matrix so large it would blow your program out of the water. (** but it would be good if it produced a warning instead of looping forever .. can I suggest a command line argument which limits the number of states? eg ocamllex -maxtstates 1000 ...) Using a hash table here retains roughly O(1) performance, and may actually be *faster* than using a transition matrix because the large matrix would probably cause disk paging for every character. As an example, I once tried to decode UTF-8, which caused ocamllex to explode. The encoding is finite and therefore regular .. but trying to capture the value in the lexer state isn't the best way to do this because there are too many states. If you have a nasty lexicology it's a good idea to factor it if possible, and implement 'sublexers' using different techniques. Note you can actually tail-call ocamllex lexers in their action codes, and, ocamllex lexers allow arguments, so you can actually pass complex state around. You might also note recent Ocaml's provide substring extraction using Ville Laurikari's tagged automata technique. If you want an EXTREME example of this -- the Felix lexer can actually call the Felix parser and wrap a whole AST up inside a single token. This is used to open the grammar up to support user defined syntax extensions: the extensions are actually parsed with an executable recursive descent parser .. but the core constructions are still parsed by ocamlyacc LALR(1) parsing, so the core language retains good error diagnostics and high performance. The whole of my code is entirely re-entrant (no global state). So actually, although it isn't advertised very heavily, ocamllex has quite advanced combinatorial properties, whereas traditional 'lex' lexers are considerably more monolithic. -- John Skaller Felix, successor to C++: http://felix.sf.net