caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Case-insensitive lexing
@ 2007-02-23 15:32 Joel Reymont
  2007-02-23 15:41 ` [Caml-list] " Denis Bueno
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Joel Reymont @ 2007-02-23 15:32 UTC (permalink / raw)
  To: caml-list

Folks,

Is there a way to make a case-insensitive lexer with ocamllex?

The only answer I was able to find is the following from John Skaller:

| identifier {
   let x = toupper (lexeme lexbuf) in
   try Hashtbl.find keywords x
   With Not_found -> x
}

	Thanks, Joel

--
http://wagerlabs.com/






^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 15:32 Case-insensitive lexing Joel Reymont
@ 2007-02-23 15:41 ` Denis Bueno
       [not found] ` <011EB42A-05E3-4686-BED7-2DB8B2663221@cs.uni-sb.de>
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Denis Bueno @ 2007-02-23 15:41 UTC (permalink / raw)
  To: Joel Reymont; +Cc: caml-list

On 2/23/07, Joel Reymont <joelr1@gmail.com> wrote:
> Folks,
>
> Is there a way to make a case-insensitive lexer with ocamllex?
>
> The only answer I was able to find is the following from John Skaller:

You could also make all your token regular expressions accept all
possible cases. So if `int' were a keyword to lex, your regexp might
be "[iI][nN][tT]". You could (easily?) write something that would
accept a keyword regexp and give you back a case-insensitive keyword
regexp.

It looks like you might have been asking for case-insensitive
comparison for your keyword hashtable. This is quite easy, just by
using a HashedType with a case-insensitive `compare' function.

-Denis


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
       [not found] ` <011EB42A-05E3-4686-BED7-2DB8B2663221@cs.uni-sb.de>
@ 2007-02-23 16:02   ` Joel Reymont
  2007-02-23 19:04     ` Martin Jambon
       [not found]     ` <42EFE4CF-F2F8-4AD2-9909-D0CF027A17AD@cs.uni-sb.de>
  0 siblings, 2 replies; 13+ messages in thread
From: Joel Reymont @ 2007-02-23 16:02 UTC (permalink / raw)
  To: caml-list


On Feb 23, 2007, at 3:50 PM, Christian Lindig wrote:

> I assume you define identifier to a sequence of upper and lower  
> characters. What is wrong with this solution and what would you  
> have liked to see instead?

I was hoping not having to write this (from a Lex file):

A       [Aa]
B       [Bb]
..
Y       [Yy]
Z       [Zz]

{A}{B}{O}{V}{E}                 { return(ABOVE); }
{A}{G}{O}                       { return(AGO); }
{A}{L}{E}{R}{T}                 { return(ALERT); }

Or even uglier

{N}{U}{M}{E}{R}{I}{C}{A}{R}{R}{A}{Y}{R}{E}{F}         ('ARRAY-NUM- 
REF, yytext)
{T}{R}{U}{E}{F}{A}{L}{S}{E}{A}{R}{R}{A}{Y}            ('ARRAY-NUM,  
yytext)
{T}{R}{U}{E}{F}{A}{L}{S}{E}{A}{R}{R}{A}{Y}{R}{E}{F}   ('ARRAY-NUM- 
REF, yytext)

	Thanks, Joel

--
http://wagerlabs.com/






^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 16:02   ` Joel Reymont
@ 2007-02-23 19:04     ` Martin Jambon
       [not found]     ` <42EFE4CF-F2F8-4AD2-9909-D0CF027A17AD@cs.uni-sb.de>
  1 sibling, 0 replies; 13+ messages in thread
From: Martin Jambon @ 2007-02-23 19:04 UTC (permalink / raw)
  To: Joel Reymont; +Cc: caml-list

On Fri, 23 Feb 2007, Joel Reymont wrote:

>
> On Feb 23, 2007, at 3:50 PM, Christian Lindig wrote:
>
> > I assume you define identifier to a sequence of upper and lower
> > characters. What is wrong with this solution and what would you
> > have liked to see instead?
>
> I was hoping not having to write this (from a Lex file):
>
> A       [Aa]
> B       [Bb]
> ..
> Y       [Yy]
> Z       [Zz]
>
> {A}{B}{O}{V}{E}                 { return(ABOVE); }
> {A}{G}{O}                       { return(AGO); }
> {A}{L}{E}{R}{T}                 { return(ALERT); }
>
> Or even uglier
>
> {N}{U}{M}{E}{R}{I}{C}{A}{R}{R}{A}{Y}{R}{E}{F}         ('ARRAY-NUM-
> REF, yytext)
> {T}{R}{U}{E}{F}{A}{L}{S}{E}{A}{R}{R}{A}{Y}            ('ARRAY-NUM,
> yytext)
> {T}{R}{U}{E}{F}{A}{L}{S}{E}{A}{R}{R}{A}{Y}{R}{E}{F}   ('ARRAY-NUM-
> REF, yytext)

The ocamllex version would look like:

let a = ['A' 'a']
let b = ['B' 'b']
...

  a b o v e                 { ABOVE }
| a g o                     { AGO }
| a l e r t                 { ALERT }


Note that in micmatch--which is not ocamllex but supports the same core
syntax--there is a tilde operator which makes the regexp case-insensitive
according to latin1. It doesn't work with every language or encoding
though.


Martin

--
Martin Jambon
http://martin.jambon.free.fr


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 15:32 Case-insensitive lexing Joel Reymont
  2007-02-23 15:41 ` [Caml-list] " Denis Bueno
       [not found] ` <011EB42A-05E3-4686-BED7-2DB8B2663221@cs.uni-sb.de>
@ 2007-02-23 20:21 ` Francois Rouaix
  2007-02-24  2:25 ` skaller
  2007-02-24  2:46 ` Jon Harrop
  4 siblings, 0 replies; 13+ messages in thread
From: Francois Rouaix @ 2007-02-23 20:21 UTC (permalink / raw)
  To: Joel Reymont; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 951 bytes --]

You could wrap existing lexbuf generating functions with a layer that
converts everything to uppercase or lowercase, and write your lexer
accordingly? Won't work if you have constants strings, but otherwise it puts
all the case logic in one place.


On 2/23/07, Joel Reymont <joelr1@gmail.com> wrote:
>
> Folks,
>
> Is there a way to make a case-insensitive lexer with ocamllex?
>
> The only answer I was able to find is the following from John Skaller:
>
> | identifier {
>    let x = toupper (lexeme lexbuf) in
>    try Hashtbl.find keywords x
>    With Not_found -> x
> }
>
>         Thanks, Joel
>
> --
> http://wagerlabs.com/
>
>
>
>
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>

[-- Attachment #2: Type: text/html, Size: 1597 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
       [not found]     ` <42EFE4CF-F2F8-4AD2-9909-D0CF027A17AD@cs.uni-sb.de>
@ 2007-02-23 21:03       ` Joel Reymont
  2007-02-23 22:51         ` Erik de Castro Lopo
  0 siblings, 1 reply; 13+ messages in thread
From: Joel Reymont @ 2007-02-23 21:03 UTC (permalink / raw)
  To: caml-list

Christian,

On Feb 23, 2007, at 4:50 PM, Christian Lindig wrote:

> Ugly, indeed. The solution you have posted before is general better  
> and the approved way for identifying keywords in a Caml lexer: have  
> a rule for idents and look up the keywords in a hash table. Make  
> these case insensitive is then easy.

Is this better than Martin's version of

let a = ['a' 'A']
...
| a b o v e { ABOVE }

?

Is there a significant advantage to keeping keywords in a hash table?

	Thanks, Joel

--
http://wagerlabs.com/






^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 21:03       ` Joel Reymont
@ 2007-02-23 22:51         ` Erik de Castro Lopo
  2007-02-23 23:00           ` Jon Harrop
  0 siblings, 1 reply; 13+ messages in thread
From: Erik de Castro Lopo @ 2007-02-23 22:51 UTC (permalink / raw)
  To: caml-list

Joel Reymont wrote:

> Is there a significant advantage to keeping keywords in a hash table?

Very definitely. Ocamlex has some limitations on the size of the
tables it can generate and those limits are relatively easy to
hit :-).

Moving all the keywords into a hashtable, and using that to differentiate
between keywords and identifiers actually works very well.

Cheers,
Erik
-- 
+-----------------------------------------------------------+
  Erik de Castro Lopo
+-----------------------------------------------------------+
"C makes it easy to shoot yourself in the foot. C++ makes it 
harder, but when you do, you blow away your whole leg!"
-- Bjarne Stroustrup


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 22:51         ` Erik de Castro Lopo
@ 2007-02-23 23:00           ` Jon Harrop
  2007-02-23 23:39             ` Tom
  0 siblings, 1 reply; 13+ messages in thread
From: Jon Harrop @ 2007-02-23 23:00 UTC (permalink / raw)
  To: caml-list

On Friday 23 February 2007 22:51, Erik de Castro Lopo wrote:
> Moving all the keywords into a hashtable, and using that to differentiate
> between keywords and identifiers actually works very well.

I never understood why people use a hash table rather than a pattern match in 
OCaml.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 23:00           ` Jon Harrop
@ 2007-02-23 23:39             ` Tom
  2007-02-24  0:32               ` Jon Harrop
  0 siblings, 1 reply; 13+ messages in thread
From: Tom @ 2007-02-23 23:39 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

>
>
> > Moving all the keywords into a hashtable, and using that to
> differentiate
> > between keywords and identifiers actually works very well.
>
> I never understood why people use a hash table rather than a pattern match
> in
> OCaml.
>

Because, at least what I think, comparing strings using a hash-table is way
faster than comparing them using pattern-matching.

- Tom

[-- Attachment #2: Type: text/html, Size: 552 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 23:39             ` Tom
@ 2007-02-24  0:32               ` Jon Harrop
  2007-03-17 19:55                 ` Oliver Bandel
  0 siblings, 1 reply; 13+ messages in thread
From: Jon Harrop @ 2007-02-24  0:32 UTC (permalink / raw)
  To: caml-list

On Friday 23 February 2007 23:39, you wrote:
> Because, at least what I think, comparing strings using a hash-table is way
> faster than comparing them using pattern-matching.

I had always assumed that OCaml would build an optimal dispatch table when 
pattern matching over strings but it seems you are quite right: it does the 
worst possible linear string equality tests, not even O(log n) comparisons.

That seems like a great shame to me. Can someone implement this?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 15:32 Case-insensitive lexing Joel Reymont
                   ` (2 preceding siblings ...)
  2007-02-23 20:21 ` Francois Rouaix
@ 2007-02-24  2:25 ` skaller
  2007-02-24  2:46 ` Jon Harrop
  4 siblings, 0 replies; 13+ messages in thread
From: skaller @ 2007-02-24  2:25 UTC (permalink / raw)
  To: Joel Reymont; +Cc: caml-list

On Fri, 2007-02-23 at 15:32 +0000, Joel Reymont wrote:
> Folks,
> 
> Is there a way to make a case-insensitive lexer with ocamllex?
> 
> The only answer I was able to find is the following from John Skaller:
> 
> | identifier {
>    let x = toupper (lexeme lexbuf) in
>    try Hashtbl.find keywords x
>    With Not_found -> x
> }
> 

You should note there is another reason for this technique,
which isn't really related to case insensitivity.

A lexer in which you spell out all the keywords like:

| "let"  { LET }
| "in"   { IN }
| "[Cc][Aa][Ss][Ee]" -> { CASE }
....

is likely to blow ocamllex out of the water if you have
a lot of keywords. This is a good thing, because the 
generated code would have a transition matrix so large 
it would blow your program out of the water. 

(** but it would be good if it produced a warning instead 
of looping forever .. can I suggest a command line argument
which limits the number of states? 
eg ocamllex -maxtstates 1000 ...)


Using a hash table here retains roughly O(1) performance,
and may actually be *faster* than using a transition matrix
because the large matrix would probably cause disk paging
for every character.

As an example, I once tried to decode UTF-8, which caused
ocamllex to explode. The encoding is finite and therefore
regular .. but trying to capture the value in the lexer
state isn't the best way to do this because there are
too many states.

If you have a nasty lexicology it's a good idea to factor
it if possible, and implement 'sublexers' using different
techniques. Note you can actually tail-call ocamllex lexers
in their action codes, and, ocamllex lexers allow arguments,
so you can actually pass complex state around. You might also
note recent Ocaml's provide substring extraction using 
Ville Laurikari's tagged automata technique.

If you want an EXTREME example of this -- the Felix lexer
can actually call the Felix parser and wrap a whole AST
up inside a single token. This is used to open the grammar
up to support user defined syntax extensions: the extensions
are actually parsed with an executable recursive descent
parser .. but the core constructions are still parsed
by ocamlyacc LALR(1) parsing, so the core language retains
good error diagnostics and high performance. The whole
of my code is entirely re-entrant (no global state).

So actually, although it isn't advertised very heavily,
ocamllex has quite advanced combinatorial properties,
whereas traditional 'lex' lexers are considerably more 
monolithic.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-23 15:32 Case-insensitive lexing Joel Reymont
                   ` (3 preceding siblings ...)
  2007-02-24  2:25 ` skaller
@ 2007-02-24  2:46 ` Jon Harrop
  4 siblings, 0 replies; 13+ messages in thread
From: Jon Harrop @ 2007-02-24  2:46 UTC (permalink / raw)
  To: caml-list

On Friday 23 February 2007 15:32, Joel Reymont wrote:
> Is there a way to make a case-insensitive lexer with ocamllex?

Preprocess the character stream, making all chars lower case before they even 
reach ocamllex.

Take the existing definition of Lexing.from_channel:

let from_channel ic =
  from_function (fun buf n -> input ic buf 0 n)

and alter it to change the case of chars as they are read (off the top of my 
head):

let from_case_insensitive_channel ic =
  let aux buf n =
    let i = input ic buf 0 n in
    for i=0 to i-1 do
      buf.[i] <- Char.lowercase buf.[i]
    done;
    i in
  from_function aux

Now create your lexbuf using from_case_insensitive_channel.

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Caml-list] Case-insensitive lexing
  2007-02-24  0:32               ` Jon Harrop
@ 2007-03-17 19:55                 ` Oliver Bandel
  0 siblings, 0 replies; 13+ messages in thread
From: Oliver Bandel @ 2007-03-17 19:55 UTC (permalink / raw)
  To: caml-list

On Sat, Feb 24, 2007 at 12:32:18AM +0000, Jon Harrop wrote:
> On Friday 23 February 2007 23:39, you wrote:
> > Because, at least what I think, comparing strings using a hash-table is way
> > faster than comparing them using pattern-matching.
> 
> I had always assumed that OCaml would build an optimal dispatch table when 
> pattern matching over strings but it seems you are quite right: it does the 
> worst possible linear string equality tests, not even O(log n) comparisons.
> 
> That seems like a great shame to me. Can someone implement this?
[...]

=> Feature-wish in Ocaml-Bug-Tracker?


Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-03-17 19:55 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-23 15:32 Case-insensitive lexing Joel Reymont
2007-02-23 15:41 ` [Caml-list] " Denis Bueno
     [not found] ` <011EB42A-05E3-4686-BED7-2DB8B2663221@cs.uni-sb.de>
2007-02-23 16:02   ` Joel Reymont
2007-02-23 19:04     ` Martin Jambon
     [not found]     ` <42EFE4CF-F2F8-4AD2-9909-D0CF027A17AD@cs.uni-sb.de>
2007-02-23 21:03       ` Joel Reymont
2007-02-23 22:51         ` Erik de Castro Lopo
2007-02-23 23:00           ` Jon Harrop
2007-02-23 23:39             ` Tom
2007-02-24  0:32               ` Jon Harrop
2007-03-17 19:55                 ` Oliver Bandel
2007-02-23 20:21 ` Francois Rouaix
2007-02-24  2:25 ` skaller
2007-02-24  2:46 ` Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).