caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] ocamllex+ocamlyacc and not parsing properly
@ 2005-08-07 21:39 Jonathan Roewen
  2005-08-07 21:54 ` Jonathan Roewen
  2005-08-07 23:58 ` Jon Harrop
  0 siblings, 2 replies; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-07 21:39 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

Hi,

I'm having some trouble with a lexer+parser I've written to parse IRC
strings. Just about all strings are parsed correctly, but I'm having a
few minor issues.

Here are two strings that fail to parse correctly:
:Sovereign.Wyldryde.org 254 dst 112 :holodeck programs running

:Sovereign.Wyldryde.org 333 dst #bfos Helio 112025589

My irc_lexer.mll & irc_parser.mly are attached :-) (don't wanna spam
the list with huge lines of code)

Used in ircii.ml as:

let input_line = IO.read_line instream in
  try match (Irc_parser.args Irc_lexer.message (Lexing.from_string
input_line)) with

If someone can highlight what's going wrong, I'd really appreciate it
(though, it doesn't currently affect any received irc commands of
great significance at the moment).

BTW: As an aside, if the lexer doesn't cover all the bases, it doesn't
throw an exception, just screws up my OS (Bounds check error, followed
by seg-fault).

Jonathan

[-- Attachment #2: irc_lexer.mll --]
[-- Type: text/plain, Size: 1106 bytes --]

{
open Lexing
open Irc_parser
open Irc_types
}

let letter = [^' ']
let digit = ['0'-'9']

rule param = parse
	| ':'((letter|' ')* as s)			{ STRING s }
	| (letter+) as s					{ STRING s }
	| [' ']+							{ param lexbuf }
	| eof								{ EOL }

and command = parse
	| (digit digit digit) as num		{ COMMAND (Numeric (int_of_string num)) }
	| "JOIN"							{ COMMAND JOIN }
	| "PART"							{ COMMAND PART }
	| "MODE"							{ COMMAND MODE }
	| "TOPIC"							{ COMMAND TOPIC }
	| "NAMES"							{ COMMAND NAMES }
	| "LIST"							{ COMMAND LIST }
	| "INVITE"							{ COMMAND INVITE }
	| "KICK"							{ COMMAND KICK }
	| "PRIVMSG"							{ COMMAND PRIVMSG }
	| "NOTICE"							{ COMMAND NOTICE }
	| "QUIT"							{ COMMAND QUIT }
	| "PING"							{ COMMAND PING }
	| [' ']*							{ param lexbuf } (* this means we won't get a command, and parsing will fail *)
	| eof								{ EOL } (* this will raise a parsing error *)

and message = parse
	| ':'((letter+) as s)				{ STRING s }
	| [' ']*							{ command lexbuf }
	| eof								{ EOL } (* and this will also raise a parsing error *)

[-- Attachment #3: irc_parser.mly --]
[-- Type: text/plain, Size: 335 bytes --]

%token EOL
%token <string> STRING /* this is for parameters */
%token <Irc_types.command> COMMAND

%start args
%type <string option * Irc_types.command * string list> args
%%

params:
	| STRING params { $1 :: $2 }
	| EOL { [] }
	
args:
	| STRING COMMAND params { (Some $1, $2, $3) }
	| COMMAND params { (None, $1, $2) }

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-07 21:39 [Caml-list] ocamllex+ocamlyacc and not parsing properly Jonathan Roewen
@ 2005-08-07 21:54 ` Jonathan Roewen
  2005-08-07 23:58 ` Jon Harrop
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-07 21:54 UTC (permalink / raw)
  To: caml-list

I should add what the expected format of the IRC string is:

[':' <from> <space>] <command> { <space> <param> } [ <space> ':'<last-param> ]

from := letters not including space
command = digit digit digit OR uppercase letters only
param = letters not including spaces
last-param = any letters including spaces

Jonathan

On 8/8/05, Jonathan Roewen <jonathan.roewen@gmail.com> wrote:
> Hi,
> 
> I'm having some trouble with a lexer+parser I've written to parse IRC
> strings. Just about all strings are parsed correctly, but I'm having a
> few minor issues.
> 
> Here are two strings that fail to parse correctly:
> :Sovereign.Wyldryde.org 254 dst 112 :holodeck programs running
> 
> :Sovereign.Wyldryde.org 333 dst #bfos Helio 112025589
> 
> My irc_lexer.mll & irc_parser.mly are attached :-) (don't wanna spam
> the list with huge lines of code)
> 
> Used in ircii.ml as:
> 
> let input_line = IO.read_line instream in
>  try match (Irc_parser.args Irc_lexer.message (Lexing.from_string
> input_line)) with
> 
> If someone can highlight what's going wrong, I'd really appreciate it
> (though, it doesn't currently affect any received irc commands of
> great significance at the moment).
> 
> BTW: As an aside, if the lexer doesn't cover all the bases, it doesn't
> throw an exception, just screws up my OS (Bounds check error, followed
> by seg-fault).
> 
> Jonathan
> 
> 
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-07 21:39 [Caml-list] ocamllex+ocamlyacc and not parsing properly Jonathan Roewen
  2005-08-07 21:54 ` Jonathan Roewen
@ 2005-08-07 23:58 ` Jon Harrop
  2005-08-08  2:17   ` Jonathan Roewen
  1 sibling, 1 reply; 9+ messages in thread
From: Jon Harrop @ 2005-08-07 23:58 UTC (permalink / raw)
  To: caml-list

On Sunday 07 August 2005 22:39, Jonathan Roewen wrote:
> I'm having some trouble with a lexer+parser I've written to parse IRC
> strings. Just about all strings are parsed correctly, but I'm having a
> few minor issues.
>
> Here are two strings that fail to parse correctly:
> :Sovereign.Wyldryde.org 254 dst 112 :holodeck programs running
> :
> :Sovereign.Wyldryde.org 333 dst #bfos Helio 112025589

I just added "irc_types.ml":

type command = JOIN | PART | MODE | TOPIC | NAMES | LIST | INVITE
               | KICK | PRIVMSG | NOTICE | QUIT | PING | Numeric of int

and compiled with:

ocamllex irc_lexer.mll
ocamlyacc irc_parser.mly
ocamlc -c irc_types.ml irc_parser.mli irc_parser.ml irc_lexer.ml
ocamlmktop irc_types.cmo irc_parser.cmo irc_lexer.cmo -o irc.top

ran the custom top-level with "./irc.top" and asked it to lex the first of 
your example strings:

# let lexbuf = Lexing.from_string ":Sovereign.Wyldryde.org 254 dst 
112:holodeck programs running";;
val lexbuf : Lexing.lexbuf =
  {Lexing.refill_buff = <fun>;
   Lexing.lex_buffer =
    ":Sovereign.Wyldryde.org 254 dst 112 :holodeck programs running";
   Lexing.lex_buffer_len = 62; Lexing.lex_abs_pos = 0;
   Lexing.lex_start_pos = 0; Lexing.lex_curr_pos = 0;
   Lexing.lex_last_pos = 0; Lexing.lex_last_action = 0;
   Lexing.lex_eof_reached = true; Lexing.lex_mem = [||];
   Lexing.lex_start_p =
    {Lexing.pos_fname = ""; Lexing.pos_lnum = 1; Lexing.pos_bol = 0;
     Lexing.pos_cnum = 0};
   Lexing.lex_curr_p =
    {Lexing.pos_fname = ""; Lexing.pos_lnum = 1; Lexing.pos_bol = 0;
     Lexing.pos_cnum = 0}}
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.STRING "Sovereign.Wyldryde.org"
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.COMMAND (Irc_types.Numeric 254)
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.STRING "dst"
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.COMMAND (Irc_types.Numeric 112)
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.STRING "holodeck programs running"
# Irc_lexer.message lexbuf;;
- : Irc_parser.token = Irc_parser.EOL

So you're lexer is emitting the tokens str, com, str, com, str, eol but your 
parser looks as though it is expecting str, com, str, str, str, eol.

I'm guessing the error is in the lexer because the grammar in the parser is 
very simple. So ":Sovereign.Wyldryde.org" is lexed by "message" into str, " " 
then invokes "command" which parses 254 into com, " " then invokes "param" 
which parses "dst" into str, "param" then invokes the remaining into strs.

However, that can't be correct because the lexer has clearly gone back into 
"command" in order to emit "Irc_types.Numeric 112".

It's just a guess, but have you assumed that each time the lexer is invoked by 
the parser that it starts in the rule it was left in when, in fact, the 
parser invokes the "message" rule every time?

> BTW: As an aside, if the lexer doesn't cover all the bases, it doesn't
> throw an exception, just screws up my OS (Bounds check error, followed
> by seg-fault).

Any idea what is causing the segfault?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-07 23:58 ` Jon Harrop
@ 2005-08-08  2:17   ` Jonathan Roewen
  2005-08-08  4:23     ` Jonathan Roewen
  0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-08  2:17 UTC (permalink / raw)
  Cc: caml-list

> So you're lexer is emitting the tokens str, com, str, com, str, eol but your
> parser looks as though it is expecting str, com, str, str, str, eol.
> 
> I'm guessing the error is in the lexer because the grammar in the parser is
> very simple. So ":Sovereign.Wyldryde.org" is lexed by "message" into str, " "
> then invokes "command" which parses 254 into com, " " then invokes "param"
> which parses "dst" into str, "param" then invokes the remaining into strs.
> 
> However, that can't be correct because the lexer has clearly gone back into
> "command" in order to emit "Irc_types.Numeric 112".
> 
> It's just a guess, but have you assumed that each time the lexer is invoked by
> the parser that it starts in the rule it was left in when, in fact, the
> parser invokes the "message" rule every time?

Yes, I assume it continues lexing where it left off. If this is not
the case, what's the best way to do conditional parsing??

Jonathan


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-08  2:17   ` Jonathan Roewen
@ 2005-08-08  4:23     ` Jonathan Roewen
  2005-08-08  5:03       ` Jonathan Roewen
  2005-08-08  8:59       ` skaller
  0 siblings, 2 replies; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-08  4:23 UTC (permalink / raw)
  To: caml-list

Is there any way to call another rule based on some variable in
ocamllex? I see you can pass arguments to a rule, but what use are
these except in the actions part?

I have two cases:

if (and only if) it's the first token, ':' starts a token with no
spaces. and if (and only if) it's the last token, ':' starts a token
that consumes the rest of the input line.

I'm completely lost how to do this in light of the fact that the lexer
starts at the first rule for each parsed token =(

Is it possible to restart parsing from the start of the matched lexeme
using an action?

Jonathan


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-08  4:23     ` Jonathan Roewen
@ 2005-08-08  5:03       ` Jonathan Roewen
  2005-08-08  6:39         ` Jon Harrop
  2005-08-08  8:59       ` skaller
  1 sibling, 1 reply; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-08  5:03 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 536 bytes --]

Well, using a ref to maintain state seemed to fix things =) However, I
do have a question on how it manages to parse the last token in an IRC
string correctly.

Given the following input:

":journey PRIVMSG #bfos :\001ACTION dances a jig\001"

How does my irc_lexer.mll match the rule in param, and not the rule in message?

(My new irc_lexer.mll is attached).

BTW: A tutorial on these sorts of things (i.e. something a bit more
complicated than the calculator example) would be great to have as a
resource.

Jonathan

[-- Attachment #2: irc_lexer.mll --]
[-- Type: text/plain, Size: 1291 bytes --]

{
open Lexing
open Irc_parser
open Irc_types

type state = Header | Command | Params

let state = ref Header

let init () = state := Header (* this -must- be called before trying to lex a complete line *)
}

let letter = [^' ']
let digit = ['0'-'9']

rule param = parse
	| ':'((letter|' ')* as s)		{ STRING s }
	| (letter+) as s				{ STRING s }
	| [' ']+						{ param lexbuf }
	| eof							{ EOL }

and command = parse
	| (digit digit digit) as num	{ COMMAND (Numeric (int_of_string num)) }
	| "JOIN"						{ COMMAND JOIN }
	| "PART"						{ COMMAND PART }
	| "MODE"						{ COMMAND MODE }
	| "TOPIC"						{ COMMAND TOPIC }
	| "NAMES"						{ COMMAND NAMES }
	| "LIST"						{ COMMAND LIST }
	| "INVITE"						{ COMMAND INVITE }
	| "KICK"						{ COMMAND KICK }
	| "PRIVMSG"						{ COMMAND PRIVMSG }
	| "NOTICE"						{ COMMAND NOTICE }
	| "QUIT"						{ COMMAND QUIT }
	| "PING"						{ COMMAND PING }
	| eof							{ EOL }

and message = parse
	| ':'((letter+) as s)			{ if !state = Header then state := Command; STRING s }
	| [' ']*						{ if !state = Header || !state = Command then begin
										state := Params;
										command lexbuf
									end else begin
										state := Params;
										param lexbuf
									end }
	| eof							{ EOL }

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-08  5:03       ` Jonathan Roewen
@ 2005-08-08  6:39         ` Jon Harrop
  2005-08-08  6:47           ` Jonathan Roewen
  0 siblings, 1 reply; 9+ messages in thread
From: Jon Harrop @ 2005-08-08  6:39 UTC (permalink / raw)
  To: caml-list

On Monday 08 August 2005 06:03, Jonathan Roewen wrote:
> BTW: A tutorial on these sorts of things (i.e. something a bit more
> complicated than the calculator example) would be great to have as a
> resource.

Given that the grammar is tiny and the lexer has almost as much OCaml code as 
lexing rules, I think this is a good candidate for not using lexing and 
parsing at all. If you wrote this as an OCaml function, how long would it be?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-08  6:39         ` Jon Harrop
@ 2005-08-08  6:47           ` Jonathan Roewen
  0 siblings, 0 replies; 9+ messages in thread
From: Jonathan Roewen @ 2005-08-08  6:47 UTC (permalink / raw)
  To: Jon Harrop; +Cc: caml-list

On 8/8/05, Jon Harrop <jon@ffconsultancy.com> wrote:
> On Monday 08 August 2005 06:03, Jonathan Roewen wrote:
> > BTW: A tutorial on these sorts of things (i.e. something a bit more
> > complicated than the calculator example) would be great to have as a
> > resource.
> 
> Given that the grammar is tiny and the lexer has almost as much OCaml code as
> lexing rules, I think this is a good candidate for not using lexing and
> parsing at all. If you wrote this as an OCaml function, how long would it be?

I have no idea, and don't really want to think about it either.
Perhaps someone could come up with a fancy regex using the Str module
that could do it in a single line, although you'd still need to
generate the types with some big function.

The lexer and parser are straightforward and simple, and they work
(now -- using a ref). Not to mention clear to read :) I think a giant
regex and type generating function would be less clear and
understandable. Sure it'll prolly add about 10Kb using lexer+parser,
but I don't care too much about that.

Jon


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Caml-list] ocamllex+ocamlyacc and not parsing properly
  2005-08-08  4:23     ` Jonathan Roewen
  2005-08-08  5:03       ` Jonathan Roewen
@ 2005-08-08  8:59       ` skaller
  1 sibling, 0 replies; 9+ messages in thread
From: skaller @ 2005-08-08  8:59 UTC (permalink / raw)
  To: Jonathan Roewen; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]

On Mon, 2005-08-08 at 16:23 +1200, Jonathan Roewen wrote:
> Is there any way to call another rule based on some variable in
> ocamllex? I see you can pass arguments to a rule, but what use are
> these except in the actions part?

You are trying to do too much work in the wrong
places IMHO.

Your lexer should always be context insensitive
and build small sensible pretokens NOT tokens:

identifier
integer
whitespace
newline
: # .

Then postprocess these pretokens into tokens,
this is easiest with a list where you can use
pattern matching and functional techniques to
look ahead.

Then parse the tokens, this is easy because you just
choose the tokens to make it easy :)

Use the grammar production arguments $1 $2 ...
to do further processing in the action as required.

The point is: the easy stuff is done by the two
automata (lexer, parser) and the hard stuff
is done in OCAML code.

For example: Felix lexer generates these pretokens:

WHITESPACE NEWLINE COMMENT

which are NOT tokens of the grammar. These tokens
are stripped out by the preprocessor. You may actually
do this:

let pack_names tokens =
match tokens with
| NAME s1 :: WHITE s2:: NAME s3 :: t -> 
  NAME (String.concat [s1;s2;s3]) :: pack_names t
| WHITE :: t -> pack_names t
| h :: t -> h :: pack_names t
| [] -> []

Yeah, this isn't idea because it isn't tail recursive,
but it should illustrate the idea: do the hard stuff
in a language capable of handling the hard stuff easily .. :)

Tail rec version:

let pack output input = match input with
| NAME .... :: t -> 
  pack (NAME (Str.....) :: output) t
 ...
[] -> List.rev output

This technique can often be used to condition a
nasty language into an LALR(1) .. I managed to turn
Python into an LALR(1) languages .. but it took
17 preprocessing passes to do it .. :))
Mainly, fiddling with INDENT, UNDENT since Python
is based on indentation rules, but a lot of hassles
with the expression x,y, which is extremely hard
to parse (optional trailing comma ..)


-- 
John Skaller <skaller at users dot sourceforge dot net>


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-08-08  8:59 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-07 21:39 [Caml-list] ocamllex+ocamlyacc and not parsing properly Jonathan Roewen
2005-08-07 21:54 ` Jonathan Roewen
2005-08-07 23:58 ` Jon Harrop
2005-08-08  2:17   ` Jonathan Roewen
2005-08-08  4:23     ` Jonathan Roewen
2005-08-08  5:03       ` Jonathan Roewen
2005-08-08  6:39         ` Jon Harrop
2005-08-08  6:47           ` Jonathan Roewen
2005-08-08  8:59       ` skaller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).