From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.1.3 X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by yquem.inria.fr (Postfix) with ESMTP id DEA8ABBAF for ; Fri, 16 May 2008 17:25:17 +0200 (CEST) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AokAACtELUiGnQCBomdsb2JhbACSKQEBAQEBAQcFBgkRmzo X-IronPort-AV: E=Sophos;i="4.27,498,1204498800"; d="ml'?mli'?scan'208";a="12671328" Received: from shiva.jussieu.fr ([134.157.0.129]) by mail3-smtp-sop.national.inria.fr with ESMTP; 16 May 2008 17:25:17 +0200 Received: from hydrogene.pps.jussieu.fr (hydrogene.pps.jussieu.fr [134.157.168.1]) by shiva.jussieu.fr (8.14.2/jtpda-5.4) with ESMTP id m4GFO5X6053878 for ; Fri, 16 May 2008 17:24:17 +0200 (CEST) X-Ids:166 Received: from hydrogene.pps.jussieu.fr (localhost.localdomain [127.0.0.1]) by hydrogene.pps.jussieu.fr (8.13.4/jtpda-5.4) with ESMTP id m4GFO4B4025130 for ; Fri, 16 May 2008 17:24:04 +0200 Received: (from abate@localhost) by hydrogene.pps.jussieu.fr (8.13.4/8.13.2/Submit) id m4GFO4R8025129 for caml-list@yquem.inria.fr; Fri, 16 May 2008 17:24:04 +0200 Date: Fri, 16 May 2008 17:24:04 +0200 From: Pietro Abate To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] camlp4 and lexers Message-ID: <20080516152404.GA23302@uranium.pps.jussieu.fr> References: <20080515150033.GA31934@uranium.pps.jussieu.fr> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="45Z9DzgjV8m4Oswq" Content-Disposition: inline In-Reply-To: <20080515150033.GA31934@uranium.pps.jussieu.fr> X-Operating-System: GNU/Linux User-Agent: Mutt/1.5.9i X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-3.0 (shiva.jussieu.fr [134.157.0.166]); Fri, 16 May 2008 17:24:17 +0200 (CEST) X-Virus-Scanned: ClamAV 0.92/7136/Fri May 16 14:56:53 2008 on shiva.jussieu.fr X-Virus-Status: Clean X-Miltered: at jchkmail.jussieu.fr with ID 482DA715.005 by Joe's j-chkmail (http : // j-chkmail dot ensmp dot fr)! X-j-chkmail-Enveloppe: 482DA715.005/134.157.168.1/hydrogene.pps.jussieu.fr/hydrogene.pps.jussieu.fr/ X-j-chkmail-Score: MSGID : 482DA715.005 on jchkmail.jussieu.fr : j-chkmail score : . : R=. U=. O=. B=0.051 -> S=0.051 X-j-chkmail-Status: Ham X-Spam: no; 0.00; camlp:01 lexers:01 lexer:01 cduce:01 lexer:01 ocamllex:01 instanciate:01 camlp:01 sigs:01 parser:01 syntax:01 mli:01 nicolas':01 parser:01 seq:01 X-Attachments: name="ulexer.ml" name="ulexer.mli" name="myocamlbuild.ml" --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hi again. I have a minimal (?) lexer (attached) working with the grammar below. For the purpose of this excercise I used ulex. I started with the cduce lexer and removed all cduce-specific functions. However I'm not enterely happy. First I'd like to have another example using ocamllex and not ulex (one less dependecy), but I guess this is not too hard to do. Second, I've copy-pasted some code in the lexer to instanciate the camlp4 modules, but I'm not sure what is required and what is not. I mean, I can look at the camlp4 modules sigs, but without documentation there are a lot of functions that I don't really understand. Can anybody explain the signature of the Loc, Token and Error modules ? How these function used within the camlp4 parsing machinery ? - Token.match_keyword - Token.extract_string - Token.Filter.mk - Token.Filter.filter - Token.Filter.define_filter - Token.Filter.keyword_added - Token.Filter.keyword_removed Third, I'm not sure if this is the real minimal example I was looking for. I've the impression I could reuse the Camlp4.PreCast.Loc module, but I'm not sure if I can reuse the Camlp4.PreCast.Token since it is linked with the token type definition. I don't think I can reuse/extend the caml_token type... Making the lexer extensible would be a great ! Hope this helps. comments ? pietro This is the _tags file to compile it: ---------- _tags ------- "parser.ml": use_camlp4, pp(camlp4of) "ulexer.ml": pkg_ulex, use_camlp4, syntax_camlp4o "ulexer.mli": use_camlp4, pkg_ulex ----------- + nicolas' universal myocamlbuil.ml -------------------- parser.ml ----------------------- type t = Seq of t * t | Alt of t * t | Opt of t | Star of t | Plus of t | Dot | Sym of char open Ulexer module RegExGram = Camlp4.Struct.Grammar.Static.Make(Ulexer) let regex = RegExGram.Entry.mk "regex" (* I guess I don't need to use KWD *) EXTEND RegExGram GLOBAL: regex; regex: [[ e1 = SELF ; `KWD "|" ; e2 = concat -> Alt(e1,e2) | e1 = concat -> e1 ] ]; concat:[[ e1 = SELF ; `KWD ";"; e2 = seq -> Seq(e1,e2) | e1 = SELF ; e2 = seq -> Seq(e1,e2) | e1 = seq -> e1 ] ]; seq: [[ e1 = simple ; `KWD "?" -> Opt e1 | e1 = simple ; `KWD "*" -> Star e1 | e1 = simple ; `KWD "+" -> Plus e1 | e1 = simple -> e1 ] ]; simple:[[ `KWD "." -> Dot | `KWD "("; e1 = regex; `KWD ")" -> e1 | `CHAR(s) -> Sym s ] ]; END let from_string s = RegExGram.parse_string regex (Loc.mk "") s ------------------------------------------------------ --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="ulexer.ml" open Camlp4.PreCast module Loc = struct type t = int * int let mk _ = (0,0) let ghost = (-1,-1) let of_lexing_position _ = assert false let to_ocaml_location _ = assert false let of_ocaml_location _ = assert false let of_lexbuf _ = assert false let of_tuple _ = assert false let to_tuple _ = assert false let merge (x1, x2) (y1, y2) = (min x1 y1, max x2 y2) let join (x1, _) = (x1, x1) let move _ _ _ = assert false let shift _ _ = assert false let move_line _ _ = assert false let file_name _ = assert false let start_line _ = assert false let stop_line _ = assert false let start_bol _ = assert false let stop_bol _ = assert false let start_off = fst let stop_off = snd let start_pos _ = assert false let stop_pos _ = assert false let is_ghost _ = assert false let ghostify _ = assert false let set_file_name _ = assert false let strictly_before _ = assert false let make_absolute _ = assert false let print _ = assert false let dump _ = assert false let to_string _ = assert false exception Exc_located of t * exn let raise loc exn = match exn with | Exc_located _ -> raise exn | _ -> raise (Exc_located (loc, exn)) let name = ref "_loc" end type token = | KWD of string | CHAR of char | EOI module Token = struct open Format module Loc = Loc type t = token type token = t let sf = Printf.sprintf let to_string = function | CHAR s -> sf "CHAR %c" s | KWD s -> sf "KWD \"%s\"" s | EOI -> sf "EOI" let print ppf x = pp_print_string ppf (to_string x) let match_keyword kwd = function | KWD kwd' when kwd = kwd' -> true | _ -> false let extract_string = function | KWD s -> s | CHAR c -> String.make 1 c | tok -> invalid_arg ("Cannot extract a string from this token: "^ to_string tok) module Error = struct type t = string exception E of string let print = pp_print_string let to_string x = x end module Filter = struct type token_filter = (t, Loc.t) Camlp4.Sig.stream_filter type t = { is_kwd : string -> bool; mutable filter : token_filter } let mk is_kwd = { is_kwd = is_kwd; filter = (fun s -> s) } let filter x = let f tok loc = let tok' = tok in (tok', loc) in let rec filter = parser | [< '(tok, loc); s >] -> [< ' f tok loc; filter s >] | [< >] -> [< >] in fun strm -> x.filter (filter strm) let define_filter x f = x.filter <- f x.filter let keyword_added _ _ _ = () let keyword_removed _ _ = () end end module Error = Camlp4.Struct.EmptyError module L = Ulexing exception Error of int * int * string let error i j s = raise (Error (i,j,s)) (***********************************************************) (* Buffer for string literals *) let string_buff = Buffer.create 1024 let store_lexeme lexbuf = Buffer.add_string string_buff (Ulexing.utf8_lexeme lexbuf) let store_ascii = Buffer.add_char string_buff let store_code = Utf8.store string_buff let clear_buff () = Buffer.clear string_buff let get_stored_string () = let s = Buffer.contents string_buff in clear_buff (); Buffer.clear string_buff; s (***********************************************************) (* Lexer *) let illegal lexbuf = error (L.lexeme_start lexbuf) (L.lexeme_end lexbuf) "Illegal character" let return lexbuf tok = (tok, L.loc lexbuf) let return_loc i j tok = (tok, (i,j)) let rec token = lexer | [' ' '\t'] -> token lexbuf | ['*' '?' '+' '(' ')' ';' '|' ] -> let k = KWD (L.latin1_lexeme lexbuf) in return lexbuf k | ['A'-'Z' 'a'-'z'] -> let c = CHAR (L.latin1_lexeme_char lexbuf 0) in return lexbuf c | eof -> return lexbuf EOI | _ -> illegal lexbuf (***********************************************************) let enc = ref L.Latin1 let lexbuf = ref None let last_tok = ref (KWD "DUMMY") let raise_clean e = clear_buff (); raise e let mk () _loc cs = let lb = L.from_var_enc_stream enc cs in lexbuf := Some lb; let next _ = let tok, loc = try token lb with | Ulexing.Error -> raise_clean (Error (Ulexing.lexeme_end lb, Ulexing.lexeme_end lb, "Unexpected character")) | Ulexing.InvalidCodepoint i -> raise_clean (Error (Ulexing.lexeme_end lb, Ulexing.lexeme_end lb, "Code point invalid for the current encoding")) | e -> raise_clean e in last_tok := tok; Some (tok, loc) in Stream.from next --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="ulexer.mli" open Camlp4.Sig type token = | KWD of string | CHAR of char | EOI exception Error of int * int * string module Loc : Loc with type t = int * int module Token : Token with module Loc = Loc and type t = token module Error : Error val mk : unit -> (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t) --45Z9DzgjV8m4Oswq Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="myocamlbuild.ml" open Ocamlbuild_plugin open Command (* no longer needed for OCaml >= 3.10.2 *) (* these functions are not really officially exported *) let run_and_read = Ocamlbuild_pack.My_unix.run_and_read let blank_sep_strings = Ocamlbuild_pack.Lexers.blank_sep_strings (* this lists all supported packages *) let find_packages () = blank_sep_strings & Lexing.from_string & run_and_read "ocamlfind list | cut -d' ' -f1" (* this is supposed to list available syntaxes, but I don't know how to do it. *) let find_syntaxes () = ["camlp4o"; "camlp4r"] (* ocamlfind command *) let ocamlfind x = S[A"ocamlfind"; x] let _ = dispatch begin function | Before_options -> (* override default commands by ocamlfind ones *) Options.ocamlc := ocamlfind & A"ocamlc"; Options.ocamlopt := ocamlfind & A"ocamlopt"; Options.ocamldep := ocamlfind & A"ocamldep"; Options.ocamldoc := ocamlfind & A"ocamldoc" | After_rules -> (* When one link an OCaml library/binary/package, one should use -linkpkg *) flag ["ocaml"; "link"] & A"-linkpkg"; (* For each ocamlfind package one inject the -package option when * compiling, computing dependencies, generating documentation and * linking. *) List.iter begin fun pkg -> flag ["ocaml"; "compile"; "pkg_"^pkg] & S[A"-package"; A pkg]; flag ["ocaml"; "ocamldep"; "pkg_"^pkg] & S[A"-package"; A pkg]; flag ["ocaml"; "doc"; "pkg_"^pkg] & S[A"-package"; A pkg]; flag ["ocaml"; "link"; "pkg_"^pkg] & S[A"-package"; A pkg]; end (find_packages ()); (* Like -package but for extensions syntax. Morover -syntax is useless * when linking. *) List.iter begin fun syntax -> flag ["ocaml"; "compile"; "syntax_"^syntax] & S[A"-syntax"; A syntax]; flag ["ocaml"; "ocamldep"; "syntax_"^syntax] & S[A"-syntax"; A syntax]; flag ["ocaml"; "doc"; "syntax_"^syntax] & S[A"-syntax"; A syntax]; end (find_syntaxes ()); | _ -> () end --45Z9DzgjV8m4Oswq--