* index of substring @ 2005-01-27 0:41 yjc01 2005-01-27 4:21 ` [Caml-list] " Matt Gushee ` (3 more replies) 0 siblings, 4 replies; 11+ messages in thread From: yjc01 @ 2005-01-27 0:41 UTC (permalink / raw) To: caml-list I tried to write a function to return the index of a substring in a string from an index. So I can use it to extract some comments (text between (* and *) in a file. But came across some problem when returning the value. The code I came up with is: (* Use Unix Module *) open Unix;; (* get the file size *) let fileReader = openfile "student.cd" [O_RDONLY] 0o640;; let fileSize = (fstat fileReader).st_size;; print_int fileSize;; (* create variable to store file *) let comments = String.make fileSize 'c';; let read_char () = read fileReader comments 0 fileSize;; (* return index of a substring in a string from an index, if substring not matched retun -1 *) let index_left fromIndex string findString = (* Step 1. Find first character *) let firstIndex = String.index_from string fromIndex (String.get findString 0) in if (firstIndex >= 0) then let isMatched = ref true in for i = 1 to (String.length findString) do if (String.get string firstIndex) != (String.get findString 0) then !isMatched = false; if (!isMatched) then firstIndex else -1; done;; (* test if index_left works *) let check index = index_left index comments "(*" print_int check 1;; Or is there an easier way to do this? Daisy ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 0:41 index of substring yjc01 @ 2005-01-27 4:21 ` Matt Gushee 2005-01-27 6:44 ` Radu Grigore ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Matt Gushee @ 2005-01-27 4:21 UTC (permalink / raw) To: caml-list yjc01@doc.ic.ac.uk wrote: > I tried to write a function to return the index of a substring in a string from > an index. So I can use it to extract some comments (text between (* and *) in a > file. But came across some problem when returning the value. Well, I'm not the world's greatest OCaml expert, but I enjoy this sort of problem, so I took a stab at it. First, a couple of quick comments on your code: > (* create variable to store file *) > let comments = String.make fileSize 'c';; I don't really understand the need for this. If, as it appears, this string will hold the contents of the input file, why not just create a file-reading function that returns a string? > (* return index of a substring in a string from an index, if substring not > matched retun -1 *) You must be a C programmer ;-) Not that it's necessarily wrong to return -1. But there are OCaml idioms that to me seem more natural for this case. You could use an int option type ( None | Some x ), or you could keep the return type as an int, raising Not_found when appropriate. > Or is there an easier way to do this? There are a number of possible solutions. The easiest would be to use regexp matching (standard Str module, or 3rd-party Pcre). I'm sure you can figure that one out on your own, so I won't pursue it further. If you want a more robust solution, you should probably use one of the fine parsing tools available: ocamllex/ocamlyacc or camlp4. But those are large topics in their own right. I worked up another solution that uses only the standard library. It's probably not easier, but it is mostly functional where your code was imperative, and is arguably more elegant. Maybe it will give you an idea or two. (* val sync_start : char -> char Stream.t -> char Stream.t -> unit *) let sync_start chr s1 s2 = (* "Sets" both streams to the position of the starting character. * A Stream.Failure exception may be raised; it is deliberately * not handled here. *) let rec sync () = match Stream.next s1 with (* Since the start character is removed from s1, s2 must * be advanced by one to match. *) | c when c = chr -> Stream.junk s2 | _ -> sync () in sync () (* val index_left : ?start:int -> string -> string -> int *) let index_left ?(start=0) src target = let first = target.[0] and sstream = Stream.of_string src in (* Set the source stream to the desired start position. *) for i = 1 to start do try ignore (Stream.next sstream) with Stream.Failure -> raise Not_found done; (* We need a fresh target stream each time the search is * repeated, so the following is a function. Hmm, there's * probably a better solution, but it hasn't occurred to * me yet. *) let mkts () = Stream.of_string target and advance s = (* The return type is char option because it will be used * in pattern matching. *) try Some (Stream.next s) with Stream.Failure -> None in let rec search tstream = ( try sync_start first sstream tstream (* If we reach the end of the source stream, then of course * that means the substring was not found, so there's no * point in continuing. *) with Stream.Failure -> raise Not_found ); let startpos = Stream.count sstream - 1 in (* Continue searching once the start character is found. *) let rec continue () = match advance sstream, advance tstream with (* Everything matches up to this point. Continue * searching. *) | Some c1, Some c2 when c1 = c2 -> continue () (* Substring is not matched here. Start over. *) | Some c1, Some c2 -> search (mkts ()) (* The target stream is exhausted, no non-matching * characters found, so we have a complete match. Return * the starting index. *) | _, None -> startpos (* The source stream is exhausted, but one or more * characters remain in the target. No match. *) | None, Some _ -> raise Not_found in continue () in search (mkts ()) It occurs to me now that if you wanted to use this approach to search sequentially for different substrings, it would probably be best to create the source stream outside the search function. Then you would reuse that stream over several invocations of the function; each time you searched, the stream would be positioned at the end of the previous search. Hope this helps a bit. -- Matt Gushee Englewood, CO USA ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 0:41 index of substring yjc01 2005-01-27 4:21 ` [Caml-list] " Matt Gushee @ 2005-01-27 6:44 ` Radu Grigore 2005-01-27 9:36 ` skaller 2005-01-27 8:37 ` Oliver Bandel 2005-01-27 9:17 ` Xavier Leroy 3 siblings, 1 reply; 11+ messages in thread From: Radu Grigore @ 2005-01-27 6:44 UTC (permalink / raw) To: yjc01; +Cc: caml-list On Thu, 27 Jan 2005 00:41:18 +0000, yjc01@doc.ic.ac.uk <yjc01@doc.ic.ac.uk> wrote: > Or is there an easier way to do this? Ocamllex is definitely the easiest. -- regards, radu http://rgrig.idilis.ro/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 6:44 ` Radu Grigore @ 2005-01-27 9:36 ` skaller 2005-01-27 10:20 ` Jean-Christophe Filliatre 0 siblings, 1 reply; 11+ messages in thread From: skaller @ 2005-01-27 9:36 UTC (permalink / raw) To: Radu Grigore; +Cc: yjc01, caml-list On Thu, 2005-01-27 at 17:44, Radu Grigore wrote: > On Thu, 27 Jan 2005 00:41:18 +0000, yjc01@doc.ic.ac.uk > <yjc01@doc.ic.ac.uk> wrote: > > Or is there an easier way to do this? > > Ocamllex is definitely the easiest. however by itself this will not supported nested comments which Ocaml allows .. :) -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 9:36 ` skaller @ 2005-01-27 10:20 ` Jean-Christophe Filliatre 2005-01-27 16:55 ` skaller 0 siblings, 1 reply; 11+ messages in thread From: Jean-Christophe Filliatre @ 2005-01-27 10:20 UTC (permalink / raw) To: skaller; +Cc: Radu Grigore, yjc01, caml-list skaller writes: > > > > Ocamllex is definitely the easiest. > > however by itself this will not supported nested > comments which Ocaml allows .. :) What's wrong with this way of handling nested comments with ocamllex: ====================================================================== | "(*" { comment lexbuf; ... } ... and comment = parse | "(*" { comment lexbuf; comment lexbuf } | "*)" { () } | _ { comment lexbuf } ====================================================================== (I may have misunderstood the issue here, since I didn't follow this thread in details...) -- Jean-Christophe ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 10:20 ` Jean-Christophe Filliatre @ 2005-01-27 16:55 ` skaller 2005-01-27 17:20 ` Radu Grigore 2005-01-28 8:51 ` Jean-Christophe Filliatre 0 siblings, 2 replies; 11+ messages in thread From: skaller @ 2005-01-27 16:55 UTC (permalink / raw) To: Jean-Christophe Filliatre; +Cc: Radu Grigore, yjc01, caml-list On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote: > > What's wrong with this way of handling nested comments with ocamllex: > > ====================================================================== > > | "(*" { comment lexbuf; ... } > ... > > and comment = parse > | "(*" { comment lexbuf; comment lexbuf } > | "*)" { () } > | _ { comment lexbuf } > Well it doesn't handle (* or *) in strings .. However, whilst this code (given a fix to handle strings) would probably work, it isn't a plain lexer, but a mix of lexing and client code. In particular the recursion only works 'by luck of the implementation' that allows sharing of a lexbuf between two lexer functions -- who knows how the lexer handles fallbacks? I have actually being toying with an implementation that does a bit of 'Boyer/Moore' processing to reduce or eliminate the need to rescan characters. The idea is that whenever to reach an accepting state, you create a new automaton, and begin scanning from the start state. When you finally fail you can fallback to the last success, but to lex the next token you do not need to rescan from the start state with the next character, since you have already done that. Although this idea may actually lead to faster processing sometimes, a big advantage is that it works with input iterators, whereas fallback and rescan requires a forward iterator. An input iterator can be converted to a forward one, but only with a potentially infinite buffer. In theory this means that whilst a deterministic finite state automaton can *recognize* a string with finite memory in linear time, tokenisation is in fact NOT finite state at all, but requires linear state -- which is a major disadvantage. By actually merging the automata described above, it may be possible for the 'regexp compiler' to bound the extra state required for some regexps, obtaining finite state tokenisers, whereas at present lexers are not. Anyhow, the point is that with such an augmented lexer it is not clear 'client code' would be allowed to call another (or the same) lexer with the same lexbuf. [In other words, your code is not transparent.] -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 16:55 ` skaller @ 2005-01-27 17:20 ` Radu Grigore 2005-01-28 8:51 ` Jean-Christophe Filliatre 1 sibling, 0 replies; 11+ messages in thread From: Radu Grigore @ 2005-01-27 17:20 UTC (permalink / raw) To: skaller; +Cc: Jean-Christophe Filliatre, yjc01, caml-list On 28 Jan 2005 03:55:36 +1100, skaller <skaller@users.sourceforge.net> wrote: > Well it doesn't handle (* or *) in strings .. Making things complicated just for fun isn't actually fun. The fact is that a finite automaton by itself cannot detect matching delimiters. Finite implies that the automaton can't count (at least countable sets) and matching requires counting. Sure, there are lots of nuances here: what does countable mean when you use 32b for integers,... but that's splitting the hair. The basic idea remains even after you look at these additional details. So if you want to detect matching delimiters you need a counter. You can either use an explicit one or use the stack as Jean-Christophe Filliatre did. > In particular the recursion only works 'by luck of > the implementation' that allows sharing of a lexbuf > between two lexer functions What do you mean "by luck"? The documentation clearly passes the lexbuf from one lexer to the other a few times. If the spec says it's possible then in what way is it "luck"? > -- who knows how > the lexer handles fallbacks? IMO adding a _ -> failwith "some error message" whenever the programmer fails to provide an explicit "else" branch is just fine. In fact I would prefer a "compile" error to be emitted by ocamllex (or at least a warning) when such a clause is missing from one lexer. -- regards, radu http://rgrig.idilis.ro/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 16:55 ` skaller 2005-01-27 17:20 ` Radu Grigore @ 2005-01-28 8:51 ` Jean-Christophe Filliatre 2005-01-28 13:31 ` skaller 1 sibling, 1 reply; 11+ messages in thread From: Jean-Christophe Filliatre @ 2005-01-28 8:51 UTC (permalink / raw) To: skaller; +Cc: Radu Grigore, yjc01, caml-list skaller writes: > On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote: > > > > What's wrong with this way of handling nested comments with ocamllex: > > > > ====================================================================== > > > > | "(*" { comment lexbuf; ... } > > ... > > > > and comment = parse > > | "(*" { comment lexbuf; comment lexbuf } > > | "*)" { () } > > | _ { comment lexbuf } > > > > Well it doesn't handle (* or *) in strings .. I was only focusing on nested comments, but handling strings is rather trivial. See the ocaml lexer for instance (where you can see that introducing a new lexing function "string" is definitely better than trying to write a regular expression for strings literals). > However, whilst this code (given a fix > to handle strings) would probably work, > it isn't a plain lexer, but a mix of lexing > and client code. Sure, it is clearly more powerful than an automaton (since nested comments are not regular), if this is what you mean by "isn't a plain lexer". But that's precisely why ocamllex is so a powerful tool. You only need to know that ocamllex is building a set of mutually recusvive functions with the lexbuf as argument and then you are not limited in what you can do in the actions. You can even pass additional arguments to the lexing functions. I like to think about ocamllex as a general-purpose tool to insert a bit of regular expressions in ocaml programs (lexers, but also filters, file formats readers, line counters, code indenters, etc.) and not only as a tool to write lexers. With the header and trailer, it is even easy to build a whole ocaml program within a single ocamllex file. For instance, two programs of mine I use intensively are a program to count lines of code and comment in my ocaml programs (a 173 lines long ocamllex file) and a preprocessor for my web pages (a 129 lines long ocamllex file). -- Jean-Christophe Filliâtre (http://www.lri.fr/~filliatr) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-28 8:51 ` Jean-Christophe Filliatre @ 2005-01-28 13:31 ` skaller 0 siblings, 0 replies; 11+ messages in thread From: skaller @ 2005-01-28 13:31 UTC (permalink / raw) To: Jean-Christophe Filliatre; +Cc: Radu Grigore, yjc01, caml-list On Fri, 2005-01-28 at 19:51, Jean-Christophe Filliatre wrote: > But that's precisely why ocamllex is so a powerful tool. You only need > to know that ocamllex is building a set of mutually recusvive > functions with the lexbuf as argument and then you are not limited in > what you can do in the actions. You can even pass additional arguments > to the lexing functions. Yes, this is all true, provided you think of your code as a stream pattern matcher: unfortunately lexbufs are mutable rather than functional (which I suppose is necessary for performance). > I like to think about ocamllex as a general-purpose tool to insert a > bit of regular expressions in ocaml programs Yes, although the code is control inverted initially, that is, your client action code is *called* with a lexeme and lexbuf (and callbacks are a bit nasty sometimes). Of course you also call the lexer to fetch what the callback returns .. so yes I guess you can see this as 'inserting' a bit of regular expressions into an ocaml program. Interesting tradeoffs here! Ocaml lexers allow input iterators (eg file streams) by using a lexbuf, but aren't functional because the lexbuf modified. Felix uses a C++ style forward iterator, and is purely functional -- the end of lexeme pointer is returned to be used in the next call as the new starting point -- but if you want to handle an input iterator you have to manage the buffering yourself. [Ahem .. which can't be done transparently at the moment .. woops .. design fault!] -- John Skaller, mailto:skaller@users.sf.net voice: 061-2-9660-0850, snail: PO BOX 401 Glebe NSW 2037 Australia Checkout the Felix programming language http://felix.sf.net ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 0:41 index of substring yjc01 2005-01-27 4:21 ` [Caml-list] " Matt Gushee 2005-01-27 6:44 ` Radu Grigore @ 2005-01-27 8:37 ` Oliver Bandel 2005-01-27 9:17 ` Xavier Leroy 3 siblings, 0 replies; 11+ messages in thread From: Oliver Bandel @ 2005-01-27 8:37 UTC (permalink / raw) To: caml-list On Thu, Jan 27, 2005 at 12:41:18AM +0000, yjc01@doc.ic.ac.uk wrote: > I tried to write a function to return the index of a substring in a string from > an index. What? Can you explain your problem with an example? What do you want to achieve? Example of input/output-data...?! > So I can use it to extract some comments (text between (* and *) in a > file. But came across some problem when returning the value. Well, if you want to return the comments, you may use ocamllex. There is a tutorial on ocamllex, which some weeks ago was announced here. I have read it and it should explain enough and detailed so that you may use ocamllex. But BTW: There is a Str-module. You can get index-values for matched substrings with it. You have to use a regular expression (regexp) to match the strings. You first have to compile the regexp with Str.regexp. After you have compiled the regexp-string (definition) into a regexp-type with Str.regexp, you can use that compiled regexp with the matching functions of the Str-module. > > The code I came up with is: > (* Use Unix Module *) > open Unix;; > > > (* get the file size *) > let fileReader = openfile "student.cd" [O_RDONLY] 0o640;; > let fileSize = (fstat fileReader).st_size;; > print_int fileSize;; > > (* create variable to store file *) > let comments = String.make fileSize 'c';; > > > let read_char () = > read fileReader comments 0 fileSize;; let filename = "student.cd" let inchannel = open_in filename You can read such channels with many functions. They are all documented in the OCaml reference manual. Look in chapter 19 of the Manual (Core Library) and look for "Pervasives". For example you can read one line from that channel with the function input_line. So with the definitions above, you read a line as follows: let next_line () = input_line inchannel After you are ready with that file (you get an exception which indicates end of file), you can close it with the close_in function. Ciao, Oliver ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Caml-list] index of substring 2005-01-27 0:41 index of substring yjc01 ` (2 preceding siblings ...) 2005-01-27 8:37 ` Oliver Bandel @ 2005-01-27 9:17 ` Xavier Leroy 3 siblings, 0 replies; 11+ messages in thread From: Xavier Leroy @ 2005-01-27 9:17 UTC (permalink / raw) To: yjc01; +Cc: caml-list > I tried to write a function to return the index of a substring in a > string from an index. [...] May I suggest asking the OCaml beginner's list instead? http://groups.yahoo.com/group/ocaml_beginners You're more likely to find people willing to help you with your coding there than in this list. Thanks, - Xavier Leroy ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2005-01-28 13:31 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-01-27 0:41 index of substring yjc01 2005-01-27 4:21 ` [Caml-list] " Matt Gushee 2005-01-27 6:44 ` Radu Grigore 2005-01-27 9:36 ` skaller 2005-01-27 10:20 ` Jean-Christophe Filliatre 2005-01-27 16:55 ` skaller 2005-01-27 17:20 ` Radu Grigore 2005-01-28 8:51 ` Jean-Christophe Filliatre 2005-01-28 13:31 ` skaller 2005-01-27 8:37 ` Oliver Bandel 2005-01-27 9:17 ` Xavier Leroy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).