caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* index of substring
@ 2005-01-27  0:41 yjc01
  2005-01-27  4:21 ` [Caml-list] " Matt Gushee
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: yjc01 @ 2005-01-27  0:41 UTC (permalink / raw)
  To: caml-list

I tried to write a function to return the index of a substring in a string from
an index. So I can use it to extract some comments (text between (* and *) in a
file. But came across some problem when returning the value.

The code I came up with is: 
(* Use Unix Module *)
open Unix;;


(* get the file size *)
let fileReader = openfile "student.cd" [O_RDONLY] 0o640;;
let fileSize = (fstat fileReader).st_size;;
print_int fileSize;;

(* create variable to store file *)
let comments = String.make fileSize 'c';;

 
let read_char () = 
	read fileReader comments 0 fileSize;;


(* return index of a substring in a string from an index, if substring not
matched retun -1 *)
let index_left fromIndex string findString = 
	(* Step 1. Find first character *)
	let firstIndex = String.index_from string fromIndex (String.get findString 0) in
	
	   if (firstIndex >= 0)	then
		let isMatched = ref true in
		for i = 1 to (String.length findString) do
			if (String.get string firstIndex) != (String.get findString 0) 
			   then !isMatched = false;
			if (!isMatched)
			   then firstIndex 
			   else -1;
			
		done;;
				
(* test if index_left works *)
let check index = 
	 index_left index comments "(*" 
		

print_int check 1;;


Or is there an easier way to do this?

Daisy


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  0:41 index of substring yjc01
@ 2005-01-27  4:21 ` Matt Gushee
  2005-01-27  6:44 ` Radu Grigore
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Matt Gushee @ 2005-01-27  4:21 UTC (permalink / raw)
  To: caml-list

yjc01@doc.ic.ac.uk wrote:
> I tried to write a function to return the index of a substring in a string from
> an index. So I can use it to extract some comments (text between (* and *) in a
> file. But came across some problem when returning the value.

Well, I'm not the world's greatest OCaml expert, but I enjoy this sort 
of problem, so I took a stab at it.

First, a couple of quick comments on your code:

> (* create variable to store file *)
> let comments = String.make fileSize 'c';;

I don't really understand the need for this. If, as it appears, this 
string will hold the contents of the input file, why not just create a 
file-reading function that returns a string?

> (* return index of a substring in a string from an index, if substring not
> matched retun -1 *)

You must be a C programmer ;-) Not that it's necessarily wrong to return 
  -1. But there are OCaml idioms that to me seem more natural for this 
case. You could use an int option type ( None | Some x ), or you could 
keep the return type as an int, raising Not_found when appropriate.

> Or is there an easier way to do this?

There are a number of possible solutions. The easiest would be to use 
regexp matching (standard Str module, or 3rd-party Pcre). I'm sure you 
can figure that one out on your own, so I won't pursue it further. If 
you want a more robust solution, you should probably use one of the fine 
parsing tools available: ocamllex/ocamlyacc or camlp4. But those are 
large topics in their own right.

I worked up another solution that uses only the standard library. It's 
probably not easier, but it is mostly functional where your code was 
imperative, and is arguably more elegant. Maybe it will give you an idea 
or two.


(* val sync_start : char -> char Stream.t -> char Stream.t -> unit *)
let sync_start chr s1 s2 =
     (* "Sets" both streams to the position of the starting character.
      * A Stream.Failure exception may be raised; it is deliberately
      * not handled here. *)
     let rec sync () =
         match Stream.next s1 with
         (* Since the start character is removed from s1, s2 must
          * be advanced by one to match. *)
         | c when c = chr -> Stream.junk s2
         | _ -> sync () in
     sync ()

(* val index_left : ?start:int -> string -> string -> int *)
let index_left ?(start=0) src target =
     let first = target.[0]
     and sstream = Stream.of_string src in
     (* Set the source stream to the desired start position. *)
     for i = 1 to start do
         try ignore (Stream.next sstream)
         with Stream.Failure -> raise Not_found
     done;
     (* We need a fresh target stream each time the search is
      * repeated, so the following is a function. Hmm, there's
      * probably a better solution, but it hasn't occurred to
      * me yet. *)
     let mkts () = Stream.of_string target
     and advance s =
         (* The return type is char option because it will be used
          * in pattern matching. *)
         try Some (Stream.next s)
         with Stream.Failure -> None in
     let rec search tstream =
         ( try sync_start first sstream tstream
           (* If we reach the end of the source stream, then of course
            * that means the substring was not found, so there's no
            * point in continuing. *)
           with Stream.Failure -> raise Not_found );
         let startpos = Stream.count sstream - 1 in
         (* Continue searching once the start character is found. *)
         let rec continue () =
             match advance sstream, advance tstream with
               (* Everything matches up to this point. Continue
                * searching. *)
             | Some c1, Some c2 when c1 = c2 -> continue ()
               (* Substring is not matched here. Start over. *)
             | Some c1, Some c2 -> search (mkts ())
               (* The target stream is exhausted, no non-matching
                * characters found, so we have a complete match. Return
                * the starting index. *)
             | _, None -> startpos
               (* The source stream is exhausted, but one or more
                * characters remain in the target. No match. *)
             | None, Some _ -> raise Not_found in
         continue () in
     search (mkts ())


It occurs to me now that if you wanted to use this approach to search 
sequentially for different substrings, it would probably be best to 
create the source stream outside the search function. Then you would 
reuse that stream over several invocations of the function; each time 
you searched, the stream would be positioned at the end of the previous 
search.

Hope this helps a bit.

--
Matt Gushee
Englewood, CO USA


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  0:41 index of substring yjc01
  2005-01-27  4:21 ` [Caml-list] " Matt Gushee
@ 2005-01-27  6:44 ` Radu Grigore
  2005-01-27  9:36   ` skaller
  2005-01-27  8:37 ` Oliver Bandel
  2005-01-27  9:17 ` Xavier Leroy
  3 siblings, 1 reply; 11+ messages in thread
From: Radu Grigore @ 2005-01-27  6:44 UTC (permalink / raw)
  To: yjc01; +Cc: caml-list

On Thu, 27 Jan 2005 00:41:18 +0000, yjc01@doc.ic.ac.uk
<yjc01@doc.ic.ac.uk> wrote:
> Or is there an easier way to do this?

Ocamllex is definitely the easiest.

-- 
regards,
 radu
http://rgrig.idilis.ro/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  0:41 index of substring yjc01
  2005-01-27  4:21 ` [Caml-list] " Matt Gushee
  2005-01-27  6:44 ` Radu Grigore
@ 2005-01-27  8:37 ` Oliver Bandel
  2005-01-27  9:17 ` Xavier Leroy
  3 siblings, 0 replies; 11+ messages in thread
From: Oliver Bandel @ 2005-01-27  8:37 UTC (permalink / raw)
  To: caml-list

On Thu, Jan 27, 2005 at 12:41:18AM +0000, yjc01@doc.ic.ac.uk wrote:
> I tried to write a function to return the index of a substring in a string from
> an index.

What? Can you explain your problem with an example?

What do you want to achieve?

Example of input/output-data...?!



> So I can use it to extract some comments (text between (* and *) in a
> file. But came across some problem when returning the value.

Well, if you want to return the comments, you may use
ocamllex. There is a tutorial on ocamllex, which some
weeks ago was announced here. I have read it and it should
explain enough and detailed so that you may use ocamllex.

But BTW: There is a Str-module. You can get index-values
for matched substrings with it.
You have to use a regular expression (regexp) to match the
strings. You first have to compile the regexp with Str.regexp.
After you have compiled the regexp-string (definition) into a
regexp-type with Str.regexp, you can use that compiled
regexp with the matching functions of the Str-module.


> 
> The code I came up with is: 
> (* Use Unix Module *)
> open Unix;;
> 
> 
> (* get the file size *)
> let fileReader = openfile "student.cd" [O_RDONLY] 0o640;;
> let fileSize = (fstat fileReader).st_size;;
> print_int fileSize;;
> 
> (* create variable to store file *)
> let comments = String.make fileSize 'c';;
> 
>  
> let read_char () = 
> 	read fileReader comments 0 fileSize;;

let filename = "student.cd"

let inchannel = open_in filename

You can read such channels with many functions.
They are all documented in the OCaml reference manual.
Look in chapter 19 of the Manual (Core Library)
and look for "Pervasives".

For example you can read one line from that channel with
the function input_line.

So with the definitions above, you read a line as follows:

let next_line () = input_line inchannel


After you are ready with that file (you get an exception which
indicates end of file), you can close it with the close_in function.

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  0:41 index of substring yjc01
                   ` (2 preceding siblings ...)
  2005-01-27  8:37 ` Oliver Bandel
@ 2005-01-27  9:17 ` Xavier Leroy
  3 siblings, 0 replies; 11+ messages in thread
From: Xavier Leroy @ 2005-01-27  9:17 UTC (permalink / raw)
  To: yjc01; +Cc: caml-list

> I tried to write a function to return the index of a substring in a
> string from an index. [...]

May I suggest asking the OCaml beginner's list instead?
      http://groups.yahoo.com/group/ocaml_beginners

You're more likely to find people willing to help you with your coding
there than in this list.

Thanks,

- Xavier Leroy


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  6:44 ` Radu Grigore
@ 2005-01-27  9:36   ` skaller
  2005-01-27 10:20     ` Jean-Christophe Filliatre
  0 siblings, 1 reply; 11+ messages in thread
From: skaller @ 2005-01-27  9:36 UTC (permalink / raw)
  To: Radu Grigore; +Cc: yjc01, caml-list

On Thu, 2005-01-27 at 17:44, Radu Grigore wrote:
> On Thu, 27 Jan 2005 00:41:18 +0000, yjc01@doc.ic.ac.uk
> <yjc01@doc.ic.ac.uk> wrote:
> > Or is there an easier way to do this?
> 
> Ocamllex is definitely the easiest.

however by itself this will not supported nested
comments which Ocaml allows .. :)

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27  9:36   ` skaller
@ 2005-01-27 10:20     ` Jean-Christophe Filliatre
  2005-01-27 16:55       ` skaller
  0 siblings, 1 reply; 11+ messages in thread
From: Jean-Christophe Filliatre @ 2005-01-27 10:20 UTC (permalink / raw)
  To: skaller; +Cc: Radu Grigore, yjc01, caml-list


skaller writes:
 > > 
 > > Ocamllex is definitely the easiest.
 > 
 > however by itself this will not supported nested
 > comments which Ocaml allows .. :)

What's wrong with this way of handling nested comments with ocamllex:

======================================================================

| "(*" { comment lexbuf; ... }
...

and comment = parse
| "(*" { comment lexbuf; comment lexbuf }
| "*)" { () }
| _    { comment lexbuf }

======================================================================

(I may have misunderstood the issue here, since I didn't follow this
thread in details...)

-- 
Jean-Christophe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27 10:20     ` Jean-Christophe Filliatre
@ 2005-01-27 16:55       ` skaller
  2005-01-27 17:20         ` Radu Grigore
  2005-01-28  8:51         ` Jean-Christophe Filliatre
  0 siblings, 2 replies; 11+ messages in thread
From: skaller @ 2005-01-27 16:55 UTC (permalink / raw)
  To: Jean-Christophe Filliatre; +Cc: Radu Grigore, yjc01, caml-list

On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:
> 
> What's wrong with this way of handling nested comments with ocamllex:
> 
> ======================================================================
> 
> | "(*" { comment lexbuf; ... }
> ...
> 
> and comment = parse
> | "(*" { comment lexbuf; comment lexbuf }
> | "*)" { () }
> | _    { comment lexbuf }
> 

Well it doesn't handle (* or *) in strings ..

However, whilst this code (given a fix
to handle strings) would probably work,
it isn't a plain lexer, but a mix of lexing
and client code. 

In particular the recursion only works 'by luck of 
the implementation' that allows sharing of a lexbuf
between two lexer functions -- who knows how
the lexer handles fallbacks?

I have actually being toying with an implementation
that does a bit of 'Boyer/Moore' processing
to reduce or eliminate the need to rescan characters.

The idea is that whenever to reach an accepting
state, you create a new automaton, and begin
scanning from the start state. When you finally
fail you can fallback to the last success,
but to lex the next token you do not need
to rescan from the start state with the next
character, since you have already done that.

Although this idea may actually lead to faster
processing sometimes, a big advantage is that
it works with input iterators, whereas fallback
and rescan requires a forward iterator. 
An input iterator can be converted to a forward
one, but only with a potentially infinite buffer.

In theory this means that whilst a deterministic
finite state automaton can *recognize* a string
with finite memory in linear time, tokenisation
is in fact NOT finite state at all, but requires
linear state -- which is a major disadvantage.

By actually merging the automata described above,
it may be possible for the 'regexp compiler' to
bound the extra state required for some
regexps, obtaining finite state tokenisers,
whereas at present lexers are not.

Anyhow, the point is that with such an augmented
lexer it is not clear 'client code' would be allowed
to call another (or the same) lexer with the same
lexbuf.

[In other words, your code is not transparent.]

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27 16:55       ` skaller
@ 2005-01-27 17:20         ` Radu Grigore
  2005-01-28  8:51         ` Jean-Christophe Filliatre
  1 sibling, 0 replies; 11+ messages in thread
From: Radu Grigore @ 2005-01-27 17:20 UTC (permalink / raw)
  To: skaller; +Cc: Jean-Christophe Filliatre, yjc01, caml-list

On 28 Jan 2005 03:55:36 +1100, skaller <skaller@users.sourceforge.net> wrote:
> Well it doesn't handle (* or *) in strings ..

Making things complicated just for fun isn't actually fun.

The fact is that a finite automaton by itself cannot detect matching
delimiters. Finite implies that the automaton can't count (at least
countable sets) and matching requires counting. Sure, there are lots
of nuances here: what does countable mean when you use 32b for
integers,... but that's splitting the hair. The basic idea remains
even after you look at these additional details.

So if you want to detect matching delimiters you need a counter. You
can either use an explicit one or use the stack as Jean-Christophe
Filliatre did.

> In particular the recursion only works 'by luck of
> the implementation' that allows sharing of a lexbuf
> between two lexer functions

What do you mean "by luck"? The documentation clearly passes the
lexbuf from one lexer to the other a few times. If the spec says it's
possible then in what way is it "luck"?

>  -- who knows how
> the lexer handles fallbacks?

IMO adding a 
  _ -> failwith "some error message"
whenever the programmer fails to provide an explicit "else" branch is
just fine. In fact I would prefer a "compile" error to be emitted by
ocamllex (or at least a warning) when such a clause is missing from
one lexer.

-- 
regards,
 radu
http://rgrig.idilis.ro/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-27 16:55       ` skaller
  2005-01-27 17:20         ` Radu Grigore
@ 2005-01-28  8:51         ` Jean-Christophe Filliatre
  2005-01-28 13:31           ` skaller
  1 sibling, 1 reply; 11+ messages in thread
From: Jean-Christophe Filliatre @ 2005-01-28  8:51 UTC (permalink / raw)
  To: skaller; +Cc: Radu Grigore, yjc01, caml-list


skaller writes:
 > On Thu, 2005-01-27 at 21:20, Jean-Christophe Filliatre wrote:
 > > 
 > > What's wrong with this way of handling nested comments with ocamllex:
 > > 
 > > ======================================================================
 > > 
 > > | "(*" { comment lexbuf; ... }
 > > ...
 > > 
 > > and comment = parse
 > > | "(*" { comment lexbuf; comment lexbuf }
 > > | "*)" { () }
 > > | _    { comment lexbuf }
 > > 
 > 
 > Well it doesn't handle (* or *) in strings ..

I was only focusing on nested comments, but handling strings is rather
trivial.  See the ocaml  lexer for  instance (where  you can  see that
introducing a  new lexing function "string" is  definitely better than
trying to write a regular expression for strings literals).

 > However, whilst this code (given a fix
 > to handle strings) would probably work,
 > it isn't a plain lexer, but a mix of lexing
 > and client code. 

Sure,  it is  clearly more  powerful than  an automaton  (since nested
comments are not regular), if this  is what you mean by "isn't a plain
lexer".

But that's precisely why ocamllex is so a powerful tool. You only need
to  know  that  ocamllex  is  building a  set  of  mutually  recusvive
functions with the lexbuf as argument  and then you are not limited in
what you can do in the actions. You can even pass additional arguments
to the lexing functions. 

I like to  think about ocamllex as a general-purpose  tool to insert a
bit  of  regular  expressions  in  ocaml programs  (lexers,  but  also
filters, file  formats readers,  line counters, code  indenters, etc.)
and not only  as a tool to write lexers. With  the header and trailer,
it  is even  easy  to build  a  whole ocaml  program  within a  single
ocamllex file.  For  instance, two programs of mine  I use intensively
are a program to count lines  of code and comment in my ocaml programs
(a 173 lines  long ocamllex file) and a preprocessor  for my web pages
(a 129 lines long ocamllex file).

-- 
Jean-Christophe Filliâtre (http://www.lri.fr/~filliatr)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Caml-list] index of substring
  2005-01-28  8:51         ` Jean-Christophe Filliatre
@ 2005-01-28 13:31           ` skaller
  0 siblings, 0 replies; 11+ messages in thread
From: skaller @ 2005-01-28 13:31 UTC (permalink / raw)
  To: Jean-Christophe Filliatre; +Cc: Radu Grigore, yjc01, caml-list

On Fri, 2005-01-28 at 19:51, Jean-Christophe Filliatre wrote:

> But that's precisely why ocamllex is so a powerful tool. You only need
> to  know  that  ocamllex  is  building a  set  of  mutually  recusvive
> functions with the lexbuf as argument  and then you are not limited in
> what you can do in the actions. You can even pass additional arguments
> to the lexing functions. 

Yes, this is all true, provided you think of your code
as a stream pattern matcher: unfortunately lexbufs are
mutable rather than functional (which I suppose is necessary
for performance).

> I like to  think about ocamllex as a general-purpose  tool to insert a
> bit  of  regular  expressions  in  ocaml programs 

Yes, although the code is control inverted initially,
that is, your client action code is *called* with a lexeme and
lexbuf (and callbacks are a bit nasty sometimes).

Of course you also call the lexer to fetch what the callback
returns .. so yes I guess you can see this as 'inserting' a bit
of regular expressions into an ocaml program.

Interesting tradeoffs here! Ocaml lexers allow input
iterators (eg file streams) by using a lexbuf,
but aren't functional because the lexbuf modified.

Felix uses a C++ style forward iterator, and is
purely functional -- the end of lexeme pointer is
returned to be used in the next call as the new
starting point -- but if you want to handle
an input iterator you have to manage the buffering
yourself. [Ahem .. which can't be done transparently
at the moment .. woops .. design fault!]

-- 
John Skaller, mailto:skaller@users.sf.net
voice: 061-2-9660-0850, 
snail: PO BOX 401 Glebe NSW 2037 Australia
Checkout the Felix programming language http://felix.sf.net




^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2005-01-28 13:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-01-27  0:41 index of substring yjc01
2005-01-27  4:21 ` [Caml-list] " Matt Gushee
2005-01-27  6:44 ` Radu Grigore
2005-01-27  9:36   ` skaller
2005-01-27 10:20     ` Jean-Christophe Filliatre
2005-01-27 16:55       ` skaller
2005-01-27 17:20         ` Radu Grigore
2005-01-28  8:51         ` Jean-Christophe Filliatre
2005-01-28 13:31           ` skaller
2005-01-27  8:37 ` Oliver Bandel
2005-01-27  9:17 ` Xavier Leroy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).