caml-list - the Caml user's mailing list
* [Caml-list] Bug somewhere...
@ 2002-10-06 22:57 Alessandro Baretta
  2002-10-06 23:06 ` Alessandro Baretta
  2002-10-07  8:03 ` Pierre Weis
  0 siblings, 2 replies; 6+ messages in thread
From: Alessandro Baretta @ 2002-10-06 22:57 UTC (permalink / raw)
  To: Ocaml

It's either in my brain or in the Scanf module, the former 
possibility being definitely more likely.

I have written a very simple program to compute md5 
checksums of codes taken from a text file. Here it is:

let scan_line () = Scanf.scanf "%[^\n\r]\n" (fun a -> a)
let digest s = String.uppercase
   (Digest.to_hex(Digest.string s))
let digest_line s = print_endline (s ^ "#" ^ (digest s))
let _ = try while true do digest_line (scan_line ()) done
   with End_of_file -> ()


Seems very reasonable...

Here's the input file:

(2002) DMD.CSB.1GL.001.01
(2002) DMD.CSB.1GL.001.02
(2002) DMD.CSB.1GL.001.03
(2002) DMD.CSB.1GL.001.04
(2002) DMD.CSB.1GL.001.05
(2002) DMD.CSB.1GL.001.06
(2002) DMD.CSB.1GL.001.07
(2002) DMD.CSB.1GL.001.08
(2002) DMD.CSB.1GL.001.09
(2002) DMD.CSB.1GL.001.10
(2002) DMD.CSB.1GL.001.11
(2002) DMD.CSB.1GL.001.12
(2002) DMD.CSB.1GL.001.13
(2002) DMD.CSB.1GL.001.14
(2002) DMD.CSB.1GL.001.15
(2002) DMD.CSB.1GL.001.16
(2002) DMD.CSB.1GL.001.17
(2002) DMD.CSB.1GL.001.18
(2002) DMD.CSB.1GL.001.19
(2002) DMD.CSB.1GL.001.20


Now here's the output file:
(2002) DMD.CSB.1GL.001.01#EA486F3F11C1D1E5BE6DDC2A444BC4E1
2002) DMD.CSB.1GL.001.02#4A3E838023756A5EE01C39D5DD02FC07
2002) DMD.CSB.1GL.001.03#605ED19A81C3B7748494038FEE93671A
2002) DMD.CSB.1GL.001.04#F475498E61CC896FA42B3869858B9B69
2002) DMD.CSB.1GL.001.05#60246106058EA46F7C5904F9A7D69FD7
2002) DMD.CSB.1GL.001.06#3FDF89041B44A8A3F5334B500A8B48A0
2002) DMD.CSB.1GL.001.07#657A508D402845454D5EAF0A2BC8380B
2002) DMD.CSB.1GL.001.08#230BDE6A530043CCB01434A6E19DB10E
2002) DMD.CSB.1GL.001.09#39CA6A302A6DE081DFC3BD24C8D4C38E
2002) DMD.CSB.1GL.001.10#BFBAE55D0808B5A8729E23459E45A617
2002) DMD.CSB.1GL.001.11#001F0B9F7F5EEDE05C8BA5A85F7D0F45
2002) DMD.CSB.1GL.001.12#77AB75131372E7FB723B280E084733B0
2002) DMD.CSB.1GL.001.13#1E605246D240D6B5735CDE40FF4614CC
2002) DMD.CSB.1GL.001.14#40970C955978A228AA308AB1B1169800
2002) DMD.CSB.1GL.001.15#7DED9C18A5700389CE670C9E8474C757
2002) DMD.CSB.1GL.001.16#8D396925D7867AF0BF2169B692EAECFF
2002) DMD.CSB.1GL.001.17#DEE78191DEF1E6BA7144AA14E29B8EE6
2002) DMD.CSB.1GL.001.18#F6E082FFD976B0A6721AC056C40C526E
2002) DMD.CSB.1GL.001.19#34F915DBF5B258C7BD4200C753C42BD1
2002) DMD.CSB.1GL.001.20#D310054DE7CF959F5946FABAF561FBEF

The '(' is only present on the first line, indicating--so it 
seems--that scanf is eating away one more character than it 
should on every call.

Do I need brain surgery or is there really a problem with scanf?

Alex

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners



* Re: [Caml-list] Bug somewhere...
  2002-10-06 22:57 [Caml-list] Bug somewhere Alessandro Baretta
@ 2002-10-06 23:06 ` Alessandro Baretta
  2002-10-08 20:07   ` Pierre Weis
  2002-10-07  8:03 ` Pierre Weis
  1 sibling, 1 reply; 6+ messages in thread
From: Alessandro Baretta @ 2002-10-06 23:06 UTC (permalink / raw)
  To: Ocaml



Alessandro Baretta wrote:
> It's either in my brain or in the Scanf module, the former possibility 
> being definitely more likely.
> 
> I have written a very simple program to compute md5 checksums of codes 
> taken from a text file. Here it is:
> 
> let scan_line () = Scanf.scanf "%[^\n\r]\n" (fun a -> a)
> let digest s = String.uppercase
>   (Digest.to_hex(Digest.string s))
> let digest_line s = print_endline (s ^ "#" ^ (digest s))
> let _ = try while true do digest_line (scan_line ()) done
>   with End_of_file -> ()

I have rewritten my program in ocamllex. This one works. 
Here it is.

{

}

rule scanline = parse
| [^'\n''\r']*  {Lexing.lexeme lexbuf}
| ['\n''\r']*   {scanline lexbuf    }
| eof           {raise End_of_file}

{
let lexbuf = Lexing.from_channel stdin in
let digest s = String.uppercase
   (Digest.to_hex (Digest.string s)) in
let digest_line s = print_endline (s ^ "#" ^ (digest s)) in
   try while true do digest_line (scanline lexbuf) done
   with End_of_file -> ()

}

> Seems very reasonable...
> 
> Here's the input file:
> 
> (2002) DMD.CSB.1GL.001.01
> (2002) DMD.CSB.1GL.001.02
> (2002) DMD.CSB.1GL.001.03
> (2002) DMD.CSB.1GL.001.04
> (2002) DMD.CSB.1GL.001.05
> (2002) DMD.CSB.1GL.001.06
> (2002) DMD.CSB.1GL.001.07
> (2002) DMD.CSB.1GL.001.08
> (2002) DMD.CSB.1GL.001.09
> (2002) DMD.CSB.1GL.001.10
> (2002) DMD.CSB.1GL.001.11
> (2002) DMD.CSB.1GL.001.12
> (2002) DMD.CSB.1GL.001.13
> (2002) DMD.CSB.1GL.001.14
> (2002) DMD.CSB.1GL.001.15
> (2002) DMD.CSB.1GL.001.16
> (2002) DMD.CSB.1GL.001.17
> (2002) DMD.CSB.1GL.001.18
> (2002) DMD.CSB.1GL.001.19
> (2002) DMD.CSB.1GL.001.20

And the correct output:
(2002) DMD.CSB.1GL.001.01#EA486F3F11C1D1E5BE6DDC2A444BC4E1
(2002) DMD.CSB.1GL.001.02#DA0E405C9E982D4C51F9D21A2FAB5644
(2002) DMD.CSB.1GL.001.03#9D78774667150BBF2FE473CC149A72DB
(2002) DMD.CSB.1GL.001.04#72491ED198C8BAB5A659EF4730EBF76D
(2002) DMD.CSB.1GL.001.05#AE3CF2982E265B582725AFE770F685F8
(2002) DMD.CSB.1GL.001.06#8825A66BB3C4D1CEB362631C41FF0633
(2002) DMD.CSB.1GL.001.07#AE4F3D477E43943B044E05D5A0BDD498
(2002) DMD.CSB.1GL.001.08#84E0420BB0B52931EF839FB2673116D3
(2002) DMD.CSB.1GL.001.09#144ABD1E3136EBC4BF9642599340326A
(2002) DMD.CSB.1GL.001.10#92C65BDDFB8045D96D9B3DDE2580896C
(2002) DMD.CSB.1GL.001.11#AB9A737B83B040BCD4CE310977B3667B
(2002) DMD.CSB.1GL.001.12#20C1B0322756CC61D3792A6814FA175A
(2002) DMD.CSB.1GL.001.13#20C76BA308A80C93CA2A7FFCCBCD9696
(2002) DMD.CSB.1GL.001.14#BDD11EF273D429A7460E4A010F28AF8D
(2002) DMD.CSB.1GL.001.15#D55A8BEE54618241691AD349DB5D3B0A
(2002) DMD.CSB.1GL.001.16#D655BDC9DB0C22A2A03B718125884778
(2002) DMD.CSB.1GL.001.17#4EA753AEF91A7F497689DF1E43E0D083
(2002) DMD.CSB.1GL.001.18#B37C19DBE5ED47E9F3F9C8E257BC8F3E
(2002) DMD.CSB.1GL.001.19#A35BEE6D08F95935BFFC61ACFEAC54B7
(2002) DMD.CSB.1GL.001.20#FB357D47CF387E1EBFD94C9E79A1DD6A

What's wrong with the Scanf version?

Alex


* Re: [Caml-list] Bug somewhere...
  2002-10-06 22:57 [Caml-list] Bug somewhere Alessandro Baretta
  2002-10-06 23:06 ` Alessandro Baretta
@ 2002-10-07  8:03 ` Pierre Weis
  1 sibling, 0 replies; 6+ messages in thread
From: Pierre Weis @ 2002-10-07  8:03 UTC (permalink / raw)
  To: Alessandro Baretta; +Cc: caml-list

> It's either in my brain or in the Scanf module, the former 
> possibility being definitely more likely.
[...]
> 
> The '(' is only present on the first line, indicating--so it 
> seems--that scanf is eating-away one more character than it 
> should every time.
> 
> Do I need brain surgery or is there really a problem with scanf?
> 
> Alex

You probably discovered a bug in the implementation of the Scanf
module :(

I will correct it in the working sources, as soon as possible.

However you should report those bugs to caml-bugs@inria.fr, instead of
reporting to this list. We have a bug tracking system which is much
easier to deal with for recording and tracking than the Caml mailing
list...

Best regards,

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/



* Re: [Caml-list] Bug somewhere...
  2002-10-06 23:06 ` Alessandro Baretta
@ 2002-10-08 20:07   ` Pierre Weis
  2002-10-08 21:26     ` Eric C. Cooper
  2002-10-08 23:31     ` Alessandro Baretta
  0 siblings, 2 replies; 6+ messages in thread
From: Pierre Weis @ 2002-10-08 20:07 UTC (permalink / raw)
  To: Alessandro Baretta; +Cc: caml-list

> Alessandro Baretta wrote:
> > It's either in my brain or in the Scanf module, the former possibility 
> > being definitely more likely.
> > 
> > I have written a very simple program to compute md5 checksums of codes 
> > taken from a text file. Here it is:
> > 
> > let scan_line () = Scanf.scanf "%[^\n\r]\n" (fun a -> a)
> > let digest s = String.uppercase
> >   (Digest.to_hex(Digest.string s))
> > let digest_line s = print_endline (s ^ "#" ^ (digest s))
> > let _ = try while true do digest_line (scan_line ()) done
> >   with End_of_file -> ()
> 
> I have rewritten my program in ocamllex. This one works. 
> Here it is.
> 
> {
> 
> }
> 
> rule scanline = parse
> | [^'\n''\r']*  {Lexing.lexeme lexbuf}
> | ['\n''\r']*   {scanline lexbuf    }
> | eof           {raise End_of_file}
> 
> {
> let lexbuf = Lexing.from_channel stdin in
> let digest s = String.uppercase
>    (Digest.to_hex (Digest.string s)) in
> let digest_line s = print_endline (s ^ "#" ^ (digest s)) in
>    try while true do digest_line (scanline lexbuf) done
>    with End_of_file -> ()
> 
> }
> 
> > Seems very reasonable...
[...]
> 
> What's wrong with the Scanf version?
> 
> Alex

There are a lot of problems in here: some are due to the semantics of
the Scanf module, some are due to the implementation, and some are even
deeper than those two!

Indeed, the two programs are not equivalent (and their behaviours are
indeed different!).

The first reason is that you cannot match eof (as you did with your
lexer) using Scanf. This could be considered as a missing feature and
we may add a convention to match end of file (either ``@.'', ``@$'',
or ``$'' ?).

Second, your lexer uses an explicitly allocated buffer lexbuf, while
the corresponding scanf call allocates a new input buffer for each
invocation. The semantics of Scanf imposes a lookahead of 1 character
to check that no other \n follows the \n that ends your pattern (the
semantics of \n being to match 0 or more \n, space, tab, or return).
So for each line Scanf reads an extra character after the end of line
and stores it (a '(' here, by the way) in the input buffer; note that
this character has already been read from the in_channel. The next
scanf invocation then allocates a new input buffer that reads from
stdin starting after the last character read by the preceding
invocation (the '(' lookahead character). Hence you see that a '(' is
missing at the beginning of each line after the first one!
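
The mechanism described above can be reproduced with a small toy model
that does not use Scanf at all. This is only a sketch under stated
assumptions: the names chan, scanbuf, scan_line and lines_with are
hypothetical and are not part of the Scanf module; the model simply
keeps one lookahead character per scanning buffer and either throws
that buffer away after each call (fresh = true) or keeps it alive
(fresh = false).

```ocaml
(* Toy model of the lost-lookahead effect; nothing here comes from Scanf. *)

(* A "channel": reading is destructive, the position only moves forward. *)
type chan = { data : string; mutable pos : int }

let read_char ch =
  if ch.pos >= String.length ch.data then None
  else begin
    let c = ch.data.[ch.pos] in
    ch.pos <- ch.pos + 1;
    Some c
  end

(* A scanning buffer: a channel plus at most one lookahead character. *)
type scanbuf = { chan : chan; mutable ahead : char option }

let peek sb =
  match sb.ahead with
  | Some _ as c -> c
  | None -> let c = read_char sb.chan in sb.ahead <- c; c

let next sb = let c = peek sb in sb.ahead <- None; c

(* Read one line, then skip every following '\n'. Deciding that no further
   '\n' follows requires peeking one character past the end of line. *)
let scan_line sb =
  let b = Buffer.create 32 in
  let rec body () =
    match peek sb with
    | Some c when c <> '\n' -> ignore (next sb); Buffer.add_char b c; body ()
    | _ -> ()
  in
  let rec newlines () =
    match peek sb with
    | Some '\n' -> ignore (next sb); newlines ()
    | _ -> ()  (* the peeked character stays in sb.ahead *)
  in
  if peek sb = None then raise End_of_file;
  body ();
  newlines ();
  Buffer.contents b

(* fresh = true: a new buffer per call, so the lookahead is discarded.
   fresh = false: one shared buffer, so the lookahead survives. *)
let lines_with ~fresh input =
  let ch = { data = input; pos = 0 } in
  let shared = { chan = ch; ahead = None } in
  let rec loop acc =
    let sb = if fresh then { chan = ch; ahead = None } else shared in
    match scan_line sb with
    | line -> loop (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  loop []

(* lines_with ~fresh:true  "(a\n(b\n(c\n" = ["(a"; "b"; "c"]
   lines_with ~fresh:false "(a\n(b\n(c\n" = ["(a"; "(b"; "(c"] *)
```

With a fresh buffer per call, the model loses the first character of
every line but the first, which is the shape of the output reported at
the start of this thread.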

To solve this problem, you should use bscanf and an explicitly
allocated input buffer that survives from one call to scanf to
the next. Considering that this phenomenon is general concerning
stdin and scanf, I rewrote the scanf code so that it allocates a
buffer once and for all. Hence this problem is solved in the working
sources.

In the meantime, explicitly allocating an input buffer would solve
this problem for you:

let lexbuf = Scanf.Scanning.from_channel stdin
let scan_line () = Scanf.bscanf lexbuf "%[^\n\r]\n" (fun a -> a)
let digest s = String.uppercase
  (Digest.to_hex(Digest.string s))
let digest_line s = print_endline (s ^ "#" ^ (digest s))
let _ = try while true do digest_line (scan_line ()) done
   with End_of_file -> ()
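
For readers on a recent OCaml, the same idea can be sketched as follows.
This is an adaptation, not the code from this message: it assumes a
standard library that provides String.uppercase_ascii (recent releases
no longer ship String.uppercase), it collects the lines in a list
instead of looping with while, and it uses Scanf.Scanning.from_string
as a stand-in for stdin so that the example is reproducible. The helper
names scan_lines and digest_lines are hypothetical.

```ocaml
(* Upper-case hexadecimal MD5 of a string. *)
let digest s = String.uppercase_ascii (Digest.to_hex (Digest.string s))

(* Read every line through ONE scanning buffer that outlives each call. *)
let scan_lines ib =
  let rec loop acc =
    match Scanf.bscanf ib "%[^\n\r]\n" (fun a -> a) with
    | line -> loop (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  loop []

let digest_lines ib =
  List.map (fun s -> s ^ "#" ^ digest s) (scan_lines ib)

let () =
  (* from_string replaces stdin here; a real program would pass
     Scanf.Scanning.from_channel stdin instead. *)
  let ib = Scanf.Scanning.from_string "(2002) A\n(2002) B\n" in
  List.iter print_endline (digest_lines ib)
```

The scanning buffer is created once and passed to every bscanf call, so
whatever lookahead the scanner keeps stays available to the next call.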

Another semantic question is: should the call

sscanf "" "%[^\n\r]\n" (fun x -> x)

be successful or not? If yes, what happens to your problem?

An interesting example indeed, which helps to clarify the semantics of
Scanf patterns and functions; thank you very much!

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://pauillac.inria.fr/~weis/



* Re: [Caml-list] Bug somewhere...
  2002-10-08 20:07   ` Pierre Weis
@ 2002-10-08 21:26     ` Eric C. Cooper
  2002-10-08 23:31     ` Alessandro Baretta
  1 sibling, 0 replies; 6+ messages in thread
From: Eric C. Cooper @ 2002-10-08 21:26 UTC (permalink / raw)
  To: caml-list

On Tue, Oct 08, 2002 at 10:07:01PM +0200, Pierre Weis wrote:
> A lot of problems in here: some are due to the semantics of the Scanf
> module some are due to the implementation, some are even deeper than
> those two!
> ... 
> To solve this problem, you should use bscanf and an explicitly
> allocated input buffer that would survive from one call to scanf to
> the next one. Considering that this phenomenon is general concerning
> stdin and scanf, I rewrote the scanf code such that it allocates a
> buffer once and for all. Hence this problem is solved in the working
> sources.

In the C stdio library, this is solved by ungetc() (push back an
already-read character).  That might be a useful addition to the
operations on in_channels.
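
A one-slot pushback of this kind can be sketched in user code. This is
only an illustration of the idea, not an existing in_channel operation:
the type pushback and the functions of_string, of_channel, getc and
ungetc are hypothetical names, and the reader is abstracted as a
function so the sketch can be tried on a string as well as on a channel.

```ocaml
(* A reader with a single pushback slot, in the spirit of C's ungetc(). *)
type pushback = {
  read : unit -> char option;    (* underlying destructive read *)
  mutable pushed : char option;  (* at most one pushed-back character *)
}

let of_string s =
  let pos = ref 0 in
  let read () =
    if !pos >= String.length s then None
    else begin
      let c = s.[!pos] in
      incr pos;
      Some c
    end
  in
  { read; pushed = None }

let of_channel ic =
  { read = (fun () -> try Some (input_char ic) with End_of_file -> None);
    pushed = None }

(* Return the pushed-back character if there is one, else read a new one. *)
let getc pb =
  match pb.pushed with
  | Some _ as c -> pb.pushed <- None; c
  | None -> pb.read ()

(* Put one character back; only one slot is available in this sketch. *)
let ungetc pb c =
  if pb.pushed <> None then invalid_arg "ungetc: slot already full";
  pb.pushed <- Some c
```

A scanner built on such a reader could push its lookahead character
back before returning, so a later scanner on the same reader would see
it again.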

-- 
Eric C. Cooper          e c c @ c m u . e d u

* Re: [Caml-list] Bug somewhere...
  2002-10-08 20:07   ` Pierre Weis
  2002-10-08 21:26     ` Eric C. Cooper
@ 2002-10-08 23:31     ` Alessandro Baretta
  1 sibling, 0 replies; 6+ messages in thread
From: Alessandro Baretta @ 2002-10-08 23:31 UTC (permalink / raw)
  To: Pierre Weis, Ocaml



Pierre Weis wrote:
> 
> A lot of problems in here: some are due to the semantics of the Scanf
> module some are due to the implementation, some are even deeper than
> those two!
> 
> Indeed the two programs are not equivalent (and their behaviour are
> indeed different!).

They are meant to be equivalent under the following 
assumption: the input file is divided into lines, each 
terminated by either '\n' or '\r'. The difference is mostly 
due to the fact that Scanf 3.06 reads one character more 
than the format string specifies. Any other differences 
are attributable to faulty connections in my brain.

> The first reason is that you cannot match eof (as you did with your
> lexer) using Scanf. This could be considered as a missing feature and
> we may add a convention to match end of file (either ``@.'', ``@$'',
> or ``$'' ?).

I can live with this. What Scanf *really lacks* is 
C-equivalent support for partial matches. If a C format 
matches only partially, only the conversions specified in 
the matched prefix are performed. In O'Caml, Scanf throws an 
exception. A better solution would be for Scanf.scanf to 
have type:
('a, Scanning.scanbuf, 'b) format -> 'a option -> 'b
If a conversion is performed, the callback function is 
passed Some(<result>); otherwise, in a partial match, f gets 
a number of None actual parameters from scanf.
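
Short of changing the type of scanf, a coarser approximation is
available in user code: turn the exception into an option for the whole
match. The sketch below is not the per-conversion scheme proposed above,
only an all-or-nothing wrapper; scan_opt is a hypothetical name, and the
set of exceptions caught (Scanf.Scan_failure, Failure, End_of_file) is
what I assume a recent standard library can raise from a failed scan.

```ocaml
(* All-or-nothing scanning: Some result on a full match, None otherwise. *)
let scan_opt s fmt f =
  try Some (Scanf.sscanf s fmt f) with
  | Scanf.Scan_failure _ | Failure _ | End_of_file -> None

(* Example: split one of the input lines into its year and its code. *)
let parse_code line = scan_opt line "(%d) %s" (fun year code -> (year, code))
```

A caller can then try a longer format first and fall back to a shorter
one when it returns None, which gives a limited form of partial
matching without exceptions leaking into the parsing logic.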

This approach would make Scanf much more useful. We would be 
able to code simple parsers explicitly with OCaml logic and 
Scanf formats where, at present, we are forced to go 
with Ocamllex/yacc. Take my case, for example.

> Second, your lexer uses an explicitly allocated buffer lexbuf, while
> the scanf corresponding call allocates a new input buffer for each
> invocation; but the semantics of Scanf imposes a look ahead of 1
> character to check that no other \n follows the \n that ends your
> pattern (the semantics of \n being to match 0 or more \n, space, tab,
> or return). For each line Scanf reads an extra character after the end
> of line; it stores this character (which is a '(' by the way) in the
> input buffer; but note that the character has been read from the
> in_channel; now the next scanf invocation will allocate a new input
> buffer that reads from stdin starting after the last character read by
> the preceding invocation (the '(' lookahead character). Hence you
> see that a '(' is missing at the beginning of each line after the
> first one!

This behaviour is counterintuitive, and should be considered 
buggy.

> To solve this problem, you should use bscanf and an explicitly
> allocated input buffer that would survive from one call to scanf to
> the next one. Considering that this phenomenon is general concerning
> stdin and scanf, I rewrote the scanf code such that it allocates a
> buffer once and for all. Hence this problem is solved in the working
> sources.

Very good. Thank you very much.

> ...
> Another semantical question is: should the call
> 
> sscanf "" "%[^\n\r]\n" (fun x -> x)
> 
> be successful or not ? If yes, what happens to your problem ?

With the present semantics, it should raise an exception. 
With the semantics of partial matches it should succeed.
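
The question can be checked directly on whatever OCaml is at hand. The
sketch below only records the outcome instead of asserting which
exception is raised, since that detail may differ between the 3.06
discussed here and later releases; attempt, with_newline and
without_newline are hypothetical names.

```ocaml
(* Run a one-string scan and report the outcome instead of raising. *)
let attempt input fmt =
  match Scanf.sscanf input fmt (fun x -> x) with
  | s -> Ok s
  | exception End_of_file -> Error "End_of_file"
  | exception Scanf.Scan_failure msg -> Error msg

(* The format from the question: a possibly empty line, then "\n". *)
let with_newline = attempt "" "%[^\n\r]\n"

(* The same format without the trailing "\n", for comparison. *)
let without_newline = attempt "" "%[^\n\r]"
```

On the standard library versions I am assuming, the character-set
conversion alone accepts the empty input and yields "", while the
variant with the trailing "\n" fails because no newline is left to
match.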

> An interesting example indeed that helps precising the semantics of
> Scanf patterns and functions, thank you very much!
> 
> Pierre Weis

I humbly bow to your kindness. Thank you very much for 
sharing your work with all of us.

Alex
