caml-list - the Caml user's mailing list
* copy the rest of a file after scanning
@ 2008-03-17 23:29 viktor tron
  2008-03-18  1:33 ` [Caml-list] " Oliver Bandel
  0 siblings, 1 reply; 3+ messages in thread
From: viktor tron @ 2008-03-17 23:29 UTC (permalink / raw)
  To: caml-list

Dear list,
I have the funniest problem.

I use Scan to scan a file and output an edited variant; when the edits are
done, I just need to copy the remainder of the file.

This ludicrous task is proving harder to handle than the whole project.

The problem is that when I finish editing, the scanning buffer is active and
not empty, and my reading position in the input channel is not where I
currently am in scanning.

I don't want to use the scanner to consume the remaining 3 terabytes,
digesting them line by line. I can't seem to scan chunks bigger than lines
either, unless I can name a character that certainly does not appear in the
text I am reading.
If there is a character like this, say @, then I am OK with

1)
scan "%s@@%!" (fun s -> Printf.fprintf corpus_out "%s" s)

This one reads up to the next @ character (which is ignored) or the end of
the input, which is checked by putting %! explicitly.
This passes my tests but is horribly ugly, since there is no character that
I can guarantee this way.
Plus, I might not have enough memory to pass this whole chunk as one string
if the file is large.

So as an alternative I did this:

2)
(* we set the input channel reading position to where we are in scanning *)
let () = scan "%n" (fun x -> seek_in corpus_in (x - 1)) in
(* and then dump the rest trivially in chunks of buf_size chars *)
let buf = Bytes.create buf_size in
let rec dump () =
  let len = input corpus_in buf 0 buf_size in
  if len > 0 then (output corpus_out buf 0 len; dump ())
in
dump ()

or in one go

3)
let end_pos = in_channel_length corpus_in in
let len = end_pos - pos_in corpus_in in
let s = Bytes.create len in
let () = really_input corpus_in s 0 len in
output_bytes corpus_out s;

On my Mac and on Linux everything worked smoothly, until I used it on
Windows.
(3) does not work on Windows, since in_channel_length and seek do not take
into account the newline translations that take place when reading.
In other words, the scan module reports character positions incorrectly,
since CRLF (\013\010) is counted as one character and matched by \n.

Then again, (2) seems to work, but I have no clue why seek would differ from
in_channel_length when it comes to counting chars.
If they count the same way, (2) should not work either.

Any thoughts?

Viktor
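
For reference, the chunked copy from approach (2) can be written as a small
standalone function. This is only a sketch: the name `copy_rest` and the
default `buf_size` are made up here and do not come from the thread, and it
assumes both channels were opened in binary mode so that no newline
translation happens on the way through.

```ocaml
(* Hypothetical helper, not from the thread: copy everything from the
   current position of [ic] to [oc] through a fixed-size buffer, so that
   memory use stays bounded no matter how large the file is. *)
let copy_rest ?(buf_size = 65536) ic oc =
  let buf = Bytes.create buf_size in
  let rec loop () =
    (* [input] returns 0 only at end of file. *)
    let len = input ic buf 0 buf_size in
    if len > 0 then begin
      output oc buf 0 len;
      loop ()
    end
  in
  loop ()
```

After a `seek_in ic pos`, calling `copy_rest ic oc` copies the tail of the
file starting at byte `pos`, without ever holding more than `buf_size` bytes
in memory.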


* Re: [Caml-list] copy the rest of a file after scanning
  2008-03-17 23:29 copy the rest of a file after scanning viktor tron
@ 2008-03-18  1:33 ` Oliver Bandel
  0 siblings, 0 replies; 3+ messages in thread
From: Oliver Bandel @ 2008-03-18  1:33 UTC (permalink / raw)
  To: caml-list

Hello,

so much text... not easy to follow you...
...there is no code for "scan"...

...but one sentence was there that might point to the problem...


Quoting viktor tron <viktor.tron.ml@gmail.com>:
[...]
>
> On my mac and linux, all works smoothly, till I used it on windows.
> 3) does not work on windows, since in_channel_length and seek do not
> take into account the newline translations that take place at reading.
[...]


Did you use "open_in_bin"?

Ciao,
   Oliver
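
To make the `open_in_bin` suggestion concrete, here is a minimal sketch. It
assumes the scanner is a `Scanf.Scanning` buffer created over the channel at
the very start of the file (the thread never shows how `scan` is defined),
and the function name `copy_after_first_word` is purely illustrative. In
binary mode no newline translation takes place, so the character count that
`%n` reports should also be a valid byte offset for `seek_in`.

```ocaml
(* Hypothetical example: scan the first word of [src], then copy the
   unscanned remainder of the file to [dst] in fixed-size chunks. *)
let copy_after_first_word src dst =
  (* Binary mode: CRLF is not translated, so counts are byte offsets. *)
  let ic = open_in_bin src in
  let oc = open_out_bin dst in
  let sb = Scanf.Scanning.from_channel ic in
  (* %s reads one word, the space skips whitespace, and %n reports how
     many characters the scanner has consumed so far. *)
  let word, consumed = Scanf.bscanf sb "%s %n" (fun w n -> (w, n)) in
  (* The scanner has read ahead, so reposition the channel explicitly. *)
  seek_in ic consumed;
  let buf = Bytes.create 4096 in
  let rec loop () =
    let len = input ic buf 0 (Bytes.length buf) in
    if len > 0 then begin
      output oc buf 0 len;
      loop ()
    end
  in
  loop ();
  close_in ic;
  close_out oc;
  word
```

The scanning buffer must not be used again after the `seek_in`, since its
internal buffer still holds data that was read ahead from the old position.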



* copy the rest of a file after scanning
@ 2008-03-17 18:00 viktor tron
  0 siblings, 0 replies; 3+ messages in thread
From: viktor tron @ 2008-03-17 18:00 UTC (permalink / raw)
  To: caml-list

Dear list,
I have the funniest problem.

I use Scan to scan a file and output an edited variant; when the edits are
done, I just need to copy the remainder of the file.

This ludicrous task is proving harder to handle than the whole project.

The problem is that when I finish editing, the scanning buffer is active and
not empty, and my reading position in the input channel is not where I
currently am in scanning.

I don't want to use the scanner to consume the remaining 3 terabytes,
digesting them line by line. I can't seem to scan chunks bigger than lines
either, unless I can name a character that certainly does not appear in the
text I am reading.
If there is a character like this, say @, then I am OK with

1)
scan "%s@@%!" (fun s -> Printf.fprintf corpus_out "%s" s)

This one reads up to the next @ character (which is ignored) or the end of
the input, which is checked by putting %! explicitly.
This passes my tests but is horribly ugly, since there is no character that
I can guarantee this way.
Plus, I might not have enough memory to pass this whole chunk as one string
if the file is large.

So as an alternative I did this:

2)
(* we set the input channel reading position to where we are in scanning *)
let () = scan "%n" (fun x -> seek_in corpus_in (x - 1)) in
(* and then dump the rest trivially in chunks of buf_size chars *)
let buf = Bytes.create buf_size in
let rec dump () =
  let len = input corpus_in buf 0 buf_size in
  if len > 0 then (output corpus_out buf 0 len; dump ())
in
dump ()

or in one go

3)
let end_pos = in_channel_length corpus_in in
let len = end_pos - pos_in corpus_in in
let s = Bytes.create len in
let () = really_input corpus_in s 0 len in
output_bytes corpus_out s;

On my Mac and on Linux everything worked smoothly, until I used it on
Windows.
These do not work on Windows, since in_channel_length and seek do not take
into account the newline translations that take place when reading.
In other words, the scan module reports character positions incorrectly,
since CRLF (\013\010) is counted as one character and matched by \n.
I suspect (2) does not work either, since seek probably counts chars the
same way as in_channel_length.

So there is no way to combine scan positions with in_channel/seek-type
positions.

If there were a way to dump and empty the scanning buffer, I could just use
(2), since the scanning position and pos_in would then align, but I have
found no way of doing that.

I have no idea how to solve this. Well, I guess I am missing something
trivial.

Thanks for your help,

Viktor
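
The mismatch between the scanning position and `pos_in` can be reproduced in
a few lines. The sketch below assumes the scanner is a `Scanf.Scanning`
buffer over the channel, which reads from the channel in blocks and therefore
runs ahead of what has actually been scanned; the function name
`scanned_vs_channel_pos` is invented for illustration.

```ocaml
(* Hypothetical example: scan a single word from the file at [path] and
   return both the number of characters the scanner has consumed (%n)
   and the position of the underlying channel (pos_in). *)
let scanned_vs_channel_pos path =
  let ic = open_in_bin path in
  let sb = Scanf.Scanning.from_channel ic in
  (* %s reads one word; %n reports the characters consumed so far. *)
  let consumed = Scanf.bscanf sb "%s%n" (fun _word n -> n) in
  (* The scanning buffer pulled a whole block from the channel, so the
     channel position is usually well past [consumed]. *)
  let channel_pos = pos_in ic in
  close_in ic;
  (consumed, channel_pos)
```

On a small file the channel position is typically already at the end of the
file after the first scan, which is why `pos_in` cannot be used directly as
the starting point for copying the remainder.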
