caml-list - the Caml user's mailing list
* integration of compression with channels
@ 2008-11-04 12:50 Eric Cooper
From: Eric Cooper @ 2008-11-04 12:50 UTC (permalink / raw)
  To: caml-list

I was interested to see Zack's work on integrating gzip and bzip2 with
I/O channels:
    http://upsilon.cc/~zack/blog/posts/2008/11/ocaml_batteries_gzip/

I initially tried something like this in the approx proxy server, but
found out the hard way that it was difficult to deal with corrupt .gz
files.  You might only discover the corruption after reading garbage
for a while, and an exception at that point would be unexpected.
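
For concreteness, here is a rough sketch of that failure mode, written
against what I understand camlzip's Gzip API to be (Gzip.open_in,
Gzip.input, Gzip.Error); take the names as my assumptions, not a quote
of approx's code:

    (* Copy a .gz file's decompressed contents to [oc].  With a corrupt
       archive, Gzip.Error is raised only once decompression reaches the
       damaged block, possibly after a lot of garbage has been emitted. *)
    let copy_decompressed gz_file oc =
      let ic = Gzip.open_in gz_file in
      let buf = Bytes.create 65536 in
      let rec loop () =
        match Gzip.input ic buf 0 (Bytes.length buf) with
        | 0 -> ()
        | n -> output oc buf 0 n; loop ()
      in
      (try loop () with Gzip.Error msg ->
         (* anything already written to [oc] may be garbage *)
         prerr_endline ("corrupt gzip stream: " ^ msg));
      Gzip.close_in ic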

Eventually I switched to spawning a "gunzip" process that decompresses
to a temporary file, and then reading that file.  In addition to
detecting corruption early, this turned out to be significantly faster
than CamlZip.
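
In code, the idea is roughly this (a minimal sketch, not approx's
actual source; the helper name and the shell invocation are just
illustrations):

    (* Decompress [gz_file] with an external gunzip into a temp file.
       If the archive is corrupt, gunzip exits non-zero and we find out
       up front, before anything has been parsed. *)
    let decompress_to_temp gz_file =
      let tmp = Filename.temp_file "approx" ".decompressed" in
      let cmd =
        Printf.sprintf "gunzip --stdout %s > %s"
          (Filename.quote gz_file) (Filename.quote tmp)
      in
      match Sys.command cmd with
      | 0 -> tmp                  (* caller reads, then removes, [tmp] *)
      | n ->
          Sys.remove tmp;
          failwith (Printf.sprintf "gunzip failed with exit code %d" n)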

I suppose one could argue that you can get an I/O error even from
reading an uncompressed file (bad disk block, or whatever), and that
a robust program should be equally prepared to deal with that.
But I think there's a real difference in practice.

The integrated approach is definitely more elegant, and perhaps the
performance will be competitive someday.  So I'd be interested to hear
if anyone has a better way of handling potentially corrupt files.

-- 
Eric Cooper             e c c @ c m u . e d u



* Re: [Caml-list] integration of compression with channels
From: Stefano Zacchiroli @ 2008-11-04 13:26 UTC (permalink / raw)
  To: caml-list

On Tue, Nov 04, 2008 at 07:50:20AM -0500, Eric Cooper wrote:
> I initially tried something like this in the approx proxy server, but
> found out the hard way that it was difficult to deal with corrupt .gz
> files.  You might only discover the corruption after reading garbage
> for a while, and an exception at that point would be unexpected.

I think you are fighting an intrinsic underlying problem.

Let's take the extreme end of integrity checks: a checksum over the
whole file.  To verify it, you need to read the entire file, compute
its checksum, and compare it with the expected value.  On the other
hand, abstractions like channels are precisely meant to read files in
a streaming fashion, rather than all at once.
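
To make the point concrete, a whole-file check looks something like
this (using the stdlib's Digest module just as an example; where the
expected digest comes from is out of band):

    (* The file can only be trusted after a complete pass: nothing can
       be handed to the consumer until the digest has been verified. *)
    let verified_contents path expected_digest =
      if Digest.file path = expected_digest then begin
        let ic = open_in_bin path in
        let s = really_input_string ic (in_channel_length ic) in
        close_in ic;
        Some s
      end else
        None                       (* corrupt: reject before any use *)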

Bottom line: there is a trade-off between "streamability" and
integrity checks, and it is up to you to choose where to position
yourself on that trade-off.

Actually, it is often not even up to you, but rather up to the file
format you are reading.  I don't know the gory details of the GZip
format, but Camlzip does some sanity checks on GZip headers, spotting
*some* of the possible header corruptions.  It may be that you hit
corruption cases not covered by Camlzip; in that case the proper fix
is to add those checks to Camlzip.  On the other hand, if you want to
detect in advance corruption that occurs later in the compressed file
(and I don't know whether GZip supports that), you have no choice
besides buffering.
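
As an illustration of the kind of header check I mean (the magic bytes
0x1f 0x8b and the deflate method byte are documented in RFC 1952; this
is just a sketch, not Camlzip's code):

    (* Cheap up-front sanity check: a well-formed gzip member starts
       with the magic bytes 0x1f 0x8b followed by compression method 8
       (deflate).  Passing this check says nothing about corruption
       deeper in the stream. *)
    let looks_like_gzip path =
      let ic = open_in_bin path in
      let ok =
        try
          input_byte ic = 0x1f
          && input_byte ic = 0x8b
          && input_byte ic = 8
        with End_of_file -> false
      in
      close_in ic;
      ok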

Cheers.

-- 
Stefano Zacchiroli -*- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
Dietro un grande uomo c'è sempre /oo\ All one has to do is hit the right
uno zaino        -- A.Bergonzoni \__/ keys at the right time -- J.S.Bach

