caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* Marshalling data format deteriorates compressibility
@ 2006-06-27 21:04 Markus Mottl
  2006-06-28  0:00 ` [Caml-list] " Jon Harrop
  0 siblings, 1 reply; 2+ messages in thread
From: Markus Mottl @ 2006-06-27 21:04 UTC (permalink / raw)
  To: ocaml

Hi,

we have just found out that under certain circumstances the current
binary format used for encoding marshalled OCaml-values badly affects
compressibility of the data by a wide class of compression algorithms
(especially Lempel-Ziv based ones).

Before we write out datastructures to disk we usually hashcons them to
maximize sharing.  This, of course, leads to much smaller file sizes.
Suprisingly, however, compressing the hashconsed files using e.g. gzip
leads to very significantly (e.g. three times!) larger files than
gzipping marshalled datastructures that have not been hashconsed.

We finally found out what causes the problem: OCaml represents
pointers to shared data values using relative offsets instead of
absolute positions within the marshalled data.  This means that e.g.
an array containing pointers to these values will be represented by a
sequence of increasing relative offsets, which essentially renders it
almost incompressible to the usual compression algorithms.

As it seems, the current marshalling algorithm uses this relative
addressing approach to save space: relative offsets are encoded with
variable length (this assumes some degree of locality), which is not
possible with absolute addressing.  Unfortunately, this does not take
compression algorithms into account, which may greatly benefit from
repeating patterns of pointers.

Could one of the maintainers please tell us what would be the least
intrusive way of adding a new marshalling format?  If this data format
were added to the distribution, it could be turned on using a new flag
in the Marshal-module.  A different magic number might be used for
chosing the right unmarshalling function.

Of course, we would be even more grateful if there were an
implementation of the alternative encoding in the next release... ;-)

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [Caml-list] Marshalling data format deteriorates compressibility
  2006-06-27 21:04 Marshalling data format deteriorates compressibility Markus Mottl
@ 2006-06-28  0:00 ` Jon Harrop
  0 siblings, 0 replies; 2+ messages in thread
From: Jon Harrop @ 2006-06-28  0:00 UTC (permalink / raw)
  To: caml-list

On Tuesday 27 June 2006 22:04, Markus Mottl wrote:
> As it seems, the current marshalling algorithm uses this relative
> addressing approach to save space: relative offsets are encoded with
> variable length (this assumes some degree of locality), which is not
> possible with absolute addressing.  Unfortunately, this does not take
> compression algorithms into account, which may greatly benefit from
> repeating patterns of pointers.

Interesting. Perhaps using a higher-order bitwise arithmetic coder instead of 
*zip would be the least intrusive solution? Alternatively, a custom rewrite 
tool in order to evade the compiler's internals?

-- 
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2006-06-28  0:01 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-27 21:04 Marshalling data format deteriorates compressibility Markus Mottl
2006-06-28  0:00 ` [Caml-list] " Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).