caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: "Markus Mottl" <markus.mottl@gmail.com>
To: ocaml <caml-list@inria.fr>
Subject: Marshalling data format deteriorates compressibility
Date: Tue, 27 Jun 2006 17:04:55 -0400	[thread overview]
Message-ID: <f8560b80606271404k43c7eacer8a659395c4d1674@mail.gmail.com> (raw)

Hi,

we have just found out that under certain circumstances the current
binary format used for encoding marshalled OCaml-values badly affects
compressibility of the data by a wide class of compression algorithms
(especially Lempel-Ziv based ones).

Before we write out datastructures to disk we usually hashcons them to
maximize sharing.  This, of course, leads to much smaller file sizes.
Suprisingly, however, compressing the hashconsed files using e.g. gzip
leads to very significantly (e.g. three times!) larger files than
gzipping marshalled datastructures that have not been hashconsed.

We finally found out what causes the problem: OCaml represents
pointers to shared data values using relative offsets instead of
absolute positions within the marshalled data.  This means that e.g.
an array containing pointers to these values will be represented by a
sequence of increasing relative offsets, which essentially renders it
almost incompressible to the usual compression algorithms.

As it seems, the current marshalling algorithm uses this relative
addressing approach to save space: relative offsets are encoded with
variable length (this assumes some degree of locality), which is not
possible with absolute addressing.  Unfortunately, this does not take
compression algorithms into account, which may greatly benefit from
repeating patterns of pointers.

Could one of the maintainers please tell us what would be the least
intrusive way of adding a new marshalling format?  If this data format
were added to the distribution, it could be turned on using a new flag
in the Marshal-module.  A different magic number might be used for
chosing the right unmarshalling function.

Of course, we would be even more grateful if there were an
implementation of the alternative encoding in the next release... ;-)

Regards,
Markus

-- 
Markus Mottl        http://www.ocaml.info        markus.mottl@gmail.com


             reply	other threads:[~2006-06-27 21:04 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-27 21:04 Markus Mottl [this message]
2006-06-28  0:00 ` [Caml-list] " Jon Harrop

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f8560b80606271404k43c7eacer8a659395c4d1674@mail.gmail.com \
    --to=markus.mottl@gmail.com \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).