From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 6E15BBB84 for ; Tue, 27 Jun 2006 23:04:58 +0200 (CEST) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id k5RL4wxt015348 for ; Tue, 27 Jun 2006 23:04:58 +0200 Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id XAA03023 for ; Tue, 27 Jun 2006 23:04:58 +0200 (MET DST) Received: from wx-out-0102.google.com (wx-out-0102.google.com [66.249.82.203]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k5RL4vvt031739 for ; Tue, 27 Jun 2006 23:04:57 +0200 Received: by wx-out-0102.google.com with SMTP id t4so1050358wxc for ; Tue, 27 Jun 2006 14:04:56 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition; b=RrEZksczqmN8D3h38j1ELzrSEThFwfR4nu1VCEEdMKKx5sBD41XSSeIm5+ed1JVc76NMeaiZ5GZAsZX4AzD+6y0sXNtqqVX9QXeg183dKHOQDGOsP1ECEJoVJDWRoy2CGfiZG/Bu79cw1sj7JXM7GEuXkDj7BU4XXmziiOKEq3w= Received: by 10.70.92.2 with SMTP id p2mr198295wxb; Tue, 27 Jun 2006 14:04:55 -0700 (PDT) Received: by 10.70.55.19 with HTTP; Tue, 27 Jun 2006 14:04:55 -0700 (PDT) Message-ID: Date: Tue, 27 Jun 2006 17:04:55 -0400 From: "Markus Mottl" To: ocaml Subject: Marshalling data format deteriorates compressibility MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-j-chkmail-Score: MSGID : 44A19D79.000 on nez-perce : j-chkmail score : X : 0/20 1 X-Miltered: at concorde with ID 44A19D7A.002 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at nez-perce with ID 44A19D79.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; markus:01 mottl:01 markus:01 mottl:01 marshalling:01 binary:01 ocaml-values:01 ocaml:01 pointers:01 pointers:01 renders:01 marshalling:01 locality:01 chosing:01 ocaml:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.5 required=5.0 tests=INFO_TLD,RCVD_BY_IP autolearn=disabled version=3.0.3 Hi, we have just found out that under certain circumstances the current binary format used for encoding marshalled OCaml-values badly affects compressibility of the data by a wide class of compression algorithms (especially Lempel-Ziv based ones). Before we write out datastructures to disk we usually hashcons them to maximize sharing. This, of course, leads to much smaller file sizes. Suprisingly, however, compressing the hashconsed files using e.g. gzip leads to very significantly (e.g. three times!) larger files than gzipping marshalled datastructures that have not been hashconsed. We finally found out what causes the problem: OCaml represents pointers to shared data values using relative offsets instead of absolute positions within the marshalled data. This means that e.g. an array containing pointers to these values will be represented by a sequence of increasing relative offsets, which essentially renders it almost incompressible to the usual compression algorithms. As it seems, the current marshalling algorithm uses this relative addressing approach to save space: relative offsets are encoded with variable length (this assumes some degree of locality), which is not possible with absolute addressing. Unfortunately, this does not take compression algorithms into account, which may greatly benefit from repeating patterns of pointers. Could one of the maintainers please tell us what would be the least intrusive way of adding a new marshalling format? If this data format were added to the distribution, it could be turned on using a new flag in the Marshal-module. A different magic number might be used for chosing the right unmarshalling function. Of course, we would be even more grateful if there were an implementation of the alternative encoding in the next release... ;-) Regards, Markus -- Markus Mottl http://www.ocaml.info markus.mottl@gmail.com