From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id C2C66BB84 for ; Wed, 28 Jun 2006 16:48:52 +0200 (CEST) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.6/8.13.6) with ESMTP id k5SEmr4Q028116 for ; Wed, 28 Jun 2006 16:48:53 +0200 Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA20213 for ; Wed, 28 Jun 2006 16:48:52 +0200 (MET DST) Received: from out3.smtp.messagingengine.com (out3.smtp.messagingengine.com [66.111.4.27]) by nez-perce.inria.fr (8.13.6/8.13.6) with ESMTP id k5SEmoIh024048 for ; Wed, 28 Jun 2006 16:48:52 +0200 Received: from frontend3.internal (frontend3.internal [10.202.2.152]) by frontend1.messagingengine.com (Postfix) with ESMTP id C9FD4D87C6E for ; Tue, 27 Jun 2006 19:31:44 -0400 (EDT) Received: from heartbeat1.messagingengine.com ([10.202.2.160]) by frontend3.internal (MEProxy); Tue, 27 Jun 2006 19:31:46 -0400 X-Sasl-enc: asrwiQodon61y83FHET3Ad8DCQOI+uLtGCVK2Z0GG7jY 1151451105 Received: from munge (burnham.ljcrf.edu [192.231.106.2]) by mail.messagingengine.com (Postfix) with ESMTP id 8A5626DB3 for ; Tue, 27 Jun 2006 19:31:45 -0400 (EDT) Date: Tue, 27 Jun 2006 16:31:03 -0700 (PDT) From: Martin Jambon X-X-Sender: martin@munge To: caml-list@inria.fr Subject: Re: [Caml-list] Marshalling data format deteriorates compressibility Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Miltered: at concorde with ID 44A296D5.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at nez-perce with ID 44A296D2.002 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; ocaml:01 marshalling:01 markus:01 mottl:01 ocaml:01 pointers:01 pointers:01 renders:01 marshalling:01 locality:01 rounding:01 wrote:01 caml-list:01 algorithm:01 int:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 [to the list admin: why can I send messages with this subscription and not with martin_jambon@emailuser.net?] On Tue, 27 Jun 2006, Markus Mottl wrote: > We finally found out what causes the problem: OCaml represents > pointers to shared data values using relative offsets instead of > absolute positions within the marshalled data. This means that e.g. > an array containing pointers to these values will be represented by a > sequence of increasing relative offsets, which essentially renders it > almost incompressible to the usual compression algorithms. > > As it seems, the current marshalling algorithm uses this relative > addressing approach to save space: relative offsets are encoded with > variable length (this assumes some degree of locality), which is not > possible with absolute addressing. Unfortunately, this does not take > compression algorithms into account, which may greatly benefit from > repeating patterns of pointers. Maybe you can convert the data into a marshal-optimized format before marshalling, where you put your shared data into an array, and substitute the pointers by array indices, e.g. type t = string list becomes: type marshalled_t = (string array * int list) ["a"; "b"; "a"; "a"] -> ([| "a"; "b" |], [0; 1; 0; 0]) That seems like a lot of work, but it shouldn't be too hard to maintain. By the way, floats don't compress very well. Rounding them as much as possible used to save me about 50% of space. Martin -- Martin Jambon, PhD http://martin.jambon.free.fr