From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p8JKHtWN017557 for ; Mon, 19 Sep 2011 22:17:55 +0200 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Av0EADGjd07DDgCO/2dsb2JhbABCpz14gVMBAQQBOhwjBQsLDjg5HgYTh3cCtRyGeASdW4cd X-IronPort-AV: E=Sophos;i="4.68,407,1312149600"; d="scan'208";a="120455150" Received: from 195-14-0-142.nuxit.net (HELO de558.ispfr.net) ([195.14.0.142]) by mail1-smtp-roc.national.inria.fr with ESMTP/TLS/AES256-SHA; 19 Sep 2011 22:17:27 +0200 Received: from ours.starynkevitch.net ([213.41.244.95] helo=glinka.lesours) by de558.ispfr.net with smtp (Exim 4.72) (envelope-from ) id 1R5kHN-0003iO-HG; Mon, 19 Sep 2011 22:17:25 +0200 Date: Mon, 19 Sep 2011 22:17:18 +0200 From: Basile Starynkevitch To: Ian Zimmerman Cc: caml-list@inria.fr Message-Id: <20110919221718.16929df0551299a720ca6def@starynkevitch.net> In-Reply-To: <87sjnsl2f0.fsf@foolinux.dyndns.org> References: <87sjnsl2f0.fsf@foolinux.dyndns.org> X-Mailer: Sylpheed 3.2.0beta3 (GTK+ 2.24.6; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Subject: Re: [Caml-list] Hashtbl or Set? On Mon, 19 Sep 2011 13:09:39 -0700 Ian Zimmerman wrote: > > I need a somewhat large (thousands) set of strings, created once during > startup and never modified after. You didn't tell us why you need such as set, and what kind of strings are inside. (I mean that if the strings are file contents -e.g. typically 100kbytes long-, it is different than if the strings are dictionnary words or source program identifiers - e.g. typically a dozen of bytes each-) > What is a better choice, a (string, unit) Hashtbl.t or the Set module? > If the Set module still uses trees > as it did when I was young :-), access will be logarithmic versus > constant for Hashtbl. But on the other hand a hash function must > examine all of every string while the comparison of 2 strings stops at > the first nonmatching character. Honestly, I believe that for practical purposes, unless you have very strange constraints, both approaches are good enough. Don't forget that too early optimization is evil. > I am thinking about the time to build the set as well as the probing > time. Make it a module, that is, your abstract data type. Then, if you discover later that this module is a bottleneck, optimize it further. Honestly, I think you should not care that much in the first place. (the real timing may also depend of memory cache considerations and sizes). Cheers. -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basilestarynkevitchnet mobile: +33 6 8501 2359 8, rue de la Faiencerie, 92340 Bourg La Reine, France *** opinions {are only mine, sont seulement les miennes} ***