From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
From: skaller <skaller@users.sourceforge.net>
To: Brian Hurt <bhurt@spnz.org>
Cc: Oliver Bandel <oliver@first.in-berlin.de>, caml-list@inria.fr
In-Reply-To: <Pine.LNX.4.63.0511171459120.24132@localhost.localdomain>
References: <20051116234238.GA5741@first.in-berlin.de>
	 <1132215328.9775.110.camel@rosella>
	 <Pine.LNX.4.63.0511170839560.24132@localhost.localdomain>
	 <1132248698.9668.44.camel@rosella>
	 <Pine.LNX.4.63.0511171205570.24132@localhost.localdomain>
	 <1132253831.9668.116.camel@rosella>
	 <Pine.LNX.4.63.0511171459120.24132@localhost.localdomain>
Content-Type: text/plain
Date: Fri, 18 Nov 2005 12:49:55 +1100
Message-Id: <1132278595.9668.127.camel@rosella>
Mime-Version: 1.0
X-Mailer: Evolution 2.4.1 
Content-Transfer-Encoding: 7bit

On Thu, 2005-11-17 at 16:15 -0600, Brian Hurt wrote:

> 
> This is the worst possible case: each block is half full.  That means 
> that instead of log_k(N) blocks, you have to touch log_{k/2}(N) blocks.  
> So if N=2^32 and k=256, you need to read 5 blocks instead of 4 
> (128^5 = 2^35).  And the number of blocks you need has 
> about doubled.  Also note that the binary search per block is now cheaper 
> (by one step), and the cost of inserting elements is half.
> 
> So the question becomes: is the performance advantage gained by 
> rebalancing worth the cost?

Yes, that's the question. And there is no single answer :)

Note that it is not 5 reads instead of 4, it is 3 reads instead of 2
(assuming the first two levels are cached). So the extra level costes
50% more disk reads, not 25%.
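
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. The function names and the `cached_levels` parameter are made up
for illustration; it only counts levels and assumes one read per
uncached level.

```python
def levels_needed(n_keys: int, fanout: int) -> int:
    """Smallest h such that fanout**h >= n_keys (pure integer arithmetic)."""
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


def disk_reads(n_keys: int, fanout: int, cached_levels: int = 0) -> int:
    """Levels that still need a read once the top cached_levels are in memory."""
    return max(levels_needed(n_keys, fanout) - cached_levels, 0)


N = 2 ** 32   # number of keys, from the example quoted above
K = 256       # keys per full block, from the example quoted above

# Full blocks versus half-full blocks (effective fanout K // 2 = 128).
print(levels_needed(N, K), levels_needed(N, K // 2))

# Same comparison, assuming the first two levels are cached.
print(disk_reads(N, K, cached_levels=2), disk_reads(N, K // 2, cached_levels=2))
```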

A BTree system I used once was fixed at 3 levels. So how full the
blocks were could be kind of critical :)

> If I was worried about it, I'd be inclined to be more aggressive on merging 
> and splitting nodes.  Basically, if the node is under 5/8ths full, I'd look 
> to steal some children from siblings.  If the node is over 7/8ths full, I'd 
> look to share some children with siblings.  Note that if you have three nodes 
> each 1/2 full, you can combine the three into two nodes, each 3/4ths full. 
> You want to keep nodes about 3/4ths full, as that makes it cheaper to add 
> and delete elements.

Yup. There are lots of possible tweaks :)
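
One of those tweaks, the 5/8 and 7/8 thresholds quoted above, can be
sketched roughly as follows. This is only an illustration in Python:
nodes are plain sorted lists, parent separator keys are ignored, and the
names `action` and `merge_three_into_two` are invented here.

```python
from fractions import Fraction

LOW = Fraction(5, 8)    # below this fill, try to steal from a sibling
HIGH = Fraction(7, 8)   # above this fill, try to hand entries to a sibling


def action(count: int, capacity: int) -> str:
    """Classify a node by its fill ratio, using the quoted thresholds."""
    fill = Fraction(count, capacity)
    if fill < LOW:
        return "steal"
    if fill > HIGH:
        return "share"
    return "leave"


def merge_three_into_two(a: list, b: list, c: list) -> tuple:
    """Repack three adjacent siblings into two nodes of near-equal size.

    The siblings are assumed to be adjacent and in key order, so plain
    concatenation keeps the keys sorted.
    """
    keys = a + b + c
    mid = (len(keys) + 1) // 2
    return keys[:mid], keys[mid:]


# Three half-full nodes of capacity 8 become two nodes holding 6 keys each.
left, right = merge_three_into_two([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])
print(left, right, action(len(left), 8))
```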

> Two problems with this: first, what happens when the sibling is full too, 
> you can get into a case where an insert is O(N) cost, and second, this is 
> assuming inserts only (I can still get to worst-case with deletes).

Depends precisely on the algorithm -- mine only looked once.
If the sibling was full, you just split as usual. It's a cheap 
hack :)
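
A rough sketch of the "look once" idea, in Python for illustration only.
It is not the original code: leaves are plain sorted lists, parent keys
are ignored, and checking the right-hand sibling (rather than the left)
is an assumption. The name `insert_look_once` is made up.

```python
import bisect


def insert_look_once(leaves: list, idx: int, key, capacity: int) -> str:
    """Insert key into leaves[idx]; on overflow, check one sibling, else split.

    leaves is a list of sorted lists kept in key order. Returns which path
    was taken: "fit", "shifted" or "split".
    """
    leaf = leaves[idx]
    bisect.insort(leaf, key)
    if len(leaf) <= capacity:
        return "fit"

    # Overflow: look at the right sibling exactly once (assumed choice).
    has_right = idx + 1 < len(leaves)
    if has_right and len(leaves[idx + 1]) < capacity:
        # Move the largest key over; it is smaller than everything there.
        leaves[idx + 1].insert(0, leaf.pop())
        return "shifted"

    # Sibling missing or full: split as usual, never look any further.
    mid = len(leaf) // 2
    leaves[idx:idx + 1] = [leaf[:mid], leaf[mid:]]
    return "split"


leaves = [[1, 2, 3, 4], [10, 11]]
print(insert_look_once(leaves, 0, 0, capacity=4), leaves)
```

Because the insert never looks past one sibling, the work per insert
stays bounded and the O(N) cascade through full siblings cannot happen.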

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net