From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by yquem.inria.fr (Postfix) with ESMTP id 7FFFBBB9C
	for <caml-list@yquem.inria.fr>; Thu, 17 Nov 2005 23:10:24 +0100 (CET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAHMANej021123
	for <caml-list@yquem.inria.fr>; Thu, 17 Nov 2005 23:10:23 +0100
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id XAA29415 for <caml-list@pauillac.inria.fr>; Thu, 17 Nov 2005 23:10:23 +0100 (MET)
Received: from conn.mc.mpls.visi.com (conn.mc.mpls.visi.com [208.42.156.2])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAHMAMxk014777
	for <caml-list@inria.fr>; Thu, 17 Nov 2005 23:10:22 +0100
Received: from [192.168.42.2] (bhurt.dsl.visi.com [208.42.141.66])
	by conn.mc.mpls.visi.com (Postfix) with ESMTP
	id E40978F2A; Thu, 17 Nov 2005 16:10:21 -0600 (CST)
Date: Thu, 17 Nov 2005 16:15:05 -0600 (CST)
From: Brian Hurt <bhurt@spnz.org>
X-X-Sender: bhurt@localhost.localdomain
To: skaller <skaller@users.sourceforge.net>
Cc: Oliver Bandel <oliver@first.in-berlin.de>, caml-list@inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
In-Reply-To: <1132253831.9668.116.camel@rosella>
Message-ID: <Pine.LNX.4.63.0511171459120.24132@localhost.localdomain>
References: <20051116234238.GA5741@first.in-berlin.de>  <1132215328.9775.110.camel@rosella>
  <Pine.LNX.4.63.0511170839560.24132@localhost.localdomain> 
 <1132248698.9668.44.camel@rosella>  <Pine.LNX.4.63.0511171205570.24132@localhost.localdomain>
 <1132253831.9668.116.camel@rosella>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Miltered: at nez-perce with ID 437CFFCF.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at concorde with ID 437CFFCE.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; caml-list:01 indexing:01 yup:01 binary:01 inserting:01 merging:01 nodes:01 node:01 node:01 nodes:01 doubles:01 worst-case:01 wrote:01 wrote:01 data:02 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled 
	version=3.0.3


On Fri, 18 Nov 2005, skaller wrote:

> On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote:
>
>>> I'm not sure what it is we disagree on.
>>
>> They don't ever need global rebalancing.
>
> Yup, they do not need it to maintain the invariant
> you stated. But performance can still benefit from it
> significantly.
>
> Two quite distinct BTrees can contain the same data exactly,
> it depends on the order of insertion of keys. If you are
> clever, you can fill up a BTree so every block is exactly
> full .. however you don't need to be clever to get the worst
> possible BTree -- just fill an empty tree with sorted
> data and almost all the blocks are guaranteed to be half
> full (the worst possible case for storage use and
> access time). Yet, this is a common case in practice.
> ** all the blocks will be half full except those on
> the right edge of the tree -- for a tree of depth 5
> that's millions of half full blocks, and only 5 that
> can possibly be more than half full :)

This is the worst possible case- that each block is half full.  Which 
means that instead of log_k(N) blocks, you're having to touch log_{k/2}(N) 
blocks.  This means that if N=2^32 and k=256, that you need to read 5 
blocks instead of 4 (128^5 = 2^35).  And the number of blocks you need has 
about doubled.  Also note that the binary search per block is now cheaper 
(by one step), and the cost of inserting elements is half.

So the question becomes: is the performance advantage gained by 
rebalancing worth the cost?

If I was worried about it, I'd be inclined to be more agressive on merging 
and splitting nodes.  Basically, if the node is under 5/8th full, I'd look 
to steal some children from siblings.  If the node is over 7/8th full, I'd 
look to share some child with siblings.  Note that if you have three nodes 
each 1/2 full, you can combine the three into two nodes, each 3/4th full. 
You want to keep nodes about 3/4th full, as that makes it cheaper to add 
and delete elements.

> The version I played with did 'scrolling', where
> instead of splitting an overfull block, you scroll
> a key (and child) across to one of its siblings
> via the parent. This is quite tricky ..  but it
> fixes the sorted insertion problem nicely --
> with this modification, all the blocks are
> always FULL .. except those on the right most
> edge and their left siblings: they grow until
> they're both full, then the rightmost one (only)
> is split. So at worst, you waste 5 blocks
> out of millions.. basically this doubles the
> capacity of the tree (built from a sorted list).

Two problems with this: first, what happens when the sibling is full too, 
you can get into a case where an insert is O(N) cost, and second, this is 
assuming inserts only (I can still get to worst-case with deletes).

Brian