From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by yquem.inria.fr (Postfix) with ESMTP id B261BBB9C for ; Thu, 17 Nov 2005 19:57:30 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAHIvU3b004244 for ; Thu, 17 Nov 2005 19:57:30 +0100 Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id TAA25475 for ; Thu, 17 Nov 2005 19:57:29 +0100 (MET) Received: from smtp3.adl2.internode.on.net (smtp3.adl2.internode.on.net [203.16.214.203]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAHIvRW0004241 for ; Thu, 17 Nov 2005 19:57:28 +0100 Received: from rosella (ppp7-104.lns1.syd7.internode.on.net [59.167.7.104]) by smtp3.adl2.internode.on.net (8.12.9/8.12.6) with ESMTP id jAHIvBY7046188; Fri, 18 Nov 2005 05:27:16 +1030 (CST) (envelope-from skaller@users.sourceforge.net) Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) From: skaller To: Brian Hurt Cc: Oliver Bandel , caml-list@inria.fr In-Reply-To: References: <20051116234238.GA5741@first.in-berlin.de> <1132215328.9775.110.camel@rosella> <1132248698.9668.44.camel@rosella> Content-Type: text/plain Date: Fri, 18 Nov 2005 05:57:11 +1100 Message-Id: <1132253831.9668.116.camel@rosella> Mime-Version: 1.0 X-Mailer: Evolution 2.4.1 Content-Transfer-Encoding: 7bit X-Miltered: at nez-perce with ID 437CD29A.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at nez-perce with ID 437CD297.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; caml-list:01 indexing:01 yup:01 doubles:01 tweak:01 wrote:01 sourceforge:01 algorithm:01 data:02 data:02 tree:02 tree:02 tricky:02 brian:04 depends:04 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote: > > I'm not sure what it is we disagree on. > > They don't ever need global rebalancing. Yup, they do not need it to maintain the invariant you stated. But performance can still benefit from it significantly. Two quite distinct BTrees can contain the same data exactly, it depends on the order of insertion of keys. If you are clever, you can fill up a BTree so every block is exactly full .. however you don't need to be clever to get the worst possible BTree -- just fill an empty tree with sorted data and almost all the blocks are guaranteed to be half full (the worst possible case for storage use and access time). Yet, this is a common case in practice. ** all the blocks will be half full except those on the right edge of the tree -- for a tree of depth 5 that's millions of half full blocks, and only 5 that can possibly be more than half full :) And in between, blocks can be filled to different extents. Rebalancing adjusts this. The basic algorithm you describe has MANY tweaks which attempt to rebalance the tree. The version I played with did 'scrolling', where instead of splitting an overfull block, you scroll a key (and child) across to one of its siblings via the parent. This is quite tricky .. but it fixes the sorted insertion problem nicely -- with this modification, all the blocks are always FULL .. except those on the right most edge and their left siblings: they grow until they're both full, then the rightmost one (only) is split. So at worst, you waste 5 blocks out of millions.. basically this doubles the capacity of the tree (built from a sorted list). The same tweak can be used on random insertions the same way, however the extra reads and write may or may not be worth it, depending on how valuable your disk storage is (etc). -- John Skaller Felix, successor to C++: http://felix.sf.net