caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Brian Hurt <bhurt@spnz.org>
To: skaller <skaller@users.sourceforge.net>
Cc: Oliver Bandel <oliver@first.in-berlin.de>, caml-list@inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
Date: Thu, 17 Nov 2005 09:09:29 -0600 (CST)	[thread overview]
Message-ID: <Pine.LNX.4.63.0511170839560.24132@localhost.localdomain> (raw)
In-Reply-To: <1132215328.9775.110.camel@rosella>



On Thu, 17 Nov 2005, skaller wrote:

> Balanced BTree guarantees extremely small constant time lookup
> for all keys. (typically 3 or 4 accesses)
> Many updates are also fast constant time. However
> the Btrees are very expensive to rebalance, and occasionally
> an update requires a global rebalancing which brings the
> world to a complete stop for a very long time.

Not as I understand BTrees.  BTrees are generializations of 2,3,4 trees. 
Each "node" of the tree has between k/2 and k (for k > 3) children (except 
for the root node, which can have anywhere from 2 to k children).  When 
adding a new child to a given node, if the number of children exceeds k, 
then the node is split into two nodes, each with (k+1)/2 children (if k is 
even, one of the two nodes gets the extra).  The new sibling is then added 
to the parent of the original node.  When the root node is split, then a 
new level is added above the root node (note, changing the depth of the 
entire tree at the same time).  Likewise, when children are removed, if 
the node falls below k/2 children, it's merged with one of it's siblings. 
If the root node drops to only having 1 child, it's removed, and it's lone 
child becomes the new root, again changing the depth of the entire tree at 
once.

For in-memory data structures, Btrees are less efficient than standard 
balanced binary trees.  See, the problem is that you do a search on the 
BTree, you have to do a binary search on the children at each node.  So 
the cost of doing a search on a BTree with N elements is log base k of N 
nodes times log base 2 of k for the binary search in each node.  A little 
bit of algebra proves that this is equal to the log base 2 of N, the same 
cost as searching a binary tree (basically).  Worse yet, when inserting 
or deleting an object, the average cost is O(k)- you have to insert 
or remove elements into/from the middle of an array.  If k > log base 2 of 
N (likely), then the standard balanced binary tree will be faster on 
inserts and deletes.  Where BTrees shine is in disk based data structures. 
Here, the main bottleneck is reading data off the disk.  For N=2^32 and 
k=256, a standard balanced binary tree would require up to 64 disk reads, 
the BTree only 5.

> Some modern variants amortise the costs differently,
> typically reducing the worst case at the expense
> of the other operations. In particular, the very
> worst way to populate a BTree from a list is if
> the list is already sorted, however a smart algorithm
> can build an empty skeleton first and populate it
> in linear time (provided only you know
> how many keys there are). Obviously, the tree
> has to be offline until this operation is completed.

Actually, if the data is already sorted, creating the tree should be O(N) 
cost.  At each level, you're adding children to the same node.  You keep 
adding children until you hit a limit (I'd suggest 3k/4 as a good limit). 
When one node is full, you create a new node (adding it to the next layer 
up) and start adding elements to it.  When you're done adding children, if 
the current node you're adding to has less than k/2 nodes, it gets merged 
with it's previous sibling.  Proving that this is O(N) total cost is left 
as an exercise to the reader.

Brian


  reply	other threads:[~2005-11-17 15:04 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-11-16 23:42 Oliver Bandel
2005-11-17  8:15 ` [Caml-list] " skaller
2005-11-17 15:09   ` Brian Hurt [this message]
2005-11-17 17:31     ` skaller
2005-11-17 18:08       ` Brian Hurt
2005-11-17 18:57         ` skaller
2005-11-17 22:15           ` Brian Hurt
2005-11-18  1:49             ` skaller
2005-11-17  8:35 ` Florian Hars
2005-11-17  9:24   ` Oliver Bandel
2005-11-17 12:39     ` Florian Weimer
2005-11-17 20:57       ` Oliver Bandel
2005-11-17 22:02         ` Florian Weimer
2005-11-17 11:49 ` Florian Weimer
2005-11-17 13:55   ` Richard Jones
2005-11-18 14:54   ` Jonathan Bryant
2005-11-18 14:22     ` Oliver Bandel
2005-11-18 14:37       ` Florian Weimer
2005-11-18 15:05         ` Thomas Fischbacher
2005-11-18 15:14           ` Florian Weimer
2005-11-18 16:03             ` Thomas Fischbacher
2005-11-18 20:03               ` Gerd Stolpmann
2005-11-18 20:01             ` Gerd Stolpmann
2005-11-18 21:12               ` Florian Weimer
2005-11-18 16:13         ` Oliver Bandel
2005-11-18 14:45     ` Florian Weimer
     [not found] ` <437CD0E5.8080503@yahoo.fr>
2005-11-17 20:02   ` Oliver Bandel
     [not found]     ` <437CE8EC.1070109@yahoo.fr>
2005-11-17 20:41       ` Oliver Bandel
2005-11-18 15:06         ` Florian Hars
     [not found] ` <437BD5F5.6010307@1969web.com>
2005-11-17 20:10   ` Oliver Bandel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.63.0511170839560.24132@localhost.localdomain \
    --to=bhurt@spnz.org \
    --cc=caml-list@inria.fr \
    --cc=oliver@first.in-berlin.de \
    --cc=skaller@users.sourceforge.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).