From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id D9A70BB9C for ; Thu, 17 Nov 2005 09:15:48 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAH8Flgf013659 for ; Thu, 17 Nov 2005 09:15:48 +0100 Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id JAA11235 for ; Thu, 17 Nov 2005 09:15:47 +0100 (MET) Received: from ash25e.internode.on.net (ash25e.internode.on.net [203.16.214.182]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAH8FjHL003012 for ; Thu, 17 Nov 2005 09:15:46 +0100 Received: from rosella (ppp7-104.lns1.syd7.internode.on.net [59.167.7.104]) by ash25e.internode.on.net (8.12.9/8.12.6) with ESMTP id jAH8FT0s075463; Thu, 17 Nov 2005 18:45:34 +1030 (CST) (envelope-from skaller@users.sourceforge.net) Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) From: skaller To: Oliver Bandel Cc: caml-list@inria.fr In-Reply-To: <20051116234238.GA5741@first.in-berlin.de> References: <20051116234238.GA5741@first.in-berlin.de> Content-Type: text/plain Date: Thu, 17 Nov 2005 19:15:28 +1100 Message-Id: <1132215328.9775.110.camel@rosella> Mime-Version: 1.0 X-Mailer: Evolution 2.4.1 Content-Transfer-Encoding: 7bit X-Miltered: at concorde with ID 437C3C33.002 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at nez-perce with ID 437C3C31.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; caml-list:01 indexing:01 oliver:01 bandel:01 substrings:01 nodes:01 variants:01 ,...:98 wrote:01 sourceforge:01 structures:01 algorithm:01 underlying:01 hierarchical:02 data:02 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 On Thu, 2005-11-17 at 00:42 +0100, Oliver Bandel wrote: > So, say we have 10^6 texts that we want ot have an index for, > to retrieve the text according to some parts of the text > (keywords, substrings,...). > We want to make an index from these texts. BTree or modern equivalent may be an option. NTFS directories are BTrees. Balanced BTree guarantees extremely small constant time lookup for all keys. (typically 3 or 4 accesses) Many updates are also fast constant time. However the Btrees are very expensive to rebalance, and occasionally an update requires a global rebalancing which brings the world to a complete stop for a very long time. BTree is ideal for large data structures, since the nodes are chosen to match up with disk block sizes, and the underlying I/O is done with low level I/O calls, that is, BTree is excellent for hierarchical storage media (such as virtual memory). One system I worked on used BTrees with spill list for those updates which would require a rebalancing. The merge/rebalancing is then done whilst the tree is offline, accesses are slowed down by the spill list until the rebalancing is done. Btree is probably good for a search engine where you run two trees on two servers and do the rebalancing on the offline one (then swap servers). Some modern variants amortise the costs differently, typically reducing the worst case at the expense of the other operations. In particular, the very worst way to populate a BTree from a list is if the list is already sorted, however a smart algorithm can build an empty skeleton first and populate it in linear time (provided only you know how many keys there are). Obviously, the tree has to be offline until this operation is completed. -- John Skaller Felix, successor to C++: http://felix.sf.net