caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: skaller <skaller@users.sourceforge.net>
To: Oliver Bandel <oliver@first.in-berlin.de>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
Date: Thu, 17 Nov 2005 19:15:28 +1100	[thread overview]
Message-ID: <1132215328.9775.110.camel@rosella> (raw)
In-Reply-To: <20051116234238.GA5741@first.in-berlin.de>

On Thu, 2005-11-17 at 00:42 +0100, Oliver Bandel wrote:

> So, say we have 10^6 texts that we want ot have an index for,
> to retrieve the text according to some parts of the text
> (keywords, substrings,...).
> We want to make an index from these texts.

BTree or modern equivalent may be an option.

NTFS directories are BTrees.

Balanced BTree guarantees extremely small constant time lookup
for all keys. (typically 3 or 4 accesses) 
Many updates are also fast constant time. However
the Btrees are very expensive to rebalance, and occasionally
an update requires a global rebalancing which brings the
world to a complete stop for a very long time.

BTree is ideal for large data structures, since the nodes
are chosen to match up with disk block sizes, and the underlying
I/O is done with low level I/O calls, that is, BTree is excellent
for hierarchical storage media (such as virtual memory).

One system I worked on used BTrees with spill list for
those updates which would require a rebalancing.
The merge/rebalancing is then done whilst the tree
is offline, accesses are slowed down by the spill
list until the rebalancing is done.

Btree is probably good for a search engine where you 
run two trees on two servers and do the rebalancing
on the offline one (then swap servers).

Some modern variants amortise the costs differently,
typically reducing the worst case at the expense
of the other operations. In particular, the very
worst way to populate a BTree from a list is if
the list is already sorted, however a smart algorithm
can build an empty skeleton first and populate it
in linear time (provided only you know
how many keys there are). Obviously, the tree
has to be offline until this operation is completed.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


  reply	other threads:[~2005-11-17  8:15 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-11-16 23:42 Oliver Bandel
2005-11-17  8:15 ` skaller [this message]
2005-11-17 15:09   ` [Caml-list] " Brian Hurt
2005-11-17 17:31     ` skaller
2005-11-17 18:08       ` Brian Hurt
2005-11-17 18:57         ` skaller
2005-11-17 22:15           ` Brian Hurt
2005-11-18  1:49             ` skaller
2005-11-17  8:35 ` Florian Hars
2005-11-17  9:24   ` Oliver Bandel
2005-11-17 12:39     ` Florian Weimer
2005-11-17 20:57       ` Oliver Bandel
2005-11-17 22:02         ` Florian Weimer
2005-11-17 11:49 ` Florian Weimer
2005-11-17 13:55   ` Richard Jones
2005-11-18 14:54   ` Jonathan Bryant
2005-11-18 14:22     ` Oliver Bandel
2005-11-18 14:37       ` Florian Weimer
2005-11-18 15:05         ` Thomas Fischbacher
2005-11-18 15:14           ` Florian Weimer
2005-11-18 16:03             ` Thomas Fischbacher
2005-11-18 20:03               ` Gerd Stolpmann
2005-11-18 20:01             ` Gerd Stolpmann
2005-11-18 21:12               ` Florian Weimer
2005-11-18 16:13         ` Oliver Bandel
2005-11-18 14:45     ` Florian Weimer
     [not found] ` <437CD0E5.8080503@yahoo.fr>
2005-11-17 20:02   ` Oliver Bandel
     [not found]     ` <437CE8EC.1070109@yahoo.fr>
2005-11-17 20:41       ` Oliver Bandel
2005-11-18 15:06         ` Florian Hars
     [not found] ` <437BD5F5.6010307@1969web.com>
2005-11-17 20:10   ` Oliver Bandel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1132215328.9775.110.camel@rosella \
    --to=skaller@users.sourceforge.net \
    --cc=caml-list@inria.fr \
    --cc=oliver@first.in-berlin.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).