caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Florian Weimer <fw@deneb.enyo.de>
To: Oliver Bandel <oliver@first.in-berlin.de>
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
Date: Thu, 17 Nov 2005 12:49:55 +0100	[thread overview]
Message-ID: <87r79fjxy4.fsf@mid.deneb.enyo.de> (raw)
In-Reply-To: <20051116234238.GA5741@first.in-berlin.de> (Oliver Bandel's message of "Thu, 17 Nov 2005 00:42:39 +0100")

* Oliver Bandel:

> So, say we have 10^6 texts that we want ot have an index for,
> to retrieve the text according to some parts of the text
> (keywords, substrings,...).
> We want to make an index from these texts.

B-trees are often used for that purpose.

> After a while we get 10^5 new texts and want to extend
> the exisiting index, so that the whole index not necessarily
> must be created again, with the indexer-tool running on
> all files (^10^6 + 10^5) again, but only have to index the new files,
> but the big index can be extended with additional smaller indizes.

10^6 documents won't fit into available RAM.  What's worse, you cannot
afford to reindex everything just because the machine crashed.  This
means that you need some form of crash recovery (e.g. write-ahead
logging).

> Is there something like that already existing?

Plenty.  Berkeley DB, SQLite, full-blown SQL database servers like
PostgreSQL or MySQL.  The list is pretty long.

> It's mainly a question on datastructures/algorithms, so this mailing list
> may be the wrong, but the reason to aske here is: Are functional
> datastructures in some way good for implementing such tools?

They are called "log-structured databases".  Typically, they offer
superior write performance (especially if you can waste a lot of disk
space and delay garbage collection), but read access cannot match
performance of more traditional approaches (like B-trees plus
write-ahead logging).

> (Maybe using OCaml, but the imperative features of it would help,
>  if the functional features would be too slow?)

If I were you, I'd reuse existing libraries (not written in Objective
Caml), so this question wouldn't be relevant.


  parent reply	other threads:[~2005-11-17 11:50 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-11-16 23:42 Oliver Bandel
2005-11-17  8:15 ` [Caml-list] " skaller
2005-11-17 15:09   ` Brian Hurt
2005-11-17 17:31     ` skaller
2005-11-17 18:08       ` Brian Hurt
2005-11-17 18:57         ` skaller
2005-11-17 22:15           ` Brian Hurt
2005-11-18  1:49             ` skaller
2005-11-17  8:35 ` Florian Hars
2005-11-17  9:24   ` Oliver Bandel
2005-11-17 12:39     ` Florian Weimer
2005-11-17 20:57       ` Oliver Bandel
2005-11-17 22:02         ` Florian Weimer
2005-11-17 11:49 ` Florian Weimer [this message]
2005-11-17 13:55   ` Richard Jones
2005-11-18 14:54   ` Jonathan Bryant
2005-11-18 14:22     ` Oliver Bandel
2005-11-18 14:37       ` Florian Weimer
2005-11-18 15:05         ` Thomas Fischbacher
2005-11-18 15:14           ` Florian Weimer
2005-11-18 16:03             ` Thomas Fischbacher
2005-11-18 20:03               ` Gerd Stolpmann
2005-11-18 20:01             ` Gerd Stolpmann
2005-11-18 21:12               ` Florian Weimer
2005-11-18 16:13         ` Oliver Bandel
2005-11-18 14:45     ` Florian Weimer
     [not found] ` <437CD0E5.8080503@yahoo.fr>
2005-11-17 20:02   ` Oliver Bandel
     [not found]     ` <437CE8EC.1070109@yahoo.fr>
2005-11-17 20:41       ` Oliver Bandel
2005-11-18 15:06         ` Florian Hars
     [not found] ` <437BD5F5.6010307@1969web.com>
2005-11-17 20:10   ` Oliver Bandel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r79fjxy4.fsf@mid.deneb.enyo.de \
    --to=fw@deneb.enyo.de \
    --cc=caml-list@inria.fr \
    --cc=oliver@first.in-berlin.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).