From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id 6BEACBB9C for ; Thu, 17 Nov 2005 12:50:00 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAHBnxsH008634 for ; Thu, 17 Nov 2005 12:50:00 +0100 Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id MAA14916 for ; Thu, 17 Nov 2005 12:49:59 +0100 (MET) Received: from mail.enyo.de (mail.enyo.de [212.9.189.167]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAHBnwex008627 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Thu, 17 Nov 2005 12:49:59 +0100 Received: from deneb.vpn.enyo.de ([212.9.189.177] helo=deneb.enyo.de) by albireo.enyo.de with esmtp id 1EciH7-0007UM-DE; Thu, 17 Nov 2005 12:49:57 +0100 Received: from fw by deneb.enyo.de with local (Exim 4.54) id 1EciH5-0003YZ-IH; Thu, 17 Nov 2005 12:49:55 +0100 From: Florian Weimer To: Oliver Bandel Cc: caml-list@inria.fr Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) References: <20051116234238.GA5741@first.in-berlin.de> Date: Thu, 17 Nov 2005 12:49:55 +0100 In-Reply-To: <20051116234238.GA5741@first.in-berlin.de> (Oliver Bandel's message of "Thu, 17 Nov 2005 00:42:39 +0100") Message-ID: <87r79fjxy4.fsf@mid.deneb.enyo.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Miltered: at concorde with ID 437C6E67.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at concorde with ID 437C6E66.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; caml-list:01 indexing:01 oliver:01 bandel:01 substrings:01 garbage:01 ocaml:01 reuse:01 ,...:98 exisiting:98 imperative:01 caml:02 match:02 florian:02 objective:02 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=disabled version=3.0.3 * Oliver Bandel: > So, say we have 10^6 texts that we want ot have an index for, > to retrieve the text according to some parts of the text > (keywords, substrings,...). > We want to make an index from these texts. B-trees are often used for that purpose. > After a while we get 10^5 new texts and want to extend > the exisiting index, so that the whole index not necessarily > must be created again, with the indexer-tool running on > all files (^10^6 + 10^5) again, but only have to index the new files, > but the big index can be extended with additional smaller indizes. 10^6 documents won't fit into available RAM. What's worse, you cannot afford to reindex everything just because the machine crashed. This means that you need some form of crash recovery (e.g. write-ahead logging). > Is there something like that already existing? Plenty. Berkeley DB, SQLite, full-blown SQL database servers like PostgreSQL or MySQL. The list is pretty long. > It's mainly a question on datastructures/algorithms, so this mailing list > may be the wrong, but the reason to aske here is: Are functional > datastructures in some way good for implementing such tools? They are called "log-structured databases". Typically, they offer superior write performance (especially if you can waste a lot of disk space and delay garbage collection), but read access cannot match performance of more traditional approaches (like B-trees plus write-ahead logging). > (Maybe using OCaml, but the imperative features of it would help, > if the functional features would be too slow?) If I were you, I'd reuse existing libraries (not written in Objective Caml), so this question wouldn't be relevant.