From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <weis@pauillac.inria.fr>
X-Original-To: caml-list@yquem.inria.fr
Delivered-To: caml-list@yquem.inria.fr
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
	by yquem.inria.fr (Postfix) with ESMTP id 0E80CBB9C
	for <caml-list@yquem.inria.fr>; Thu, 17 Nov 2005 00:43:22 +0100 (CET)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAGNhLkS031644
	for <caml-list@yquem.inria.fr>; Thu, 17 Nov 2005 00:43:21 +0100
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id AAA02299 for <caml-list@pauillac.inria.fr>; Thu, 17 Nov 2005 00:43:21 +0100 (MET)
Received: from einhorn.in-berlin.de (einhorn.in-berlin.de [192.109.42.8])
	by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAGNhKpM030014
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL)
	for <caml-list@inria.fr>; Thu, 17 Nov 2005 00:43:20 +0100
X-Envelope-From: oliver@first.in-berlin.de
X-Envelope-To: <caml-list@inria.fr>
Received: from first.in-berlin.de (e178040227.adsl.alicedsl.de [85.178.40.227])
	(authenticated bits=0)
	by einhorn.in-berlin.de (8.12.10/8.12.10/Debian-4) with ESMTP id jAGNhJJo003427
	for <caml-list@inria.fr>; Thu, 17 Nov 2005 00:43:19 +0100
Received: by first.in-berlin.de (Postfix, from userid 501)
	id AA7B0192C70; Thu, 17 Nov 2005 00:42:40 +0100 (CET)
Date: Thu, 17 Nov 2005 00:42:39 +0100
From: Oliver Bandel <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: [1/2 OT] Indexing (and mergeable Index-algorithms)
Message-ID: <20051116234238.GA5741@first.in-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.6i
X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8
X-Miltered: at concorde with ID 437BC419.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Miltered: at nez-perce with ID 437BC418.001 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)!
X-Spam: no; 0.00; oliver:01 bandel:01 oliver:01 in-berlin:01 indexing:01 indexing:01 substrings:01 ocaml:01 ,...:98 exisiting:98 workaround:01 imperative:01 exists:01 btw:02 functional:02 
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr
X-Spam-Level: 
X-Spam-Status: No, score=0.1 required=5.0 tests=FORGED_RCVD_HELO 
	autolearn=disabled version=3.0.3

Hello,


I'm looking for indexing algorithms and especially - if
such a thing exists - mergeable/extendable indexing algorithms.

So, say we have 10^6 texts that we want ot have an index for,
to retrieve the text according to some parts of the text
(keywords, substrings,...).
We want to make an index from these texts.

After a while we get 10^5 new texts and want to extend
the exisiting index, so that the whole index not necessarily
must be created again, with the indexer-tool running on
all files (^10^6 + 10^5) again, but only have to index the new files,
but the big index can be extended with additional smaller indizes.

Is there something like that already existing?
Or must the new index be created on all files again,
or must there be a workaround with the big and a small index-file,
where handling of both would be a solution we must provide by ourselfes?

It's mainly a question on datastructures/algorithms, so this mailing list
may be the wrong, but the reason to aske here is: Are functional
datastructures in some way good for implementing such tools?


BTW: Let's mention that the application I intended to write is
     performance critical.... so, if functional datastructures
     are quite good for such extendable indexes, but are too slow,
     then thsi would also be a problem.


Any hints here?
(Maybe using OCaml, but the imperative features of it would help,
 if the functional features would be too slow?)

Any hint on algorithms/datastructures for this would be fine...


Thanks In Advance,
           Oliver