From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id AE1F8BB9C for ; Thu, 17 Nov 2005 21:11:25 +0100 (CET) Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id jAHKBPXC002051 for ; Thu, 17 Nov 2005 21:11:25 +0100 Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id VAA27254 for ; Thu, 17 Nov 2005 21:11:24 +0100 (MET) Received: from einhorn.in-berlin.de (einhorn.in-berlin.de [192.109.42.8]) by nez-perce.inria.fr (8.13.0/8.13.0) with ESMTP id jAHKBOqj010817 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL) for ; Thu, 17 Nov 2005 21:11:24 +0100 X-Envelope-From: oliver@first.in-berlin.de X-Envelope-To: Received: from first.in-berlin.de (e178009010.adsl.alicedsl.de [85.178.9.10]) (authenticated bits=0) by einhorn.in-berlin.de (8.12.10/8.12.10/Debian-4) with ESMTP id jAHKBMt4014322 for ; Thu, 17 Nov 2005 21:11:22 +0100 Received: by first.in-berlin.de (Postfix, from userid 501) id 46A4C193C26; Thu, 17 Nov 2005 21:10:44 +0100 (CET) Date: Thu, 17 Nov 2005 21:10:43 +0100 From: Oliver Bandel To: caml-list@inria.fr Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms) Message-ID: <20051117201043.GA452@first.in-berlin.de> References: <20051116234238.GA5741@first.in-berlin.de> <437BD5F5.6010307@1969web.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <437BD5F5.6010307@1969web.com> User-Agent: Mutt/1.5.6i X-Scanned-By: MIMEDefang_at_IN-Berlin_e.V. on 192.109.42.8 X-Miltered: at concorde with ID 437CE3ED.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Miltered: at nez-perce with ID 437CE3EC.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; oliver:01 bandel:01 oliver:01 in-berlin:01 caml-list:01 indexing:01 bandel:01 indexing:01 substrings:01 ,...:98 exisiting:98 wrote:01 wrote:01 workaround:01 exists:01 X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on yquem.inria.fr X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=FORGED_RCVD_HELO autolearn=disabled version=3.0.3 On Wed, Nov 16, 2005 at 04:59:33PM -0800, Karl Zilles wrote: > Oliver Bandel wrote: > >I'm looking for indexing algorithms and especially - if > >such a thing exists - mergeable/extendable indexing algorithms. > > > >So, say we have 10^6 texts that we want ot have an index for, > >to retrieve the text according to some parts of the text > >(keywords, substrings,...). > >We want to make an index from these texts. > > > >After a while we get 10^5 new texts and want to extend > >the exisiting index, so that the whole index not necessarily > >must be created again, with the indexer-tool running on > >all files (^10^6 + 10^5) again, but only have to index the new files, > >but the big index can be extended with additional smaller indizes. > > > >Is there something like that already existing? > >Or must the new index be created on all files again, > >or must there be a workaround with the big and a small index-file, > >where handling of both would be a solution we must provide by ourselfes? > > I wrote a text indexing system a while ago in C++. Pardon me if none of > the following is of interest: [...] > The B* file gets big, but the locations file gets huge. Using this > methodology, you only ever modify the B* file, and never have to > reprocess your indexed documents. Also you can continue to search the > index while documents are being indexed. Wow, that is fine! :) But mmap for example (you mentioned better/newer implementations) can't be used for the whole index, because it definitely will be larger than the available memory. So mmap and such techniques can only be used for parts of the index (but it may be a good choice to use it). The update time was critical, but with indexes like you mentioned, where reading while updating is possible (and when updating is fast), this is very fine. :) Best Regards, and Ciao, Oliver