caml-list - the Caml user's mailing list
From: Oliver Bandel <oliver@first.in-berlin.de>
To: caml-list@inria.fr
Subject: Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
Date: Thu, 17 Nov 2005 21:10:43 +0100
Message-ID: <20051117201043.GA452@first.in-berlin.de>
In-Reply-To: <437BD5F5.6010307@1969web.com>

On Wed, Nov 16, 2005 at 04:59:33PM -0800, Karl Zilles wrote:
> Oliver Bandel wrote:
> >I'm looking for indexing algorithms and especially - if
> >such a thing exists - mergeable/extendable indexing algorithms.
> >
> >So, say we have 10^6 texts that we want to have an index for,
> >to retrieve the text according to some parts of the text
> >(keywords, substrings,...).
> >We want to make an index from these texts.
> >
> >After a while we get 10^5 new texts and want to extend
> >the existing index, so that the whole index does not necessarily
> >have to be created again, with the indexer tool running on
> >all files (10^6 + 10^5) again; instead we only have to index the
> >new files, and the big index is extended with additional smaller indices.
> >
> >Is there something like that already existing?
> >Or must the new index be created on all files again,
> >or must there be a workaround with the big and a small index-file,
> >where handling of both would be a solution we must provide by ourselves?
> 
> I wrote a text indexing system a while ago in C++.  Pardon me if none of 
> the following is of interest:
[...]
 
> The B* file gets big, but the locations file gets huge.  Using this 
> methodology, you only ever modify the B* file, and never have to 
> reprocess your indexed documents.  Also you can continue to search the 
> index while documents are being indexed.

Wow, that is fine! :)

But mmap, for example (you mentioned better/newer implementations),
can't be used for the whole index, because the index will definitely
be larger than the available memory. So mmap and similar techniques
can only be used for parts of the index (though that may still be
a good choice).
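
To illustrate mapping only a part of a large index file, here is a
minimal OCaml sketch. It is not taken from the indexing system
described above; the function name map_window and its parameters are
made up for this example. It assumes OCaml 4.08 or later, the unix
library, and a POSIX system where a mapping stays valid after the
file descriptor is closed.

```ocaml
(* Hypothetical helper, for illustration only.
   Needs the unix library, e.g.:
     ocamlfind ocamlopt -package unix -linkpkg window.ml *)

(* Map [len] bytes of the file at [path], starting at byte [offset].
   Only this window is mapped, not the whole file.
   The window must lie inside the file: if it reaches past the end,
   Unix.map_file tries to grow the file, which fails on a read-only
   descriptor. *)
let map_window path ~offset ~len =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  Fun.protect
    ~finally:(fun () -> Unix.close fd)
    (fun () ->
      (* shared = false: a private mapping, so nothing is written back. *)
      Unix.map_file fd ~pos:offset Bigarray.char Bigarray.c_layout false
        [| len |]
      |> Bigarray.array1_of_genarray)
```

A caller would pick offset and len per lookup (for example, one
B-tree node or one block of the locations file) and let the mapping
be collected once it is no longer referenced.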

The update time was critical, but with an index like the one you
described, where reading while updating is possible (and updating
is fast), this works very well. :)
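
The "big index plus small index" workaround from the original question
can be sketched as follows. This is a toy in-memory inverted index,
not the B*-file design described above; all names (add_doc, build,
lookup, lookup2, merge) are made up for this example, and splitting
on single spaces stands in for a real tokenizer. It assumes OCaml
4.08 or later.

```ocaml
module SMap = Map.Make (String)
module ISet = Set.Make (Int)

(* Inverted index: lowercased word -> set of document ids. *)
type index = ISet.t SMap.t

(* Add one document to an index. *)
let add_doc (idx : index) (doc_id : int) (text : string) : index =
  String.split_on_char ' ' text
  |> List.filter (fun w -> w <> "")
  |> List.fold_left
       (fun acc w ->
         SMap.update (String.lowercase_ascii w)
           (function
             | None -> Some (ISet.singleton doc_id)
             | Some ids -> Some (ISet.add doc_id ids))
           acc)
       idx

(* Build an index from (id, text) pairs. *)
let build (docs : (int * string) list) : index =
  List.fold_left (fun idx (id, text) -> add_doc idx id text) SMap.empty docs

(* Look up a word in a single index. *)
let lookup (idx : index) (word : string) : ISet.t =
  Option.value
    (SMap.find_opt (String.lowercase_ascii word) idx)
    ~default:ISet.empty

(* Query the big index and the small index together, without merging. *)
let lookup2 ~(big : index) ~(small : index) (word : string) : ISet.t =
  ISet.union (lookup big word) (lookup small word)

(* Fold the small index into the big one; no text is re-indexed. *)
let merge (a : index) (b : index) : index =
  SMap.union (fun _word ids1 ids2 -> Some (ISet.union ids1 ids2)) a b
```

Only the new texts go through build; the old ones are never read
again. Searches can use lookup2 until there is time to run merge.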


Best Regards, and Ciao,
                   Oliver


