caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [1/2 OT] Indexing (and mergeable Index-algorithms)
@ 2005-11-16 23:42 Oliver Bandel
  2005-11-17  8:15 ` [Caml-list] " skaller
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Oliver Bandel @ 2005-11-16 23:42 UTC (permalink / raw)
  To: caml-list

Hello,


I'm looking for indexing algorithms and especially - if
such a thing exists - mergeable/extendable indexing algorithms.

So, say we have 10^6 texts that we want ot have an index for,
to retrieve the text according to some parts of the text
(keywords, substrings,...).
We want to make an index from these texts.

After a while we get 10^5 new texts and want to extend
the exisiting index, so that the whole index not necessarily
must be created again, with the indexer-tool running on
all files (^10^6 + 10^5) again, but only have to index the new files,
but the big index can be extended with additional smaller indizes.

Is there something like that already existing?
Or must the new index be created on all files again,
or must there be a workaround with the big and a small index-file,
where handling of both would be a solution we must provide by ourselfes?

It's mainly a question on datastructures/algorithms, so this mailing list
may be the wrong, but the reason to aske here is: Are functional
datastructures in some way good for implementing such tools?


BTW: Let's mention that the application I intended to write is
     performance critical.... so, if functional datastructures
     are quite good for such extendable indexes, but are too slow,
     then thsi would also be a problem.


Any hints here?
(Maybe using OCaml, but the imperative features of it would help,
 if the functional features would be too slow?)

Any hint on algorithms/datastructures for this would be fine...


Thanks In Advance,
           Oliver



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel
@ 2005-11-17  8:15 ` skaller
  2005-11-17 15:09   ` Brian Hurt
  2005-11-17  8:35 ` Florian Hars
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: skaller @ 2005-11-17  8:15 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

On Thu, 2005-11-17 at 00:42 +0100, Oliver Bandel wrote:

> So, say we have 10^6 texts that we want ot have an index for,
> to retrieve the text according to some parts of the text
> (keywords, substrings,...).
> We want to make an index from these texts.

BTree or modern equivalent may be an option.

NTFS directories are BTrees.

Balanced BTree guarantees extremely small constant time lookup
for all keys. (typically 3 or 4 accesses) 
Many updates are also fast constant time. However
the Btrees are very expensive to rebalance, and occasionally
an update requires a global rebalancing which brings the
world to a complete stop for a very long time.

BTree is ideal for large data structures, since the nodes
are chosen to match up with disk block sizes, and the underlying
I/O is done with low level I/O calls, that is, BTree is excellent
for hierarchical storage media (such as virtual memory).

One system I worked on used BTrees with spill list for
those updates which would require a rebalancing.
The merge/rebalancing is then done whilst the tree
is offline, accesses are slowed down by the spill
list until the rebalancing is done.

Btree is probably good for a search engine where you 
run two trees on two servers and do the rebalancing
on the offline one (then swap servers).

Some modern variants amortise the costs differently,
typically reducing the worst case at the expense
of the other operations. In particular, the very
worst way to populate a BTree from a list is if
the list is already sorted, however a smart algorithm
can build an empty skeleton first and populate it
in linear time (provided only you know
how many keys there are). Obviously, the tree
has to be offline until this operation is completed.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel
  2005-11-17  8:15 ` [Caml-list] " skaller
@ 2005-11-17  8:35 ` Florian Hars
  2005-11-17  9:24   ` Oliver Bandel
  2005-11-17 11:49 ` Florian Weimer
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Florian Hars @ 2005-11-17  8:35 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

Oliver Bandel wrote:
> It's mainly a question on datastructures/algorithms

I tend to try to find answers to such questions on CiteSeer, maybe you could 
start at the first paper I found with a quick search:
http://citeseer.ist.psu.edu/cutting90optimizations.html
and then look at the papers citing it, or similar to it.

(WARNING: Excessive use of CiteSeer may lead to addiction.)

Yours, Florian.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17  8:35 ` Florian Hars
@ 2005-11-17  9:24   ` Oliver Bandel
  2005-11-17 12:39     ` Florian Weimer
  0 siblings, 1 reply; 30+ messages in thread
From: Oliver Bandel @ 2005-11-17  9:24 UTC (permalink / raw)
  To: caml-list

On Thu, Nov 17, 2005 at 09:35:58AM +0100, Florian Hars wrote:
> Oliver Bandel wrote:                                                                                                   
> >It's mainly a question on datastructures/algorithms                                                                   
>                                                                                                                        
> I tend to try to find answers to such questions on CiteSeer, maybe you                                                 
> could start at the first paper I found with a quick search:                                                            
> http://citeseer.ist.psu.edu/cutting90optimizations.html                                                                
> and then look at the papers citing it, or similar to it.                                                               


well, thats, where my further search directed me to. :)

I found an interesting paper there, about using updatable
indexing ("Fast Incremental Indexing for Full-Text Information Retrieval"
from Brown/Callen/Croft.) They talked about "inverted lists",
and this together with other hints from this list may be
a good starter.


>                                                                                                                        
> (WARNING: Excessive use of CiteSeer may lead to addiction.)                                                            

Yes, that's true.
I was an addict and hope to get clean, but now google
(and you too) directed me back... ;-)


Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel
  2005-11-17  8:15 ` [Caml-list] " skaller
  2005-11-17  8:35 ` Florian Hars
@ 2005-11-17 11:49 ` Florian Weimer
  2005-11-17 13:55   ` Richard Jones
  2005-11-18 14:54   ` Jonathan Bryant
       [not found] ` <437CD0E5.8080503@yahoo.fr>
       [not found] ` <437BD5F5.6010307@1969web.com>
  4 siblings, 2 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-17 11:49 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

* Oliver Bandel:

> So, say we have 10^6 texts that we want ot have an index for,
> to retrieve the text according to some parts of the text
> (keywords, substrings,...).
> We want to make an index from these texts.

B-trees are often used for that purpose.

> After a while we get 10^5 new texts and want to extend
> the exisiting index, so that the whole index not necessarily
> must be created again, with the indexer-tool running on
> all files (^10^6 + 10^5) again, but only have to index the new files,
> but the big index can be extended with additional smaller indizes.

10^6 documents won't fit into available RAM.  What's worse, you cannot
afford to reindex everything just because the machine crashed.  This
means that you need some form of crash recovery (e.g. write-ahead
logging).

> Is there something like that already existing?

Plenty.  Berkeley DB, SQLite, full-blown SQL database servers like
PostgreSQL or MySQL.  The list is pretty long.

> It's mainly a question on datastructures/algorithms, so this mailing list
> may be the wrong, but the reason to aske here is: Are functional
> datastructures in some way good for implementing such tools?

They are called "log-structured databases".  Typically, they offer
superior write performance (especially if you can waste a lot of disk
space and delay garbage collection), but read access cannot match
performance of more traditional approaches (like B-trees plus
write-ahead logging).

> (Maybe using OCaml, but the imperative features of it would help,
>  if the functional features would be too slow?)

If I were you, I'd reuse existing libraries (not written in Objective
Caml), so this question wouldn't be relevant.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17  9:24   ` Oliver Bandel
@ 2005-11-17 12:39     ` Florian Weimer
  2005-11-17 20:57       ` Oliver Bandel
  0 siblings, 1 reply; 30+ messages in thread
From: Florian Weimer @ 2005-11-17 12:39 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

* Oliver Bandel:

> I found an interesting paper there, about using updatable indexing
> ("Fast Incremental Indexing for Full-Text Information Retrieval"
> from Brown/Callen/Croft.) They talked about "inverted lists", and
> this together with other hints from this list may be a good starter.

If you need a full-text search capability on natural language
documents, there are various libraries you could reuse: Xapian, Lucene
(although this is one is writtein Java, IIRC), Estraier, OpenFTS, a
MySQL component, and probably many more.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 11:49 ` Florian Weimer
@ 2005-11-17 13:55   ` Richard Jones
  2005-11-18 14:54   ` Jonathan Bryant
  1 sibling, 0 replies; 30+ messages in thread
From: Richard Jones @ 2005-11-17 13:55 UTC (permalink / raw)
  To: caml-list

On Thu, Nov 17, 2005 at 12:49:55PM +0100, Florian Weimer wrote:
> Plenty.  Berkeley DB, SQLite, full-blown SQL database servers like
> PostgreSQL or MySQL.  The list is pretty long.

We use PostgreSQL's tsearch2[1] module to index web pages across our
main site and customer sites.  Today we have 38,437 pages including
old versions in the index.

Pros:

* Extremely easy to use - you just insert pages as rows in the database.
* Very featureful - does stemming, multiple language support, etc.
* Works from OCaml using, eg., ocamldbi, OCaml-PostgreSQL module.

Cons:

* Quite hard to install - you need to read the documentation carefully.
* Slow for lookups - I haven't quite got to the bottom of this so I
  don't know if it's inherently slow or if I haven't set up the indexes
  right.

Rich.

[1] http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro

-- 
Richard Jones, CTO Merjis Ltd.
Merjis - web marketing and technology - http://merjis.com
Team Notepad - intranets and extranets for business - http://team-notepad.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17  8:15 ` [Caml-list] " skaller
@ 2005-11-17 15:09   ` Brian Hurt
  2005-11-17 17:31     ` skaller
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Hurt @ 2005-11-17 15:09 UTC (permalink / raw)
  To: skaller; +Cc: Oliver Bandel, caml-list



On Thu, 17 Nov 2005, skaller wrote:

> Balanced BTree guarantees extremely small constant time lookup
> for all keys. (typically 3 or 4 accesses)
> Many updates are also fast constant time. However
> the Btrees are very expensive to rebalance, and occasionally
> an update requires a global rebalancing which brings the
> world to a complete stop for a very long time.

Not as I understand BTrees.  BTrees are generializations of 2,3,4 trees. 
Each "node" of the tree has between k/2 and k (for k > 3) children (except 
for the root node, which can have anywhere from 2 to k children).  When 
adding a new child to a given node, if the number of children exceeds k, 
then the node is split into two nodes, each with (k+1)/2 children (if k is 
even, one of the two nodes gets the extra).  The new sibling is then added 
to the parent of the original node.  When the root node is split, then a 
new level is added above the root node (note, changing the depth of the 
entire tree at the same time).  Likewise, when children are removed, if 
the node falls below k/2 children, it's merged with one of it's siblings. 
If the root node drops to only having 1 child, it's removed, and it's lone 
child becomes the new root, again changing the depth of the entire tree at 
once.

For in-memory data structures, Btrees are less efficient than standard 
balanced binary trees.  See, the problem is that you do a search on the 
BTree, you have to do a binary search on the children at each node.  So 
the cost of doing a search on a BTree with N elements is log base k of N 
nodes times log base 2 of k for the binary search in each node.  A little 
bit of algebra proves that this is equal to the log base 2 of N, the same 
cost as searching a binary tree (basically).  Worse yet, when inserting 
or deleting an object, the average cost is O(k)- you have to insert 
or remove elements into/from the middle of an array.  If k > log base 2 of 
N (likely), then the standard balanced binary tree will be faster on 
inserts and deletes.  Where BTrees shine is in disk based data structures. 
Here, the main bottleneck is reading data off the disk.  For N=2^32 and 
k=256, a standard balanced binary tree would require up to 64 disk reads, 
the BTree only 5.

> Some modern variants amortise the costs differently,
> typically reducing the worst case at the expense
> of the other operations. In particular, the very
> worst way to populate a BTree from a list is if
> the list is already sorted, however a smart algorithm
> can build an empty skeleton first and populate it
> in linear time (provided only you know
> how many keys there are). Obviously, the tree
> has to be offline until this operation is completed.

Actually, if the data is already sorted, creating the tree should be O(N) 
cost.  At each level, you're adding children to the same node.  You keep 
adding children until you hit a limit (I'd suggest 3k/4 as a good limit). 
When one node is full, you create a new node (adding it to the next layer 
up) and start adding elements to it.  When you're done adding children, if 
the current node you're adding to has less than k/2 nodes, it gets merged 
with it's previous sibling.  Proving that this is O(N) total cost is left 
as an exercise to the reader.

Brian


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 15:09   ` Brian Hurt
@ 2005-11-17 17:31     ` skaller
  2005-11-17 18:08       ` Brian Hurt
  0 siblings, 1 reply; 30+ messages in thread
From: skaller @ 2005-11-17 17:31 UTC (permalink / raw)
  To: Brian Hurt; +Cc: Oliver Bandel, caml-list

On Thu, 2005-11-17 at 09:09 -0600, Brian Hurt wrote:
> 
> On Thu, 17 Nov 2005, skaller wrote:
> 
> > Balanced BTree guarantees extremely small constant time lookup
> > for all keys. (typically 3 or 4 accesses)
> > Many updates are also fast constant time. However
> > the Btrees are very expensive to rebalance, and occasionally
> > an update requires a global rebalancing which brings the
> > world to a complete stop for a very long time.
> 
> Not as I understand BTrees. 

I'm not sure what it is we disagree on. 

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 17:31     ` skaller
@ 2005-11-17 18:08       ` Brian Hurt
  2005-11-17 18:57         ` skaller
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Hurt @ 2005-11-17 18:08 UTC (permalink / raw)
  To: skaller; +Cc: Oliver Bandel, caml-list



On Fri, 18 Nov 2005, skaller wrote:

> On Thu, 2005-11-17 at 09:09 -0600, Brian Hurt wrote:
>>
>> On Thu, 17 Nov 2005, skaller wrote:
>>
>>> Balanced BTree guarantees extremely small constant time lookup
>>> for all keys. (typically 3 or 4 accesses)
>>> Many updates are also fast constant time. However
>>> the Btrees are very expensive to rebalance, and occasionally
>>> an update requires a global rebalancing which brings the
>>> world to a complete stop for a very long time.
>>
>> Not as I understand BTrees.
>
> I'm not sure what it is we disagree on.

They don't ever need global rebalancing.  Since levels are only added or 
removed at the top, the distance between the root and any leaf is the same 
as any other leaf, at all times.

Brian


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 18:08       ` Brian Hurt
@ 2005-11-17 18:57         ` skaller
  2005-11-17 22:15           ` Brian Hurt
  0 siblings, 1 reply; 30+ messages in thread
From: skaller @ 2005-11-17 18:57 UTC (permalink / raw)
  To: Brian Hurt; +Cc: Oliver Bandel, caml-list

On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote:

> > I'm not sure what it is we disagree on.
> 
> They don't ever need global rebalancing.

Yup, they do not need it to maintain the invariant
you stated. But performance can still benefit from it
significantly.

Two quite distinct BTrees can contain the same data exactly,
it depends on the order of insertion of keys. If you are
clever, you can fill up a BTree so every block is exactly
full .. however you don't need to be clever to get the worst
possible BTree -- just fill an empty tree with sorted
data and almost all the blocks are guaranteed to be half
full (the worst possible case for storage use and
access time). Yet, this is a common case in practice.
** all the blocks will be half full except those on
the right edge of the tree -- for a tree of depth 5
that's millions of half full blocks, and only 5 that 
can possibly be more than half full :)

And in between, blocks can be filled to different extents.
Rebalancing adjusts this. 

The basic algorithm you describe has MANY tweaks which
attempt to rebalance the tree.

The version I played with did 'scrolling', where
instead of splitting an overfull block, you scroll
a key (and child) across to one of its siblings 
via the parent. This is quite tricky ..  but it
fixes the sorted insertion problem nicely --
with this modification, all the blocks are
always FULL .. except those on the right most
edge and their left siblings: they grow until
they're both full, then the rightmost one (only)
is split. So at worst, you waste 5 blocks
out of millions.. basically this doubles the 
capacity of the tree (built from a sorted list).

The same tweak can be used on random insertions
the same way, however the extra reads and write
may or may not be worth it, depending on how
valuable your disk storage is (etc).

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
       [not found] ` <437CD0E5.8080503@yahoo.fr>
@ 2005-11-17 20:02   ` Oliver Bandel
       [not found]     ` <437CE8EC.1070109@yahoo.fr>
  0 siblings, 1 reply; 30+ messages in thread
From: Oliver Bandel @ 2005-11-17 20:02 UTC (permalink / raw)
  To: caml-list

On Thu, Nov 17, 2005 at 07:50:13PM +0100, sejourne_kevin wrote:
> Oliver Bandel a écrit :
> >Any hints here?
> >(Maybe using OCaml, but the imperative features of it would help,
> > if the functional features would be too slow?)
> >
> >Any hint on algorithms/datastructures for this would be fine...
> >
> 
> In my work we use "Lucene" (apache.org). Lucene is in java but have been 
> re-coded in few other langage (c++,python...), the spec of the infex 
> format is availble for free (online).
> 
> I think the real problem for 10^x (x > 6) docs is the size of the index, 
> not really the speed of the answer(not for lucene (fast and small)).

The size of the index as well as creating time of it.
If it's not updateble and must be created new for every
update, than this needs to much time.

If updates are cheap, then thats the right thing.

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
       [not found] ` <437BD5F5.6010307@1969web.com>
@ 2005-11-17 20:10   ` Oliver Bandel
  0 siblings, 0 replies; 30+ messages in thread
From: Oliver Bandel @ 2005-11-17 20:10 UTC (permalink / raw)
  To: caml-list

On Wed, Nov 16, 2005 at 04:59:33PM -0800, Karl Zilles wrote:
> Oliver Bandel wrote:
> >I'm looking for indexing algorithms and especially - if
> >such a thing exists - mergeable/extendable indexing algorithms.
> >
> >So, say we have 10^6 texts that we want ot have an index for,
> >to retrieve the text according to some parts of the text
> >(keywords, substrings,...).
> >We want to make an index from these texts.
> >
> >After a while we get 10^5 new texts and want to extend
> >the exisiting index, so that the whole index not necessarily
> >must be created again, with the indexer-tool running on
> >all files (^10^6 + 10^5) again, but only have to index the new files,
> >but the big index can be extended with additional smaller indizes.
> >
> >Is there something like that already existing?
> >Or must the new index be created on all files again,
> >or must there be a workaround with the big and a small index-file,
> >where handling of both would be a solution we must provide by ourselfes?
> 
> I wrote a text indexing system a while ago in C++.  Pardon me if none of 
> the following is of interest:
[...]
 
> The B* file gets big, but the locations file gets huge.  Using this 
> methodology, you only ever modify the B* file, and never have to 
> reprocess your indexed documents.  Also you can continue to search the 
> index while documents are being indexed.

Wow, that is fine! :)

But mmap for example (you mentioned better/newer implementations)
can't be used for the whole index, because it definitely
will be larger than the available memory. So mmap and
such techniques can only be used for
parts of the index (but it may be a good choice to use it).

The update time was critical, but with indexes like you mentioned,
where reading while updating is possible (and when updating
is fast), this is very fine. :)


Best Regards, and Ciao,
                   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
       [not found]     ` <437CE8EC.1070109@yahoo.fr>
@ 2005-11-17 20:41       ` Oliver Bandel
  2005-11-18 15:06         ` Florian Hars
  0 siblings, 1 reply; 30+ messages in thread
From: Oliver Bandel @ 2005-11-17 20:41 UTC (permalink / raw)
  To: caml-list

On Thu, Nov 17, 2005 at 09:32:44PM +0100, sejourne_kevin wrote:
> Oliver Bandel a écrit :
> >If updates are cheap, then thats the right thing.
> About 1Go an Hour, not depend of the number of docs.

what is  "1Go"?

The update/addition of 10^5 entries in 10^6 entries
must be ready in about 20 minutes (maximum; faster is better).

Creating the first/main index can be hours, that's not
the problem, but updates must be fast.

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 12:39     ` Florian Weimer
@ 2005-11-17 20:57       ` Oliver Bandel
  2005-11-17 22:02         ` Florian Weimer
  0 siblings, 1 reply; 30+ messages in thread
From: Oliver Bandel @ 2005-11-17 20:57 UTC (permalink / raw)
  To: caml-list

On Thu, Nov 17, 2005 at 01:39:43PM +0100, Florian Weimer wrote:
> * Oliver Bandel:
> 
> > I found an interesting paper there, about using updatable indexing
> > ("Fast Incremental Indexing for Full-Text Information Retrieval"
> > from Brown/Callen/Croft.) They talked about "inverted lists", and
> > this together with other hints from this list may be a good starter.
> 
> If you need a full-text search capability on natural language
> documents, there are various libraries you could reuse: Xapian, Lucene
> (although this is one is writtein Java, IIRC), Estraier, OpenFTS, a
> MySQL component, and probably many more.

constraint: have to use PostgreSQL.

Does it have fulltext search?

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 20:57       ` Oliver Bandel
@ 2005-11-17 22:02         ` Florian Weimer
  0 siblings, 0 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-17 22:02 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

* Oliver Bandel:

>> If you need a full-text search capability on natural language
>> documents, there are various libraries you could reuse: Xapian, Lucene
>> (although this is one is writtein Java, IIRC), Estraier, OpenFTS, a
>> MySQL component, and probably many more.
>
> constraint: have to use PostgreSQL.
>
> Does it have fulltext search?

OpenFTS is based on PostgreSQL and its tsearch2 contrib module.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 18:57         ` skaller
@ 2005-11-17 22:15           ` Brian Hurt
  2005-11-18  1:49             ` skaller
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Hurt @ 2005-11-17 22:15 UTC (permalink / raw)
  To: skaller; +Cc: Oliver Bandel, caml-list



On Fri, 18 Nov 2005, skaller wrote:

> On Thu, 2005-11-17 at 12:08 -0600, Brian Hurt wrote:
>
>>> I'm not sure what it is we disagree on.
>>
>> They don't ever need global rebalancing.
>
> Yup, they do not need it to maintain the invariant
> you stated. But performance can still benefit from it
> significantly.
>
> Two quite distinct BTrees can contain the same data exactly,
> it depends on the order of insertion of keys. If you are
> clever, you can fill up a BTree so every block is exactly
> full .. however you don't need to be clever to get the worst
> possible BTree -- just fill an empty tree with sorted
> data and almost all the blocks are guaranteed to be half
> full (the worst possible case for storage use and
> access time). Yet, this is a common case in practice.
> ** all the blocks will be half full except those on
> the right edge of the tree -- for a tree of depth 5
> that's millions of half full blocks, and only 5 that
> can possibly be more than half full :)

This is the worst possible case- that each block is half full.  Which 
means that instead of log_k(N) blocks, you're having to touch log_{k/2}(N) 
blocks.  This means that if N=2^32 and k=256, that you need to read 5 
blocks instead of 4 (128^5 = 2^35).  And the number of blocks you need has 
about doubled.  Also note that the binary search per block is now cheaper 
(by one step), and the cost of inserting elements is half.

So the question becomes: is the performance advantage gained by 
rebalancing worth the cost?

If I was worried about it, I'd be inclined to be more agressive on merging 
and splitting nodes.  Basically, if the node is under 5/8th full, I'd look 
to steal some children from siblings.  If the node is over 7/8th full, I'd 
look to share some child with siblings.  Note that if you have three nodes 
each 1/2 full, you can combine the three into two nodes, each 3/4th full. 
You want to keep nodes about 3/4th full, as that makes it cheaper to add 
and delete elements.

> The version I played with did 'scrolling', where
> instead of splitting an overfull block, you scroll
> a key (and child) across to one of its siblings
> via the parent. This is quite tricky ..  but it
> fixes the sorted insertion problem nicely --
> with this modification, all the blocks are
> always FULL .. except those on the right most
> edge and their left siblings: they grow until
> they're both full, then the rightmost one (only)
> is split. So at worst, you waste 5 blocks
> out of millions.. basically this doubles the
> capacity of the tree (built from a sorted list).

Two problems with this: first, what happens when the sibling is full too, 
you can get into a case where an insert is O(N) cost, and second, this is 
assuming inserts only (I can still get to worst-case with deletes).

Brian


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 22:15           ` Brian Hurt
@ 2005-11-18  1:49             ` skaller
  0 siblings, 0 replies; 30+ messages in thread
From: skaller @ 2005-11-18  1:49 UTC (permalink / raw)
  To: Brian Hurt; +Cc: Oliver Bandel, caml-list

On Thu, 2005-11-17 at 16:15 -0600, Brian Hurt wrote:

> 
> This is the worst possible case- that each block is half full.  Which 
> means that instead of log_k(N) blocks, you're having to touch log_{k/2}(N) 
> blocks.  This means that if N=2^32 and k=256, that you need to read 5 
> blocks instead of 4 (128^5 = 2^35).  And the number of blocks you need has 
> about doubled.  Also note that the binary search per block is now cheaper 
> (by one step), and the cost of inserting elements is half.
> 
> So the question becomes: is the performance advantage gained by 
> rebalancing worth the cost?

Yes, that's the question. And there is no single answer :)

Note, it is not 5 reads instead of 4, it is 3 reads instead of 2
(assuming the first two levels are cached).

A BTree system I used once was fixed at 3 levels. So it could
be kind of critical :)

> If I was worried about it, I'd be inclined to be more agressive on merging 
> and splitting nodes.  Basically, if the node is under 5/8th full, I'd look 
> to steal some children from siblings.  If the node is over 7/8th full, I'd 
> look to share some child with siblings.  Note that if you have three nodes 
> each 1/2 full, you can combine the three into two nodes, each 3/4th full. 
> You want to keep nodes about 3/4th full, as that makes it cheaper to add 
> and delete elements.

Yup. There are lots of possible tweaks :)

> Two problems with this: first, what happens when the sibling is full too, 
> you can get into a case where an insert is O(N) cost, and second, this is 
> assuming inserts only (I can still get to worst-case with deletes).

Depends precisely on the algorithm -- mine only looked once.
If the sibling was full, you just split as usual. Its a cheap 
hack :)

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 14:54   ` Jonathan Bryant
@ 2005-11-18 14:22     ` Oliver Bandel
  2005-11-18 14:37       ` Florian Weimer
  2005-11-18 14:45     ` Florian Weimer
  1 sibling, 1 reply; 30+ messages in thread
From: Oliver Bandel @ 2005-11-18 14:22 UTC (permalink / raw)
  To: caml-list

On Fri, Nov 18, 2005 at 09:54:09AM -0500, Jonathan Bryant wrote:
> On Thu, 2005-11-17 at 06:49, Florian Weimer wrote:
> 
> > Plenty.  Berkeley DB, SQLite, full-blown SQL database servers like
> > PostgreSQL or MySQL.  The list is pretty long.
> > 
> 
> After a quick search that only returned one dead project (Alpha, v0.0.1,
> last updated 2002), I have to ask: are there any OCaml bindings for
> BerkeleyDB?

Isn't Dbm-module that thing?
IMHO it is.

But there only string -> string  can be used.
So your key as well as value has to be string-type.

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 14:22     ` Oliver Bandel
@ 2005-11-18 14:37       ` Florian Weimer
  2005-11-18 15:05         ` Thomas Fischbacher
  2005-11-18 16:13         ` Oliver Bandel
  0 siblings, 2 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-18 14:37 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

* Oliver Bandel:

>> After a quick search that only returned one dead project (Alpha, v0.0.1,
>> last updated 2002), I have to ask: are there any OCaml bindings for
>> BerkeleyDB?
>
> Isn't Dbm-module that thing?

No, it isn't.  Berkeley DB offers much richer functionality, including
transactions and replication.

> But there only string -> string  can be used.
> So your key as well as value has to be string-type.

You have to use some serialization/deserialization code.  The same
issue arises if you use an SQL database.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 14:54   ` Jonathan Bryant
  2005-11-18 14:22     ` Oliver Bandel
@ 2005-11-18 14:45     ` Florian Weimer
  1 sibling, 0 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-18 14:45 UTC (permalink / raw)
  To: jtbryant; +Cc: caml-list

* Jonathan Bryant:

> After a quick search that only returned one dead project (Alpha, v0.0.1,
> last updated 2002), I have to ask: are there any OCaml bindings for
> BerkeleyDB?

Lex Stein maintains a binding: <http://www.eecs.harvard.edu/~stein/>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 11:49 ` Florian Weimer
  2005-11-17 13:55   ` Richard Jones
@ 2005-11-18 14:54   ` Jonathan Bryant
  2005-11-18 14:22     ` Oliver Bandel
  2005-11-18 14:45     ` Florian Weimer
  1 sibling, 2 replies; 30+ messages in thread
From: Jonathan Bryant @ 2005-11-18 14:54 UTC (permalink / raw)
  To: caml-list

On Thu, 2005-11-17 at 06:49, Florian Weimer wrote:

> Plenty.  Berkeley DB, SQLite, full-blown SQL database servers like
> PostgreSQL or MySQL.  The list is pretty long.
> 

After a quick search that only returned one dead project (Alpha, v0.0.1,
last updated 2002), I have to ask: are there any OCaml bindings for
BerkeleyDB?

--Jonathan Bryant
  jtbryant@valdosta.edu
  Student Intern
  Unix System Operations
  VSU Information Technology

"Das Leben ohne Music ist einfach ein Irrtum, eine Strapaze, ein" Exil."
("Life without music is simply an error, a pain, an exile.")
--Frederich Nietzsche

"The three cardinal values of a programmer are laziness, impatience, and
hubris."
--Perl Man Page




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 14:37       ` Florian Weimer
@ 2005-11-18 15:05         ` Thomas Fischbacher
  2005-11-18 15:14           ` Florian Weimer
  2005-11-18 16:13         ` Oliver Bandel
  1 sibling, 1 reply; 30+ messages in thread
From: Thomas Fischbacher @ 2005-11-18 15:05 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Oliver Bandel, caml-list


On Fri, 18 Nov 2005, Florian Weimer wrote:

> >> After a quick search that only returned one dead project (Alpha, v0.0.1,
> >> last updated 2002), I have to ask: are there any OCaml bindings for
> >> BerkeleyDB?
> >
> > Isn't Dbm-module that thing?
> 
> No, it isn't.  Berkeley DB offers much richer functionality, including
> transactions and replication.

However, crashes at unfortunate moments may lead to database corruption, 
plus the license is a bit strange.

If you can live with that, however, it's a pretty neat thing.

-- 
regards,               tf@cip.physik.uni-muenchen.de              (o_
 Thomas Fischbacher -  http://www.cip.physik.uni-muenchen.de/~tf  //\
(lambda (n) ((lambda (p q r) (p p q r)) (lambda (g x y)           V_/_
(if (= x 0) y (g g (- x 1) (* x y)))) n 1))                  (Debian GNU)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-17 20:41       ` Oliver Bandel
@ 2005-11-18 15:06         ` Florian Hars
  0 siblings, 0 replies; 30+ messages in thread
From: Florian Hars @ 2005-11-18 15:06 UTC (permalink / raw)
  To: Oliver Bandel; +Cc: caml-list

Oliver Bandel wrote:
> what is  "1Go"?

One Giga Octet == 1GB (this *is* a french language list, after all :-)).

Yours, Florian.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 15:05         ` Thomas Fischbacher
@ 2005-11-18 15:14           ` Florian Weimer
  2005-11-18 16:03             ` Thomas Fischbacher
  2005-11-18 20:01             ` Gerd Stolpmann
  0 siblings, 2 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-18 15:14 UTC (permalink / raw)
  To: Thomas Fischbacher; +Cc: Oliver Bandel, caml-list

* Thomas Fischbacher:

>> No, it isn't.  Berkeley DB offers much richer functionality, including
>> transactions and replication.
>
> However, crashes at unfortunate moments may lead to database
> corruption,

All my crashes could be traced back to operator error and faulty
hardware.  Berkeley DB has a very bad reputation when it comes to data
integrity and durability, but in my experience, this is simply not
true.

> plus the license is a bit strange.

It's a very simple, GPL-compatible copyleft license.  Pretty standard.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 15:14           ` Florian Weimer
@ 2005-11-18 16:03             ` Thomas Fischbacher
  2005-11-18 20:03               ` Gerd Stolpmann
  2005-11-18 20:01             ` Gerd Stolpmann
  1 sibling, 1 reply; 30+ messages in thread
From: Thomas Fischbacher @ 2005-11-18 16:03 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Oliver Bandel, caml-list


On Fri, 18 Nov 2005, Florian Weimer wrote:

> >> No, it isn't.  Berkeley DB offers much richer functionality, including
> >> transactions and replication.
> >
> > However, crashes at unfortunate moments may lead to database
> > corruption,
> 
> All my crashes could be traced back to operator error and faulty
> hardware.  Berkeley DB has a very bad reputation when it comes to data
> integrity and durability, but in my experience, this is simply not
> true.

I experienced data loss in the past - twice in eight years of 
running it under considerable load. Both times, we had good recent 
backups, so this was not a big issue.

> > plus the license is a bit strange.
> 
> It's a very simple, GPL-compatible copyleft license.  Pretty standard.

Hm, I just had a look and you are right. I thought I'd once read something 
about it being free only for single-server installations or non-commercial 
use, but I can't find that in the Debian package's copyright. Strange.

-- 
regards,               tf@cip.physik.uni-muenchen.de              (o_
 Thomas Fischbacher -  http://www.cip.physik.uni-muenchen.de/~tf  //\
(lambda (n) ((lambda (p q r) (p p q r)) (lambda (g x y)           V_/_
(if (= x 0) y (g g (- x 1) (* x y)))) n 1))                  (Debian GNU)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 14:37       ` Florian Weimer
  2005-11-18 15:05         ` Thomas Fischbacher
@ 2005-11-18 16:13         ` Oliver Bandel
  1 sibling, 0 replies; 30+ messages in thread
From: Oliver Bandel @ 2005-11-18 16:13 UTC (permalink / raw)
  To: caml-list

On Fri, Nov 18, 2005 at 03:37:39PM +0100, Florian Weimer wrote:
> * Oliver Bandel:
> 
> >> After a quick search that only returned one dead project (Alpha, v0.0.1,
> >> last updated 2002), I have to ask: are there any OCaml bindings for
> >> BerkeleyDB?
> >
> > Isn't Dbm-module that thing?
> 
> No, it isn't.  Berkeley DB offers much richer functionality, including
> transactions and replication.
> 
> > But there only string -> string  can be used.
> > So your key as well as value has to be string-type.
> 
> You have to use some serialization/deserialization code.  The same
> issue arises if you use an SQL database.

Maybe rewriting from scratch would make sense.
Then I neither need a SQL-database nor writing
interface code.

If the Dbm is fast enbough and can handle such big
amounts of data/files efficiently enough, I would
write my own indexer then, using the mentioned
idea of inverted lists (but using a think like
inverted trees).

Ciao,
   Oliver


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 15:14           ` Florian Weimer
  2005-11-18 16:03             ` Thomas Fischbacher
@ 2005-11-18 20:01             ` Gerd Stolpmann
  2005-11-18 21:12               ` Florian Weimer
  1 sibling, 1 reply; 30+ messages in thread
From: Gerd Stolpmann @ 2005-11-18 20:01 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Thomas Fischbacher, Oliver Bandel, caml-list

Am Freitag, den 18.11.2005, 16:14 +0100 schrieb Florian Weimer:
> * Thomas Fischbacher:
> 
> >> No, it isn't.  Berkeley DB offers much richer functionality, including
> >> transactions and replication.
> >
> > However, crashes at unfortunate moments may lead to database
> > corruption,
> 
> All my crashes could be traced back to operator error and faulty
> hardware.  Berkeley DB has a very bad reputation when it comes to data
> integrity and durability, but in my experience, this is simply not
> true.

I think it is true.

A good database must not become corrupt even if the hardware crashes. It
must be possible to simply undo the last operation after a crash, and
the integrity must not suffer from that. Although BDB promises this
stability, I had to throw away a number of databases after hardware
crashes. There are simply errors in it, and I cannot recommend it
anymore.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Telefon: 06151/153855                  Telefax: 06151/997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 16:03             ` Thomas Fischbacher
@ 2005-11-18 20:03               ` Gerd Stolpmann
  0 siblings, 0 replies; 30+ messages in thread
From: Gerd Stolpmann @ 2005-11-18 20:03 UTC (permalink / raw)
  To: Thomas Fischbacher; +Cc: Florian Weimer, Oliver Bandel, caml-list

Am Freitag, den 18.11.2005, 17:03 +0100 schrieb Thomas Fischbacher:
> Hm, I just had a look and you are right. I thought I'd once read something 
> about it being free only for single-server installations or non-commercial 
> use, but I can't find that in the Debian package's copyright. Strange.

The license was changed some time ago. I can remember the initial
commercial BDB release (i.e. BDB >= 2.0) was GPL-incompatible.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Telefon: 06151/153855                  Telefax: 06151/997714
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Caml-list] [1/2 OT] Indexing (and mergeable Index-algorithms)
  2005-11-18 20:01             ` Gerd Stolpmann
@ 2005-11-18 21:12               ` Florian Weimer
  0 siblings, 0 replies; 30+ messages in thread
From: Florian Weimer @ 2005-11-18 21:12 UTC (permalink / raw)
  To: Gerd Stolpmann; +Cc: Thomas Fischbacher, Oliver Bandel, caml-list

* Gerd Stolpmann:

>> All my crashes could be traced back to operator error and faulty
>> hardware.  Berkeley DB has a very bad reputation when it comes to data
>> integrity and durability, but in my experience, this is simply not
>> true.
>
> I think it is true.
>
> A good database must not become corrupt even if the hardware crashes.

It depends on the crash.  In my case, hardware flipped bits randomly
under high load, eventually leading to a kernel crash.  This is hard
to correct in software.  Generally, it's better to fix the broken
hardware.

I haven't lost data due to simple crashes yet (power outages,
crash-early kernel bugs, or MCEs because of thermal issues -- knock on
wood).


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2005-11-18 21:12 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-11-16 23:42 [1/2 OT] Indexing (and mergeable Index-algorithms) Oliver Bandel
2005-11-17  8:15 ` [Caml-list] " skaller
2005-11-17 15:09   ` Brian Hurt
2005-11-17 17:31     ` skaller
2005-11-17 18:08       ` Brian Hurt
2005-11-17 18:57         ` skaller
2005-11-17 22:15           ` Brian Hurt
2005-11-18  1:49             ` skaller
2005-11-17  8:35 ` Florian Hars
2005-11-17  9:24   ` Oliver Bandel
2005-11-17 12:39     ` Florian Weimer
2005-11-17 20:57       ` Oliver Bandel
2005-11-17 22:02         ` Florian Weimer
2005-11-17 11:49 ` Florian Weimer
2005-11-17 13:55   ` Richard Jones
2005-11-18 14:54   ` Jonathan Bryant
2005-11-18 14:22     ` Oliver Bandel
2005-11-18 14:37       ` Florian Weimer
2005-11-18 15:05         ` Thomas Fischbacher
2005-11-18 15:14           ` Florian Weimer
2005-11-18 16:03             ` Thomas Fischbacher
2005-11-18 20:03               ` Gerd Stolpmann
2005-11-18 20:01             ` Gerd Stolpmann
2005-11-18 21:12               ` Florian Weimer
2005-11-18 16:13         ` Oliver Bandel
2005-11-18 14:45     ` Florian Weimer
     [not found] ` <437CD0E5.8080503@yahoo.fr>
2005-11-17 20:02   ` Oliver Bandel
     [not found]     ` <437CE8EC.1070109@yahoo.fr>
2005-11-17 20:41       ` Oliver Bandel
2005-11-18 15:06         ` Florian Hars
     [not found] ` <437BD5F5.6010307@1969web.com>
2005-11-17 20:10   ` Oliver Bandel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).