To: 9fans@cse.psu.edu, rog@vitanuova.com
References: <2df4e3af3782344adbb24aae570efef9@vitanuova.com>
In-Reply-To: <2df4e3af3782344adbb24aae570efef9@vitanuova.com>
Subject: Re: [9fans] Scaleable mail repositories.
Message-Id: <20051109033254.CD8F110F93@dexter-peak.quanstro.net>
Date: Tue, 8 Nov 2005 21:32:54 -0600
From: quanstro@quanstro.net (erik quanstrom)

i don't think it would be that bad. in 1996, in my former life as an
IBMer, we put 30 million documents plus the OpenText web index up on
the web. we were using OpenText's "pat" engine (aka "cpl") and some
other full-text search software to run the queries. if you haven't
heard of these, don't feel bad. they're pretty unremarkable and no
longer exist.

performance absolutely bit. partially because the interface between
the driver and the engine was xml. (go figure. tim bray was the big
man at opentext.) but the main reason was that the index was set up
for grep-like searches, doing the matching directly against the
(patricia-tree'd) text.

i've come to think that's backwards. i think you should scan the
corpus for a list of unique stemmed terms (run, running, Run, Run!
all considered the same term) and assign each term an index number.
each document can be represented as a string of unique index numbers,
which can be indexed using normal techniques. a search would first
convert the terms to index numbers and then do the search. regular
expressions could be applied to the /term/ list*, not the corpus.

you could prototype this with 3 tables (term_tab, doc_term_xref,
doc_tab) from almost any databaseish thing that allows concurrent
updates and queries.

obviously there's some generality lost (proximity searches,
whitespace/newline matching), but i think this would get you 80% of
what you would want at 20% of the complexity.

so many things to program, so little time.

- erik "mr vaporware" quanstrom

* my quick-and-dirty term check of my own email archive gives ~33000
terms. (this is a big overcount.)

; cat */* | \
	tr 'A-Z' 'a-z' | \
	sed 's/[][;:"{}~`!@#$%^&*()+=|\\/?<>,.]/ /g' | \
	grep -v '[^a-z0-9]$' | \
	awk ' { for (i=1; i<=NF;i++) { l = length($i); if (l>1 && l<15) A[$i]++ } }
	END { n=0; for(i in A) { n++; printf("%d %s\n", A[i], i); } print n }' | wc -l
33558

rog@vitanuova.com writes:

| i thought i'd write an external search algorithm - i'm most of the way
| through an extendable hash implementation (which seems simple and
| quick for insertion, but things get more complex when dealing with
| large values, and on deletion; i'm not sure of the best way to deal
| with block allocation; and more seriously, maybe it's essential to
| have an algorithm that can do range (e.g. prefix) lookups). any
| elegant (read *small*!), nicely implemented, open source libraries out
| there that might fit the bill? a good description of an appropriate
| algorithm would do just as well...
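
to make the term-numbering idea above concrete, here is a rough sketch
in the same awk spirit as the footnote script. it is illustrative
only, not code from any of the setups described: the flat files
term_tab and doc_term_xref just reuse the prototype table names, and
real stemming and the punctuation scrubbing are left out to keep it
short.

awk '
{
	# one pass over the corpus: the first time a lowercased word
	# is seen it gets the next free term number.
	for (i = 1; i <= NF; i++) {
		t = tolower($i)
		if (!(t in id))
			id[t] = ++n
		print FILENAME, id[t]		# doc, term id
	}
}
END {
	for (t in id)
		print id[t], t > "term_tab"	# term id, term
}
' */* | sort -u > doc_term_xref

each document then shows up in doc_term_xref as its set of term
numbers, which can be sorted into posting lists or indexed with the
usual tools.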
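
and the query side of the same sketch: the regular expression is run
against term_tab only (' run' picks up run, running, runs, ...), and
the surviving term ids are then looked up in doc_term_xref to get the
documents. again, just a sketch over the made-up flat files above.

grep ' run' term_tab |
awk '
NR == FNR { want[$1] = 1; next }	# first input (stdin): matching term ids
$2 in want { print $1 }			# doc_term_xref: documents containing them
' - doc_term_xref |
sort | uniq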