9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed
From: quanstro@quanstro.net (erik quanstrom)
To: 9fans@cse.psu.edu, rog@vitanuova.com
Subject: Re: [9fans] Scaleable mail repositories.
Date: Tue,  8 Nov 2005 21:32:54 -0600	[thread overview]
Message-ID: <20051109033254.CD8F110F93@dexter-peak.quanstro.net> (raw)
In-Reply-To: <2df4e3af3782344adbb24aae570efef9@vitanuova.com>

i don't think it would be that bad.

in 1996 in my former life as an IBMer we put up 30 million documents 
plus the OpenText web index up on the web. we were using OpenText's
"pat" engine "cpl" (aka "cpl") and some other full-text search software to run the
queries. if you haven't heard of these, don't feel bad. they're pretty unremarkable
and no longer exist. performance absolutely bit. partially becuase the interface
between the driver and the engine was xml. (go figure. tim bray was the big
man at opentext.) but the main reason was that the index was set up for
grep-like searches, doing the matching directly against the (patricia-tree'd) text.

i've come to think that's backwards. i think you should scan the corpus for a 
list if unique stemmed terms. (run running Run Run! all considered the same term)
and assign each term an index number. each document can be represnted as a 
string of unique index numbers which can be indexed using normal techniques.
a search would first convert the terms to index numbers and then do the search.
regular expressions could be applied to the /term/ list*, not the corpus.

you could prototype this with 3 tables (term_tab, doc_term_xref, doc_tab) 
from almost any databaseish thing that allows concurrent updates and queries.

obviously there's some generality lost. (proximity searches, whitespace/newline matching.)
but, i think this would get you 80% of what you would want at 20% of the complexity.

so many things to program, so little time.

- erik "mr vaporware" quanstrom

* my quick-and-dirty term check of my own email archive gives ~33000 terms.
(this is a big overcount.)


; cat */* | \
	tr 'A-Z' 'a-z' | \
	sed 's/[][;:"{}~`!@#$%^&*()+=|\\/?<>,.]/ /g' | \
	grep -v '[^a-z0-9]$' | \
	awk '
{
	for (i=1; i<=NF;i++) {
		l = length($i);
		if (l>1 && l<15)
			A[$i]++
	} 
}

END {
	n=0; 
	for(i in A) {
		n++; 
		printf("%d %s\n", A[i], i);
	}
	print n
}' |wc -l
33558

rog@vitanuova.com writes

| i thought i'd write an external search algorithm - i'm most of the way
| through an extendable hash implementation (which seems simple and
| quick for insertion, but things get more complex when dealing with
| large values, and on deletion; i'm not sure of the best way to deal
| with block allocation; and more seriously, maybe it's essential to
| have an algorithm that can do range (e.g.  prefix) lookups).  any
| elegant (read *small*!), nicely implemented, open source libraries out
| there that might fit the bill?  a good description of an appropriate
| algorithm would do just as well...


  parent reply	other threads:[~2005-11-09  3:32 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-10-29 15:34 [9fans] rfork(RFPROC) and ffork() erik quanstrom
2005-10-29 19:11 ` William Josephson
2005-10-29 19:18 ` Russ Cox
2005-10-29 23:00   ` erik quanstrom
2005-10-29 23:24     ` Francisco J Ballesteros
2005-10-29 23:38     ` Russ Cox
2005-10-30  0:19       ` erik quanstrom
2005-10-30  1:07         ` Russ Cox
2005-10-30  1:15           ` Ronald G Minnich
2005-10-30  1:22             ` geoff
2005-10-30  1:58               ` jmk
2005-10-30  1:54           ` Dave Eckhardt
2005-10-30  2:24           ` erik quanstrom
2005-10-30  2:51             ` geoff
2005-10-30  1:10         ` geoff
2005-10-30  1:18           ` Paul Lalonde
2005-10-30  6:52             ` Skip Tavakkolian
2005-10-30 10:14             ` Francisco J Ballesteros
2005-10-30 15:17             ` Russ Cox
2005-10-30 23:00               ` Dave Eckhardt
2005-10-30 23:14                 ` George Michaelson
2005-10-31  2:15                 ` erik quanstrom
2005-10-31  2:33                   ` geoff
2005-10-31  3:23                   ` Skip Tavakkolian
2005-10-31  4:20                 ` Lyndon Nerenberg
2005-10-31 21:31                   ` Dave Eckhardt
2005-10-31  4:06             ` [9fans] Scaleable mail repositories Lyndon Nerenberg
2005-10-31 10:55               ` C H Forsyth
2005-10-31 12:32                 ` erik quanstrom
2005-11-01 19:56                   ` rog
2005-11-01 22:29                     ` Francisco J Ballesteros
2005-11-08 19:56                       ` rog
2005-11-08 23:22                         ` Joel Salomon
2005-11-09  0:51                         ` Caerwyn Jones
2005-11-09  0:55                           ` Russ Cox
2005-11-09  3:32                         ` erik quanstrom [this message]
2005-10-31 15:30                 ` jmk
2005-10-30  1:10       ` [9fans] rfork(RFPROC) and ffork() William Josephson
2005-10-31 14:48 ` Russ Cox
2005-10-31 11:32 [9fans] Scaleable mail repositories Fco. J. Ballesteros
2005-10-31 16:01 ` Ronald G Minnich
2005-10-31 15:06   ` jmk
2005-10-31 15:14 Fco. J. Ballesteros
2005-10-31 16:22 ` Ronald G Minnich
2005-10-31 18:37   ` William Josephson
2005-10-31 15:19 Fco. J. Ballesteros
2005-10-31 15:33 Fco. J. Ballesteros
2005-10-31 18:38 ` William Josephson
2005-11-09  9:45 Fco. J. Ballesteros
2005-11-09 10:24 ` Charles Forsyth
2005-11-09 14:19   ` Sam
2005-11-10  1:24     ` erik quanstrom
2005-11-10  2:30       ` Russ Cox
2005-11-10  6:33         ` Scott Schwartz
2005-11-10 11:55         ` erik quanstrom

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20051109033254.CD8F110F93@dexter-peak.quanstro.net \
    --to=quanstro@quanstro.net \
    --cc=9fans@cse.psu.edu \
    --cc=rog@vitanuova.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).