9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] General Question: 9fans.mbox archive and problem solving w/computers
From: Gregory Pavelcak @ 2007-07-06 15:11 UTC (permalink / raw)
  To: 9fans

I've been thinking about a task that is probably fairly simple,
but it seems a bit overwhelming. Suppose you want a searchable
archive of 9fans and all you have to start with is this single
150 MB file of ~46,000 messages. What do you do? The
answer isn't obvious to me.

Greg

P.S. I'm prepared to rely on existing searchable archives to
do this in real life. It's just that in the few minutes' thought I've
given to this problem, I've become much more impressed
with those existing searchable archives and their very
quick responses.



* Re: [9fans] General Question: 9fans.mbox archive and problem solving w/computers
From: erik quanstrom @ 2007-07-06 16:01 UTC (permalink / raw)
  To: 9fans

most machines these days have 10x that much memory.  it should
be speedy enough to use strstr(2) once you've loaded them into
memory.   and even loading them into memory should take no
more than a few seconds at 80MB/s.
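
For scale: 150 MB read at 80 MB/s is roughly two seconds. A minimal Python sketch of the brute-force approach, using the `in` operator on bytes in place of strstr. The function names `split_mbox` and `search` are made up for illustration, and it assumes classic mbox framing where every message begins with a "From " line (body lines starting with "From " are normally escaped as ">From ").

```python
import re

def split_mbox(data):
    """Split raw mbox bytes into one bytes object per message."""
    # Assumption: each message starts with a "From " separator
    # at the beginning of a line.
    starts = [m.start() for m in re.finditer(rb"^From ", data, flags=re.M)]
    ends = starts[1:] + [len(data)]
    return [data[a:b] for a, b in zip(starts, ends)]

def search(messages, needle):
    """Return the numbers of all messages containing needle (bytes)."""
    return [n for n, msg in enumerate(messages) if needle in msg]
```

With the whole archive read into `data` in one go, `search(split_mbox(data), b"strstr")` lists the matching message numbers. Nothing is indexed; every query scans all of the text.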

a more elegant solution would be to reduce each document to
a set of stemmed words, enumerate the set of all stems in all
documents and create a bit array mapping stems to message #.
but that seems like too much work for only 150MB.
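
A rough Python sketch of the stem-to-bit-array idea, under stated assumptions: `stem` is a crude suffix stripper standing in for a real stemmer, each Python integer serves as the bit array for one stem (bit n set means message n contains it), and the names `build_index` and `query` are made up for illustration.

```python
import re
from collections import defaultdict

def stem(word):
    # Crude stand-in for a real stemming algorithm: strip a few
    # common suffixes, keeping at least three characters.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    """Map each stem to an integer whose bit n is set if doc n has it."""
    index = defaultdict(int)
    for n, text in enumerate(docs):
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[stem(word)] |= 1 << n
    return index

def query(index, words, ndocs):
    """Return the doc numbers containing every query word (by stem)."""
    bits = (1 << ndocs) - 1          # start with all docs selected
    for w in words:
        bits &= index.get(stem(w.lower()), 0)
    return [n for n in range(ndocs) if bits >> n & 1]
```

A multi-word query is then just a bitwise AND of a few bit arrays, which is why lookups stay fast no matter how large the text is. The cost is the up-front pass to build the index.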

- erik

