From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1cfba402c4baec643b9a3e345744c255@coraid.com>
From: erik quanstrom
Date: Fri, 6 Jul 2007 12:01:53 -0400
To: 9fans@cse.psu.edu
Subject: Re: [9fans] General Question: 9fans.mbox archive and problem solving w/computers
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: 92720976-ead2-11e9-9d60-3106f5b1d025

most machines these days have 10x that much memory. it should be
speedy enough to use strstr(2) once you've loaded them into memory.
and even loading them into memory should take no more than a few
seconds at 80MB/s.

a more elegant solution would be to reduce each document to a set of
stemmed words, enumerate the set of all stems in all documents and
create a bit array mapping stems to message #. but that seems like
too much work for only 150MB.

- erik
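
a rough sketch of both approaches, for comparison. it is in python rather than c, so the `in` operator stands in for strstr(2). the names (stem, scan, build_index, lookup) and the suffix-stripping rule are made up for illustration; a real stemmer such as porter's would replace stem(). each stem maps to one integer used as a bit array, where bit i set means message i contains that stem.

```python
import re

def stem(word):
    """Crude stand-in for a real stemmer: lowercase, strip one common suffix."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

def stems(text):
    """Reduce a document to its set of stemmed words."""
    return {stem(w) for w in re.findall(r"[A-Za-z0-9]+", text)}

def scan(messages, needle):
    """Brute force: substring search over every message held in memory."""
    return [i for i, m in enumerate(messages) if needle in m]

def build_index(messages):
    """Map each stem to an int used as a bit array over message numbers."""
    index = {}
    for i, m in enumerate(messages):
        for s in stems(m):
            index[s] = index.get(s, 0) | (1 << i)
    return index

def lookup(index, query, nmsgs):
    """Message numbers containing every stem of the query (AND of bit arrays)."""
    bits = (1 << nmsgs) - 1          # start with all messages selected
    for s in stems(query):
        bits &= index.get(s, 0)      # unknown stem clears every bit
    return [i for i in range(nmsgs) if (bits >> i) & 1]
```

usage, with three made-up messages:

```python
msgs = ["the kernel panics when booting",
        "booted fine after the patch",
        "venti stores blocks"]
idx = build_index(msgs)
scan(msgs, "boot")                   # messages 0 and 1
lookup(idx, "booting kernel", 3)     # message 0 only
```

the scan costs a pass over all the text per query; the index costs a pass up front and then one AND per query word, which is the trade-off weighed above.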