From: Geoff Collyer <geoff@collyer.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] full text search
Date: Fri, 1 Nov 2002 16:02:54 -0800
Message-ID: <cd5514cf296d6af5b21df715e18c17a3@collyer.net>
I've been meaning to port lq-text. It's relatively small and claimed
to be 8-bit-clean.
In the meantime, I use the scripts below, written in part to learn
more about the problem.
I index my mail each night from cron:
0 7 * * 0 cpu cd mbox && ftindex * */* >/n/other/index/mail
I can then search all 95MB of it fairly quickly:
: alpha; time ftlook -i $h/oth/index/mail boyd alpha firmware
2002oct25
9.1998
arch/9fans.2001jul11
arch/9fans.2001mar28
arch/9fans.2001nov16
arch/9fans.2001oct25
arch/9fans.2002aug11
arch/9fans.2002may13
arch/9fans.2002may27
arch/9fans.2002oct25
arch/9fans.2002sep18
0.05u 0.07s 0.26r ftlook -i /usr/geoff/oth/index/mail boyd alpha ...
The named files contain all three words, ignoring case distinctions.
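To make the "all three words" behaviour concrete, here is a minimal sketch of the same conjunctive lookup in POSIX sh and awk. It is not the ftlook script itself: it scans the whole index rather than using look(1), and the function name all_words, the sample index and its file names are all hypothetical. It assumes only the three-column "word file line" layout that ftindex prints.

```shell
#!/bin/sh
# all_words index word...: print each file whose index lines cover every word.
# Assumes the "word file line" layout that ftindex prints; the name all_words
# is made up for this sketch.
all_words() {
	index=$1
	shift
	awk -v want="$*" '
	BEGIN {
		# Record the distinct words we must find, folded to lower case.
		n = split(tolower(want), w, " ")
		for (i = 1; i <= n; i++)
			if (!(w[i] in need)) {
				need[w[i]] = 1
				nwords++
			}
	}
	# Count each wanted word once per file, however many lines mention it.
	(tolower($1) in need) && !seen[$2, tolower($1)]++ { hits[$2]++ }
	END {
		for (f in hits)
			if (hits[f] == nwords)
				print f
	}' "$index" | LC_ALL=C sort
}

# Hypothetical sample index: only inbox/one contains all three words.
idx=$(mktemp)
cat >"$idx" <<'EOF'
alpha inbox/one 3
alpha inbox/two 9
boyd inbox/one 12
boyd inbox/three 1
firmware inbox/one 40
firmware inbox/two 2
EOF
all_words "$idx" boyd alpha firmware
rm -f "$idx"
```

With the sample index above this prints inbox/one only. A full scan like this is slower than the binary search look(1) does on a sorted index, so treat it as an illustration of the matching rule rather than a replacement.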
ftlookword is specialised to use my mail index file and shows matching
lines with line numbers:
: alpha; time ftlookword mmc
9.1998:14172: in case of problems. the device name to use for the new yamahas is 'mmc'
9.1998:14175: 1) by default for now, an mmc device is in test-write mode. this means that
9.1998:14180: it should be followed by a message saying "mmcgettoc: blank disc".
9.1998:14212: scsi support didn't include the scsi-3 mmc command set).
9.1998:14333: when i changed it to use the mmc cdda read command rather
9.1998:14678: for mmc devices.
arch/9fans.2001aug29:228: mmc.c does have this:
arch/9fans.2001mar28:14272: hp scsi mmc optical jukeboxes (now in 9gb disk size!) are easily
arch/9fans.2001mar28:14383: > money, hp scsi mmc optical jukeboxes (now
arch/9fans.2001mar28:16889: boundary="upas-mnfzmmctbofdpmrurgpojpeugo"
arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16919: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16983: --upas-mnfzmmctbofdpmrurgpojpeugo--
arch/9fans.2002jul1:22387: diff /n/d/acme/bin/source/acd//mmc.c ./mmc.c=0A=
arch/9fans.2002oct25:11409: i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W
optical:819: scsi support didn't include the scsi-3 mmc command set).
0.62u 0.21s 2.85r ftlookword mmc
Note that this matched upas mime boundary lines; apparently I need to
fine-tune the definition of `word'.
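One possible tightening, sketched here in POSIX sh and awk rather than rc, is to post-filter the grep -n output so that a line survives only if the term appears bounded by non-alphanumeric characters or by the ends of the line. The function name whole_word and that boundary rule are assumptions for illustration; a different definition of `word' (for instance one that also treats - and _ as word characters) would need a different test.

```shell
#!/bin/sh
# whole_word term: filter grep -n style records (file:line:text) on stdin,
# keeping those where term is bounded by non-alphanumerics or the line edges.
# The name and the boundary rule are assumptions made for this sketch.
whole_word() {
	awk -v term="$1" '
	BEGIN { t = tolower(term) }
	length(t) > 0 {
		# Strip the file:line: prefix so file names cannot match, fold
		# case, and pad with blanks so line edges count as boundaries.
		text = $0
		sub(/^[^:]*:[0-9]+:/, "", text)
		text = " " tolower(text) " "
		off = 0
		# Examine every occurrence of the term, not just the first.
		while ((i = index(substr(text, off + 1), t)) > 0) {
			p = off + i
			before = substr(text, p - 1, 1)
			after = substr(text, p + length(t), 1)
			if (before !~ /[a-z0-9]/ && after !~ /[a-z0-9]/) {
				print
				next
			}
			off = p
		}
	}'
}

# Two records taken from the output above: the first is kept, the MIME
# boundary line is dropped.
printf '%s\n' \
	'9.1998:14678: for mmc devices.' \
	'arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo' |
	whole_word mmc
```

Under this rule the line mentioning "mmcgettoc" would also be dropped, since the term is followed by a letter there; whether that is desirable depends on how strict the definition should be.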
# To unbundle, run this file
echo ftindex
sed 's/^X//' >ftindex <<'!'
X#!/bin/rc
X# ftindex file... - generate full-text index
X# indices can be combined via `sort -o bigindex -udf index*'
if (~ $#* 0)
X * = /fd/0
X# there's a lot of redundancy in the awk output, so strip duplicates
X# in the output for each input file, then combine the stripped outputs.
X{
X for (f) {
X # limiting line & word length avoids indexing uuencoded &
X # base64-encoded text.
X awk '
X$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
X $0 = tolower($0)
X gsub(/=a0/, " ")
X gsub(/[\/.,:;?!<>()[\]{}*=#%"''~|&^\\]/, " ") # delete most specials
X for (i = 1; i <= NF; i++)
X if (length($i) < 20 &&
X $i ~ /^[a-z\/][a-z0-9\-_\/]*[a-z0-9]$/ &&
X $i !~ /^(x-|message-id:)/)
X print $i, FILENAME, NR
X}' $f |
X sort -udf +0 -2
X }
X} |
X sort -udf +0 -2
!
echo ftlook
sed 's/^X//' >ftlook <<'!'
X#!/bin/rc
X# ftlook [-i index] word... - search full-text index
fn usage {
X echo usage: $1 '[-i index]' word... >[1=2]
X exit usage
X}
idx=.index
if (test $#* -ge 2)
X switch ($1) {
X case -i
X idx=$2
X shift 2
X case -*
X usage $0
X }
if (test $#* -lt 1)
X usage $0
X{
X for (arg) {
X echo $arg
X look -df -t ' ' $arg^' ' $idx
X }
X} |
X sort +1 |
X awk '
function delwd(wd) { delete fileword[wd] }
function prmatch(w) {
X any = 0
X for (w in fileword)
X any++
X if (any == 0 && lastf != "")
X print lastf
X}
BEGIN { lastf="" }
NF == 1 { word[$1] = $1; fileword[$1] = $1; next } # a word we must match
NF != 3 { print "badly formed index line: " $0 | "cat >[1=2]"; next }
X$2 == lastf { delwd($1); next } # same old filename
X{
X prmatch()
X lastf = $2
X for (w in fileword)
X delete fileword[w] # empty fileword
X for (w in word)
X fileword[w] = word[w] # copy word to fileword
X delwd($1)
X}
END { prmatch() }
X'
!
echo ftlookword
sed 's/^X//' >ftlookword <<'!'
X#!/bin/rc
cd $h/mbox
exec grep -n $1 `{ftlook -i /n/other/index/mail $1} /dev/null
!
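For anyone trying the indexing idea outside Plan 9, the following is a rough POSIX sh sketch of the word-extraction step, not a port of ftindex. Two things differ and are assumptions of the sketch: the obsolete `+0 -2' sort keys are written as `-k1,2', which selects the same first two fields, and the list of specials is simplified so that every character outside a-z, 0-9, _, / and - becomes a blank. That simplification splits tokens such as addresses into separate words, which the original filter would reject instead. The function name index_words is hypothetical, and unlike ftindex it does not default to standard input.

```shell
#!/bin/sh
# index_words file...: print "word file line" triples, one line per distinct
# (word, file) pair. A simplified sketch, not a port of ftindex.
index_words() {
	for f in "$@"; do
		awk '
		# Skip empty lines and long lines without internal blanks, which
		# are likely uuencoded or base64 text.
		$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
			$0 = tolower($0)
			# Simplification: blank out everything but word characters.
			gsub("[^a-z0-9_/-]", " ")
			for (i = 1; i <= NF; i++)
				if (length($i) < 20 &&
				    $i ~ "^[a-z/][a-z0-9_/-]*[a-z0-9]$")
					print $i, FILENAME, NR
		}' "$f" | LC_ALL=C sort -udf -k1,2
	done | LC_ALL=C sort -udf -k1,2
}

# Hypothetical usage on a one-line sample file.
tmp=$(mktemp)
printf 'Hello, world! hello\n' >"$tmp"
index_words "$tmp"
rm -f "$tmp"
```

The sample prints one line for hello and one for world, each followed by the temporary file's name and the line number 1. Because -u compares only the two key fields, just one line number is kept per word and file, as in the original scripts.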