From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID:
To: 9fans@cse.psu.edu
Subject: Re: [9fans] full text search
From: Geoff Collyer
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Fri, 1 Nov 2002 16:02:54 -0800
Topicbox-Message-UUID: 1416769a-eacb-11e9-9e20-41e7f4b1d025

I've been meaning to port lq-text. It's relatively small and claimed
to be 8-bit-clean. In the meantime, I use the scripts below, written
in part to learn more about the problem.

I index my mail each night from cron:

0 7 * * 0 cpu cd mbox && ftindex * */* >/n/other/index/mail

I can then search the 95MB of it fairly quickly:

: alpha; time ftlook -i $h/oth/index/mail boyd alpha firmware
2002oct25
9.1998
arch/9fans.2001jul11
arch/9fans.2001mar28
arch/9fans.2001nov16
arch/9fans.2001oct25
arch/9fans.2002aug11
arch/9fans.2002may13
arch/9fans.2002may27
arch/9fans.2002oct25
arch/9fans.2002sep18
0.05u 0.07s 0.26r ftlook -i /usr/geoff/oth/index/mail boyd alpha ...

The named files contain all three words, ignoring case distinctions.

ftlookword is specialised to use my mail index file and shows matching
lines with line numbers:

: alpha; time ftlookword mmc
9.1998:14172: in case of problems. the device name to use for the new yamahas is 'mmc'
9.1998:14175: 1) by default for now, an mmc device is in test-write mode. this means that
9.1998:14180: it should be followed by a message saying "mmcgettoc: blank disc".
9.1998:14212: scsi support didn't include the scsi-3 mmc command set).
9.1998:14333: when i changed it to use the mmc cdda read command rather
9.1998:14678: for mmc devices.
arch/9fans.2001aug29:228: mmc.c does have this:
arch/9fans.2001mar28:14272: hp scsi mmc optical jukeboxes (now in 9gb disk size!) are easily
arch/9fans.2001mar28:14383: > money, hp scsi mmc optical jukeboxes (now
arch/9fans.2001mar28:16889: boundary="upas-mnfzmmctbofdpmrurgpojpeugo"
arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16919: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16983: --upas-mnfzmmctbofdpmrurgpojpeugo--
arch/9fans.2002jul1:22387: diff /n/d/acme/bin/source/acd//mmc.c ./mmc.c=0A=
arch/9fans.2002oct25:11409: i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W
optical:819: scsi support didn't include the scsi-3 mmc command set).
0.62u 0.21s 2.85r ftlookword mmc

Note that this matched upas mime boundary lines; apparently I need to
fine-tune the definition of `word'.

# To unbundle, run this file
echo ftindex
sed 's/^X//' >ftindex <<'!'
X#!/bin/rc
X# index file... - generate full-text index
X# indices can be combined via `sort -o bigindex -udf index*'
if (~ $#* 0)
X	* = /fd/0
X# there's a lot of redundancy in the awk output, so strip duplicates
X# in the output for each input file, then combine the stripped outputs.
X{
X	for (f) {
X		# limiting line & word length avoids indexing uuencoded &
X		# base64-encoded text.
X		awk '
X$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
X	$0 = tolower($0)
X	gsub(/=a0/, " ")
X	gsub(/[\/.,:;?!<>()[\]{}*=#%"''~|&^\\]/, " ")	# delete most specials
X	for (i = 1; i <= NF; i++)
X		if (length($i) < 20 &&
X		    $i ~ /^[a-z\/][a-z0-9\-_\/]*[a-z0-9]$/ &&
X		    $i !~ /^(x-|message-id:)/)
X			print $i, FILENAME, NR
X}' $f |
X			sort -udf +0 -2
X	}
X} |
X	sort -udf +0 -2
!
echo ftlook
sed 's/^X//' >ftlook <<'!'
X#!/bin/rc
X# fulltext [-i index] word... - search full-text index
fn usage {
X	echo usage: $1 '[-i index]' word... >[1=2]
X	exit usage
X}
idx=.index
if (test $#* -ge 2)
X	switch ($1) {
X	case -i
X		idx=$2
X		shift 2
X	case -*
X		usage $0
X	}
if (test $#* -lt 1)
X	usage $0
X{
X	for (arg) {
X		echo $arg
X		look -df -t ' ' $arg^' ' $idx
X	}
X} |
X	sort +1 |
X	awk '
function delwd(wd) { delete fileword[wd] }
function prmatch(w) {
X	any = 0
X	for (w in fileword)
X		any++
X	if (any == 0 && lastf != "")
X		print lastf
X}
BEGIN { lastf="" }
NF == 1 { word[$1] = $1; fileword[$1] = $1; next }	# a word we must match
NF != 3 { print "badly formed index line: " $0 >"cat >[1=2]"; next }
X$2 == lastf { delwd($1); next }	# same old filename
X{
X	prmatch()
X	lastf = $2
X	for (w in fileword)
X		delete fileword[w]	# empty fileword
X	for (w in word)
X		fileword[w] = word[w]	# copy word to fileword
X	delwd($1)
X}
END { prmatch() }
X'
!
echo ftlookword
sed 's/^X//' >ftlookword <<'!'
X#!/bin/rc
cd $h/mbox
exec grep -n $1 `{ftlook -i /n/other/index/mail $1} /dev/null
!
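
For readers without rc and look(1) to hand, the indexing filter and the
all-words match in the scripts above can be sketched in Python. This is
only an illustration under stated assumptions, not a port: the names
index_file and lookup are hypothetical, the index is held in memory as a
set of (word, filename) pairs instead of a sorted text file, and line
numbers are dropped, since the `sort -udf +0 -2' step keeps one entry
per word and file anyway.

```python
import re

# Same character classes as the awk program in ftindex.
SPECIALS = re.compile(r"""[/.,:;?!<>()\[\]{}*=#%"'~|&^\\]""")
WORD = re.compile(r"^[a-z/][a-z0-9\-_/]*[a-z0-9]$")
HAS_BLANK = re.compile(r"[^\t ][\t ][^\t ]")


def index_file(filename, lines):
    """Return the set of (word, filename) pairs for one file (hypothetical helper)."""
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Long lines with no internal blank are probably uuencoded or base64.
        if len(line) >= 40 and not HAS_BLANK.search(line):
            continue
        text = line.lower().replace("=a0", " ")
        text = SPECIALS.sub(" ", text)  # turn most specials into blanks
        for word in text.split():
            if (len(word) < 20 and WORD.match(word)
                    and not word.startswith(("x-", "message-id:"))):
                pairs.add((word, filename))
    return pairs


def lookup(pairs, words):
    """Return the sorted names of files that contain every word, ignoring case."""
    wanted = {w.lower() for w in words}
    seen = {}
    for word, filename in pairs:
        if word in wanted:
            seen.setdefault(filename, set()).add(word)
    return sorted(f for f, found in seen.items() if found == wanted)


if __name__ == "__main__":
    index = index_file("a", ["Boyd flashed the alpha firmware."])
    index |= index_file("b", ["alpha firmware only"])
    print(lookup(index, ["boyd", "alpha", "firmware"]))
```

The awk in ftlook gets the same result by starting each file with the
full set of wanted words and deleting each one as it is seen; a file is
printed only when nothing is left. The sketch collects the words seen
per file and compares against the wanted set, which is equivalent but
needs the matching entries in memory.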