Re: [9fans] full text search

9fans - fans of the OS Plan 9 from Bell Labs

From: Geoff Collyer <geoff@collyer.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] full text search
Date: Fri,  1 Nov 2002 16:02:54 -0800
Message-ID: <cd5514cf296d6af5b21df715e18c17a3@collyer.net>

I've been meaning to port lq-text.  It's relatively small and claimed
to be 8-bit-clean.

In the meantime, I use the scripts below, written in part to learn
more about the problem.

I index my mail each night from cron:

	0 7 * * 0	cpu	cd mbox && ftindex * */* >/n/other/index/mail
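
For orientation, each index record is a word, the file it came from,
and a line number, as the ftindex script later in this message prints
them.  The sketch below is a simplified stand-in written for POSIX sh
and awk rather than rc; index_file and sample.txt are hypothetical
names, and its word rule is deliberately cruder than the real one.

```shell
#!/bin/sh
# Simplified sketch of the indexing step: emit "word file line" records.
# Not the author's script; the word rule here is an assumption.
index_file() {
	awk '{
		line = tolower($0)
		gsub(/[^a-z0-9_-]/, " ", line)	# everything else becomes a space
		n = split(line, w, " ")
		for (i = 1; i <= n; i++)
			if (w[i] ~ /^[a-z]/ && length(w[i]) < 20)
				print w[i], FILENAME, NR
	}' "$1" | LC_ALL=C sort -u		# drop duplicate records
}

# sample.txt is a hypothetical two-line input file.
printf 'Alpha firmware\nboyd wrote: alpha!\n' >sample.txt
index_file sample.txt
```

Sorting the records by word is what lets a later lookup find a word
without reading the whole index.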

I can then search the 95MB of it fairly quickly:

	: alpha;  time ftlook -i $h/oth/index/mail boyd alpha firmware
	2002oct25
	9.1998
	arch/9fans.2001jul11
	arch/9fans.2001mar28
	arch/9fans.2001nov16
	arch/9fans.2001oct25
	arch/9fans.2002aug11
	arch/9fans.2002may13
	arch/9fans.2002may27
	arch/9fans.2002oct25
	arch/9fans.2002sep18
	0.05u 0.07s 0.26r 	 ftlook -i /usr/geoff/oth/index/mail boyd alpha ...

The named files contain all three words, ignoring case distinctions.
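
The matching rule can be illustrated on a toy index.  This is only a
sketch in POSIX sh and awk under assumed names (and_query, demo.index);
it scans the whole index linearly, whereas ftlook below uses look to
fetch only the relevant records.

```shell
#!/bin/sh
# Sketch: list the files whose index records cover every query word,
# ignoring case.  Usage: and_query indexfile word...
and_query() {
	idx=$1; shift
	awk -v want="$*" '
	BEGIN {
		n = split(tolower(want), q, " ")
		for (i = 1; i <= n; i++)
			if (!(q[i] in need)) { need[q[i]] = 1; nneed++ }
	}
	{
		w = tolower($1)
		# count each (word, file) pair once
		if ((w in need) && !((w, $2) in seen)) {
			seen[w, $2] = 1
			hits[$2]++
		}
	}
	END {
		for (f in hits)
			if (hits[f] == nneed)
				print f
	}' "$idx" | LC_ALL=C sort
}

# demo.index is a hypothetical index in "word file line" form.
printf '%s\n' 'alpha a.txt 1' 'boyd a.txt 3' 'firmware a.txt 9' \
	'alpha b.txt 2' 'boyd b.txt 2' >demo.index
and_query demo.index boyd alpha firmware	# prints a.txt
```

Here b.txt is left out because it has no record for firmware.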

ftlookword is specialised to use my mail index file and shows matching
lines with line numbers:

	: alpha; time ftlookword mmc
	9.1998:14172: in case of problems. the device name to use for the new yamahas is 'mmc'
	9.1998:14175: 1) by default for now, an mmc device is in test-write mode. this means that
	9.1998:14180:    it should be followed by a message saying "mmcgettoc: blank disc".
	9.1998:14212: scsi support didn't include the scsi-3 mmc command set).
	9.1998:14333: when i changed it to use the mmc cdda read command rather
	9.1998:14678: for mmc devices.
	arch/9fans.2001aug29:228: mmc.c does have this:
	arch/9fans.2001mar28:14272: hp scsi mmc optical jukeboxes (now in 9gb disk size!) are easily
	arch/9fans.2001mar28:14383: > money, hp scsi mmc optical jukeboxes (now
	arch/9fans.2001mar28:16889: 	boundary="upas-mnfzmmctbofdpmrurgpojpeugo"
	arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo
	arch/9fans.2001mar28:16919: --upas-mnfzmmctbofdpmrurgpojpeugo
	arch/9fans.2001mar28:16983: --upas-mnfzmmctbofdpmrurgpojpeugo--
	arch/9fans.2002jul1:22387: diff /n/d/acme/bin/source/acd//mmc.c ./mmc.c=0A=
	arch/9fans.2002oct25:11409: i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W
	optical:819: scsi support didn't include the scsi-3 mmc command set).
	0.62u 0.21s 2.85r 	 ftlookword mmc

Note that this matched upas mime boundary lines; apparently I need to
fine-tune the definition of `word'.
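
The boundary lines above contain mmc only inside a longer token.  One
possible refinement, not from the original scripts, is to print a line
only when the search term stands alone between non-alphanumeric
characters.  A minimal sketch in POSIX sh and awk; wholeword and
mail.txt are hypothetical names, and treating `-' as a separator is an
assumption.

```shell
#!/bin/sh
# Sketch: print file:line: text for lines containing the term as a whole
# token, ignoring case.  Usage: wholeword term file...
wholeword() {
	word=$1; shift
	awk -v w="$word" '
	BEGIN { w = tolower(w) }
	{
		line = tolower($0)
		gsub(/[^a-z0-9_]+/, " ", line)	# separators become single spaces
		if (index(" " line " ", " " w " "))
			print FILENAME ":" FNR ": " $0
	}' "$@"
}

# mail.txt is a hypothetical three-line sample.
printf '%s\n' 'an mmc device' '--upas-mnfzmmctbofdpmrurgpojpeugo' \
	'see mmc.c here' >mail.txt
wholeword mmc mail.txt
```

With this sample the first and third lines are printed and the
boundary line is skipped.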


# To unbundle, run this file
echo ftindex
sed 's/^X//' >ftindex <<'!'
X#!/bin/rc
X# index file... - generate full-text index
X#	indices can be combined via `sort -o bigindex -udf index*'
if (~ $#* 0)
X	* = /fd/0

X# there's a lot of redundancy in the awk output, so strip duplicates
X# in the output for each input file, then combine the stripped outputs.
X{
X	for (f) {
X		# limiting line & word length avoids indexing uuencoded &
X		# base64-encoded text.
X		awk '
X$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
X	$0 = tolower($0)
X	gsub(/=a0/, " ")
X	gsub(/[\/.,:;?!<>()[\]{}*=#%"''~|&^\\]/, " ")	# delete most specials
X	for (i = 1; i <= NF; i++)
X		if (length($i) < 20 &&
X		    $i ~ /^[a-z\/][a-z0-9\-_\/]*[a-z0-9]$/ &&
X		    $i !~ /^(x-|message-id:)/)
X			print $i, FILENAME, NR
X}' $f |
X			sort -udf +0 -2
X	}
X} |
X		sort -udf +0 -2
!
echo ftlook
sed 's/^X//' >ftlook <<'!'
X#!/bin/rc
X# ftlook [-i index] word... - search full-text index
fn usage {
X	echo usage: $1 '[-i index]' word... >[1=2]
X	exit usage
X}

idx=.index
if (test $#* -ge 2)
X	switch ($1) {
X	case -i
X		idx=$2
X		shift 2
X	case -*
X		usage $0
X	}
if (test $#* -lt 1)
X	usage $0

X{
X	for (arg) {
X		echo $arg
X		look -df -t ' ' $arg^' ' $idx
X	}
X} |
X	sort +1 |
X	awk '
function delwd(wd) { delete fileword[wd] }
function prmatch(w) {
X	any = 0
X	for (w in fileword)
X		any++
X	if (any == 0 && lastf != "")
X		print lastf
X}

BEGIN	{ lastf="" }
NF == 1	{ word[$1] = $1; fileword[$1] = $1; next }	# a word we must match
NF != 3 { print "badly formed index line: " $0 | "cat >[1=2]"; next }
X$2 == lastf { delwd($1); next }		# same old filename
X{
X	prmatch()

X	lastf = $2
X	for (w in fileword)
X		delete fileword[w]	# empty fileword
X	for (w in word)
X		fileword[w] = word[w]	# copy word to fileword
X	delwd($1)
X}
END	{ prmatch() }
X'
!
echo ftlookword
sed 's/^X//' >ftlookword <<'!'
X#!/bin/rc
cd $h/mbox
exec grep -n $1 `{ftlook -i /n/other/index/mail $1} /dev/null
!
