9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] full text search
From: nigel @ 2002-11-01 10:27 UTC
  To: 9fans

I'd like to offline (= overnight) index 'text' files so I can search them. Examples
would be email, source code. Any recommendations of suitable code to port?
I looked at glimpse, but was unsure about the license.






* Re: [9fans] full text search
From: Kenji Arisawa @ 2002-11-01 12:45 UTC
  To: 9fans

Hello Nigel,

> I'd like to offline (= overnight) index 'text' files so I can search them.
> Examples would be email, source code. Any recommendations of suitable code
> to port? I looked at glimpse, but was unsure about the license.

Take a look at Namazu:
http://www.namazu.org/

Namazu is the most popular full-text search engine in Japan.
I believe it assumes that text is encoded in ujis (EUC-JP), but it
may be easy to add support for UTF-8.
The web site is in English, so you can read the documentation.

Kenji Arisawa




* Re: [9fans] full text search
From: Geoff Collyer @ 2002-11-02  0:02 UTC
  To: 9fans

I've been meaning to port lq-text.  It's relatively small and claimed
to be 8-bit-clean.

In the meantime, I use the scripts below, written in part to learn
more about the problem.

I index my mail each night from cron:

	0 7 * * 0	cpu	cd mbox && ftindex * */* >/n/other/index/mail
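
A rough sketch of what an indexing pass of this kind might do, written
in Python rather than rc and awk so that it runs outside Plan 9.  It
follows the filtering in the ftindex script included in this message
(skip long lines that contain no blanks, fold to lower case, blank out
punctuation, keep short word-like tokens) and assumes an index of
`word filename linenumber' entries.  The names index_lines, SPECIALS,
WORD and TWO_TOKENS are made up for the illustration, and keeping the
first line number for each word is an assumption.

```python
import re

# Punctuation that is replaced by spaces before splitting into words.
SPECIALS = re.compile(r"""[/.,:;?!<>()\[\]{}*=#%"'~|&^\\]""")
# Word-like token: starts with a letter, ends with a letter or digit.
WORD = re.compile(r"[a-z][a-z0-9_-]*[a-z0-9]")
# A non-blank, a blank, a non-blank: the line holds at least two tokens.
TWO_TOKENS = re.compile(r"[^\t ][\t ][^\t ]")


def index_lines(filename, lines):
    """Return sorted (word, filename, lineno) entries, one per distinct word."""
    first_seen = {}
    for lineno, line in enumerate(lines, 1):
        # Long lines without blanks are most likely uuencoded or base64 data.
        if not line or (len(line) >= 40 and not TWO_TOKENS.search(line)):
            continue
        text = SPECIALS.sub(" ", line.lower().replace("=a0", " "))
        for word in text.split():
            if len(word) < 20 and WORD.fullmatch(word) and not word.startswith("x-"):
                # Assumption: remember only the first line a word appears on.
                first_seen.setdefault(word, lineno)
    return sorted((word, filename, lineno) for word, lineno in first_seen.items())
```

Calling index_lines once per file and merging the sorted results would
give a single index that can be searched by word.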

I can then search the 95MB of it fairly quickly:

	: alpha;  time ftlook -i $h/oth/index/mail boyd alpha firmware
	2002oct25
	9.1998
	arch/9fans.2001jul11
	arch/9fans.2001mar28
	arch/9fans.2001nov16
	arch/9fans.2001oct25
	arch/9fans.2002aug11
	arch/9fans.2002may13
	arch/9fans.2002may27
	arch/9fans.2002oct25
	arch/9fans.2002sep18
	0.05u 0.07s 0.26r 	 ftlook -i /usr/geoff/oth/index/mail boyd alpha ...

The named files contain all three words, ignoring case distinctions.
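
The lookup can be pictured as a set intersection.  The sketch below,
in Python rather than rc and awk, is not the ftlook script itself: it
swaps the look(1) search and the sorted awk pass for a plain scan and
a dictionary of sets, and it assumes index lines of the form
`word filename linenumber'.  The name files_with_all is hypothetical.

```python
def files_with_all(index_lines, words):
    """Return the sorted names of files whose index entries cover every word."""
    wanted = {w.lower() for w in words}
    found = {}  # filename -> set of wanted words seen for that file
    for line in index_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip badly formed index lines
        word, filename, _lineno = fields
        word = word.lower()
        if word in wanted:
            found.setdefault(filename, set()).add(word)
    # A file matches only if every wanted word was seen for it.
    return sorted(name for name, seen in found.items() if seen == wanted)
```

With an index holding entries for `alpha', `boyd' and `firmware',
files_with_all(index, ['boyd', 'alpha', 'firmware']) would name only
the files that have an entry for each of the three.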

ftlookword is specialised to use my mail index file and shows matching
lines with line numbers:

	: alpha; time ftlookword mmc
	9.1998:14172: in case of problems. the device name to use for the new yamahas is 'mmc'
	9.1998:14175: 1) by default for now, an mmc device is in test-write mode. this means that
	9.1998:14180:    it should be followed by a message saying "mmcgettoc: blank disc".
	9.1998:14212: scsi support didn't include the scsi-3 mmc command set).
	9.1998:14333: when i changed it to use the mmc cdda read command rather
	9.1998:14678: for mmc devices.
	arch/9fans.2001aug29:228: mmc.c does have this:
	arch/9fans.2001mar28:14272: hp scsi mmc optical jukeboxes (now in 9gb disk size!) are easily
	arch/9fans.2001mar28:14383: > money, hp scsi mmc optical jukeboxes (now
	arch/9fans.2001mar28:16889: 	boundary="upas-mnfzmmctbofdpmrurgpojpeugo"
	arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo
	arch/9fans.2001mar28:16919: --upas-mnfzmmctbofdpmrurgpojpeugo
	arch/9fans.2001mar28:16983: --upas-mnfzmmctbofdpmrurgpojpeugo--
	arch/9fans.2002jul1:22387: diff /n/d/acme/bin/source/acd//mmc.c ./mmc.c=0A=
	arch/9fans.2002oct25:11409: i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W
	optical:819: scsi support didn't include the scsi-3 mmc command set).
	0.62u 0.21s 2.85r 	 ftlookword mmc

Note that this matched upas mime boundary lines; apparently I need to
fine-tune the definition of `word'.
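
One possible contributor, offered as an assumption rather than a
diagnosis: the final grep in ftlookword matches substrings, so any
line containing the letters of the word is shown once its file has
been selected.  A minimal Python sketch of a whole-word filter
follows; the function name whole_word_lines and the choice of which
characters count as part of a word are illustrative only.

```python
import re


def whole_word_lines(word, lines):
    """Return (lineno, line) pairs where word occurs as a token of its own."""
    # Characters that may appear inside an indexed word; a match must not
    # be directly preceded or followed by one of them.
    inside = r"[a-z0-9_-]"
    pattern = re.compile(
        "(?<!" + inside + ")" + re.escape(word) + "(?!" + inside + ")",
        re.IGNORECASE,
    )
    return [(n, line) for n, line in enumerate(lines, 1) if pattern.search(line)]
```

Under this definition a line such as `for mmc devices.' is kept, while
a MIME boundary or base64 line that merely contains the letters is
dropped.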


# To unbundle, run this file
echo ftindex
sed 's/^X//' >ftindex <<'!'
X#!/bin/rc
X# index file... - generate full-text index
X#	indices can be combined via `sort -o bigindex -udf index*'
if (~ $#* 0)
X	* = /fd/0

X# there's a lot of redundancy in the awk output, so strip duplicates
X# in the output for each input file, then combine the stripped outputs.
X{
X	for (f) {
X		# limiting line & word length avoids indexing uuencoded &
X		# base64-encoded text.
X		awk '
X$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
X	$0 = tolower($0)
X	gsub(/=a0/, " ")
X	gsub(/[\/.,:;?!<>()[\]{}*=#%"''~|&^\\]/, " ")	# delete most specials
X	for (i = 1; i <= NF; i++)
X		if (length($i) < 20 &&
X		    $i ~ /^[a-z\/][a-z0-9\-_\/]*[a-z0-9]$/ &&
X		    $i !~ /^(x-|message-id$)/)	# colons were replaced by spaces above
X			print $i, FILENAME, NR
X}' $f |
X			sort -udf +0 -2
X	}
X} |
X		sort -udf +0 -2
!
echo ftlook
sed 's/^X//' >ftlook <<'!'
X#!/bin/rc
X# ftlook [-i index] word... - search full-text index
fn usage {
X	echo usage: $1 '[-i index]' word... >[1=2]
X	exit usage
X}

idx=.index
if (test $#* -ge 2)
X	switch ($1) {
X	case -i
X		idx=$2
X		shift 2
X	case -*
X		usage $0
X	}
if (test $#* -lt 1)
X	usage $0

X{
X	for (arg) {
X		echo $arg
X		look -df -t ' ' $arg^' ' $idx
X	}
X} |
X	sort +1 |
X	awk '
function delwd(wd) { delete fileword[wd] }
function prmatch(w) {
X	any = 0
X	for (w in fileword)
X		any++
X	if (any == 0 && lastf != "")
X		print lastf
X}

BEGIN	{ lastf="" }
NF == 1	{ word[$1] = $1; fileword[$1] = $1; next }	# a word we must match
NF != 3 { print "badly formed index line: " $0 >"cat >[1=2]"; next }
X$2 == lastf { delwd($1); next }		# same old filename
X{
X	prmatch()

X	lastf = $2
X	for (w in fileword)
X		delete fileword[w]	# empty fileword
X	for (w in word)
X		fileword[w] = word[w]	# copy word to fileword
X	delwd($1)
X}
END	{ prmatch() }
X'
!
echo ftlookword
sed 's/^X//' >ftlookword <<'!'
X#!/bin/rc
cd $h/mbox
exec grep -n $1 `{ftlook -i /n/other/index/mail $1} /dev/null
!



