* [9fans] full text search
From: nigel @ 2002-11-01 10:27 UTC (permalink / raw)
To: 9fans
I'd like to offline (= overnight) index 'text' files so I can search them. Examples
would be email, source code. Any recommendations of suitable code to port?
I looked at glimpse, but was unsure about the license.
* Re: [9fans] full text search
From: Kenji Arisawa @ 2002-11-01 12:45 UTC (permalink / raw)
To: 9fans
Hello Nigel,
>I'd like to offline (= overnight) index 'text' files so I can search them. Examples
>would be email, source code. Any recommendations of suitable code to port?
>I looked at glimpse, but was unsure about the license.
Take a look at
http://www.namazu.org/
Namazu is the most popular full-text search engine in Japan.
I believe it assumes that text is written in ujis, but it may be easy
to add support for utf-8.
The web site is written in English, so you can read the documents.
Kenji Arisawa
* Re: [9fans] full text search
From: Geoff Collyer @ 2002-11-02 0:02 UTC (permalink / raw)
To: 9fans
I've been meaning to port lq-text. It's relatively small and claimed
to be 8-bit-clean.
In the meantime, I use the scripts below, written in part to learn
more about the problem.
I index my mail each night from cron:
0 7 * * 0 cpu cd mbox && ftindex * */* >/n/other/index/mail
I can then search the 95MB of it fairly quickly:
: alpha; time ftlook -i $h/oth/index/mail boyd alpha firmware
2002oct25
9.1998
arch/9fans.2001jul11
arch/9fans.2001mar28
arch/9fans.2001nov16
arch/9fans.2001oct25
arch/9fans.2002aug11
arch/9fans.2002may13
arch/9fans.2002may27
arch/9fans.2002oct25
arch/9fans.2002sep18
0.05u 0.07s 0.26r ftlook -i /usr/geoff/oth/index/mail boyd alpha ...
The named files contain all three words, ignoring case distinctions.
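For readers without rc and look to hand, the matching rule can be paraphrased in Python. This is only a sketch under assumptions: the index is taken to be lines of "word file line-number", the sample entries and the name files_with_all_words are made up for illustration, and it scans the index linearly where ftlook uses look's binary search. It also lowercases the query words, which is an assumption about the intended behaviour rather than something the script does itself.

```python
from collections import defaultdict

def files_with_all_words(index_lines, words):
    """Return sorted file names whose index entries cover every query word."""
    wanted = {w.lower() for w in words}
    seen = defaultdict(set)          # file name -> query words found in it
    for line in index_lines:
        fields = line.split()
        if len(fields) != 3:         # expect "word file line-number"
            continue
        word, fname, _lineno = fields
        if word in wanted:
            seen[fname].add(word)
    return sorted(f for f, found in seen.items() if found == wanted)

# Hypothetical index entries, not taken from the real mail index.
index = [
    "alpha arch/9fans.2001jul11 12",
    "boyd arch/9fans.2001jul11 40",
    "firmware arch/9fans.2001jul11 41",
    "alpha optical 7",
    "firmware optical 9",
]
print(files_with_all_words(index, ["Boyd", "alpha", "firmware"]))
```

A file is reported only when every query word has at least one entry for it, which is the same AND rule the awk half of ftlook applies by emptying its fileword table.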
ftlookword is specialised to use my mail index file and shows matching
lines with line numbers:
: alpha; time ftlookword mmc
9.1998:14172: in case of problems. the device name to use for the new yamahas is 'mmc'
9.1998:14175: 1) by default for now, an mmc device is in test-write mode. this means that
9.1998:14180: it should be followed by a message saying "mmcgettoc: blank disc".
9.1998:14212: scsi support didn't include the scsi-3 mmc command set).
9.1998:14333: when i changed it to use the mmc cdda read command rather
9.1998:14678: for mmc devices.
arch/9fans.2001aug29:228: mmc.c does have this:
arch/9fans.2001mar28:14272: hp scsi mmc optical jukeboxes (now in 9gb disk size!) are easily
arch/9fans.2001mar28:14383: > money, hp scsi mmc optical jukeboxes (now
arch/9fans.2001mar28:16889: boundary="upas-mnfzmmctbofdpmrurgpojpeugo"
arch/9fans.2001mar28:16902: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16919: --upas-mnfzmmctbofdpmrurgpojpeugo
arch/9fans.2001mar28:16983: --upas-mnfzmmctbofdpmrurgpojpeugo--
arch/9fans.2002jul1:22387: diff /n/d/acme/bin/source/acd//mmc.c ./mmc.c=0A=
arch/9fans.2002oct25:11409: i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W
optical:819: scsi support didn't include the scsi-3 mmc command set).
0.62u 0.21s 2.85r ftlookword mmc
Note that this matched upas mime boundary lines; apparently I need to
fine-tune the definition of `word'.
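Judging from the scripts, the boundary lines probably do not come from the index at all: ftlookword runs grep -n over the files that ftlook names, and grep matches "mmc" anywhere in a line, while the boundary strings are longer than the 20-character limit that ftindex applies to words. One possible refinement, sketched here in Python, is to accept a line only when the word stands alone as a token. The token definition (letters, digits, '-', '_', '/') and the name whole_word_lines are assumptions, modelled loosely on ftindex.

```python
import re

def whole_word_lines(lines, word):
    """Yield (line_number, line) where `word` appears as a whole token."""
    # A token is assumed to be a run of letters, digits, '-', '_' or '/'.
    pattern = re.compile(
        r"(?<![a-z0-9_/-])" + re.escape(word.lower()) + r"(?![a-z0-9_/-])"
    )
    for number, line in enumerate(lines, start=1):
        if pattern.search(line.lower()):
            yield number, line

# Sample lines adapted from the ftlookword output above.
sample = [
    "scsi support didn't include the scsi-3 mmc command set).",
    'boundary="upas-mnfzmmctbofdpmrurgpojpeugo"',
    "mmc.c does have this:",
    'it should be followed by a message saying "mmcgettoc: blank disc".',
]
for number, line in whole_word_lines(sample, "mmc"):
    print(number, line)
```

With this rule the boundary line and "mmcgettoc" are skipped, while "mmc.c" still matches because '.' is not treated as part of a token.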
# To unbundle, run this file
echo ftindex
sed 's/^X//' >ftindex <<'!'
X#!/bin/rc
X# index file... - generate full-text index
X# indices can be combined via `sort -o bigindex -udf index*'
if (~ $#* 0)
X	* = /fd/0
X# there's a lot of redundancy in the awk output, so strip duplicates
X# in the output for each input file, then combine the stripped outputs.
X{
X	for (f) {
X		# limiting line & word length avoids indexing uuencoded &
X		# base64-encoded text.
X		awk '
X$0 != "" && (length($0) < 40 || $0 ~ /[^\t ][\t ][^\t ]/) {
X	$0 = tolower($0)
X	gsub(/=a0/, " ")
X	gsub(/[\/.,:;?!<>()[\]{}*=#%"''~|&^\\]/, " ")	# delete most specials
X	for (i = 1; i <= NF; i++)
X		if (length($i) < 20 &&
X		    $i ~ /^[a-z\/][a-z0-9\-_\/]*[a-z0-9]$/ &&
X		    $i !~ /^(x-|message-id:)/)
X			print $i, FILENAME, NR
X}' $f |
X			sort -udf +0 -2
X	}
X} |
X	sort -udf +0 -2
!
echo ftlook
sed 's/^X//' >ftlook <<'!'
X#!/bin/rc
X# ftlook [-i index] word... - search full-text index
fn usage {
X	echo usage: $1 '[-i index]' word... >[1=2]
X	exit usage
X}
idx=.index
if (test $#* -ge 2)
X	switch ($1) {
X	case -i
X		idx=$2
X		shift 2
X	case -*
X		usage $0
X	}
if (test $#* -lt 1)
X	usage $0
X{
X	for (arg) {
X		echo $arg
X		look -df -t ' ' $arg^' ' $idx
X	}
X} |
X	sort +1 |
X	awk '
function delwd(wd) { delete fileword[wd] }
function prmatch(w) {
X	any = 0
X	for (w in fileword)
X		any++
X	if (any == 0 && lastf != "")
X		print lastf
X}
BEGIN { lastf="" }
NF == 1 { word[$1] = $1; fileword[$1] = $1; next }	# a word we must match
NF != 3 { print "badly formed index line: " $0 | "cat >[1=2]"; next }
X$2 == lastf { delwd($1); next }	# same old filename
X{
X	prmatch()
X	lastf = $2
X	for (w in fileword)
X		delete fileword[w]	# empty fileword
X	for (w in word)
X		fileword[w] = word[w]	# copy word to fileword
X	delwd($1)
X}
END { prmatch() }
X'
!
echo ftlookword
sed 's/^X//' >ftlookword <<'!'
X#!/bin/rc
cd $h/mbox
exec grep -n $1 `{ftlook -i /n/other/index/mail $1} /dev/null
!
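As a rough cross-check of what ftindex's awk filter accepts, here is a Python paraphrase. It is a sketch rather than a port: the name index_entries and the sample message are invented for illustration, and where sort -udf +0 -2 keeps one entry per word and file without saying which, this version is assumed to keep the first line number seen.

```python
import re

SPECIALS = re.compile(r"""[/.,:;?!<>()\[\]{}*=#%"'~|&^\\]""")
WORD = re.compile(r"^[a-z/][a-z0-9\-_/]*[a-z0-9]$")
HAS_GAP = re.compile(r"[^\t ][\t ][^\t ]")   # whitespace between two words

def index_entries(filename, lines):
    """Return sorted (word, filename, line_number) tuples, one per word."""
    seen = set()
    entries = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip empty lines and long unbroken lines (uuencode, base64).
        if not line or not (len(line) < 40 or HAS_GAP.search(line)):
            continue
        line = line.lower().replace("=a0", " ")
        line = SPECIALS.sub(" ", line)
        for token in line.split():
            if len(token) >= 20 or not WORD.match(token):
                continue
            if token.startswith(("x-", "message-id:")):
                continue
            if token not in seen:        # assumed: keep first occurrence
                seen.add(token)
                entries.append((token, filename, number))
    return sorted(entries)

# Hypothetical message body; the last line imitates base64 text.
message = [
    "Subject: MMC firmware\n",
    "\n",
    "The new Yamaha drives use the mmc command set.\n",
    "i3gqQNvpcMMNOgFyXmD2htxpBR3ZNGOFi6HC3Cw0H9nVvUVdC/3jWO6eN4InFBeifFUxmmc65T8W\n",
]
for entry in index_entries("msg", message):
    print(*entry)
```

The base64-like line is dropped by the line-length test before any word is examined, and single-character tokens never match because the word pattern needs at least two characters.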