Gnus development mailing list
From: Harry Putnam <reader@newsguy.com>
Cc: ding@gnus.org,
	Norbert Gövert <goevert@amaunet.cs.uni-dortmund.de>,
	Ulrich Pfeifer <upf@de.uu.net>
Subject: Re: nnir/freeWAIS-sf
Date: 16 Jul 2000 09:17:41 -0700
Message-ID: <m2em4uaw2i.fsf@reader.ptw.com>
In-Reply-To: Kai.Grossjohann@CS.Uni-Dortmund.DE's message of "Sun, 16 Jul 2000 14:25:42 +0200"

NOTE: If you want the punch line first, skip straight to the section below
preceded by:
"*NOW THE GOOD PART!*"

Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai Großjohann) writes:

> I have now found out something with respect to searching in `from' and
> `to' fields.
> 
> (For Norbert & Uli: I'm indexing my mail with freeWAIS-sf, and
> specifying `SOUNDEX BOTH' in the format file for the `to' and `from'
> fields.)

Above, are you referring to the example *.fmt file in nnir-1.57 as it stands?

> 
> Running `dictionary mail_field_from' tells me that the dictionary
> contains a lot of soundex codes (which is the right thing, since we
> are specifying soundex in the format file).  And indeed, searching for
> the soundex codes works:

Ditto here if soundex codes look like:
term                   occurances  pointer
m300                     83886080  5687214
m320                    536870912  1810235
[...]
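
For anyone wondering where codes like m300 come from: below is a minimal
Python sketch of the classic American Soundex rules, lowercased to match
the dictionary listing.  It is only an illustration; I have not checked it
against freeWAIS-sf's own encoder, which may differ in details, and the
function name `soundex' is mine.

```python
# Classic Soundex letter groups; vowels, y, h and w carry no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character lowercase Soundex code, or "" for no letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")      # the first letter's digit still counts
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal digits
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code                  # a vowel resets prev to ""
    return ("".join(out) + "000")[:4]


print(soundex("Ronan"))   # r550
print(soundex("Matt"))    # m300
```

If the `from' dictionary holds only such codes, it would contain r550
rather than the literal word, which would fit with a plain from=Ronan
query coming back empty.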


> 
> /----
> | $ waissearch -d mail from=m230
> | 
> |  Search Response:
> |   NumberOfRecordsReturned: 16
> |    1: Score: 3471, lines:  55 '445   /home-local/grossjoh/Mail/auto/linux-utf8/'
> |    2: Score: 3471, lines: 111 '491   /home-local/grossjoh/Mail/auto/linux-utf8/'
> |    3: Score: 3098, lines:  79 '3416   /home-local/grossjoh/Mail/auto/dbworld/'
> | [...]
> \----

> But searching for normal terms does _not_ work.

 I see nearly the opposite behavior: From and To searches find nothing,
 but free text and Subject searches work.


Using your example *.fmt file and this indexing command:
 waisindex -r -d mail -stem -t fields ~/Mail

Free text search (this database contains my ding-list and bbdb-list nnml
directories):
waissearch -d mail agent

 Search Response:
  NumberOfRecordsReturned: 6
   1: Score: 1546, lines: 229 '2591   /home/reader/Mail/ding2/'
   2: Score: 1434, lines:  60 '2592   /home/reader/Mail/ding2/'
   3: Score: 1402, lines:  83 '2783   /home/reader/Mail/ding2/'
   4: Score: 1378, lines:  82 '1964   /home/reader/Mail/ding2/'
   5: Score: 1349, lines:  99 '2611   /home/reader/Mail/ding2/'
   6: Score: 1123, lines:  55 '2514   /home/reader/Mail/ding2/'
 
Subject search:
waissearch -d mail subject=give

 Search Response:
  NumberOfRecordsReturned: 6
   1: Score: 2484, lines:  87 '2771   /home/reader/Mail/ding2/'
   2: Score: 2222, lines: 153 '2791   /home/reader/Mail/ding2/'
   3: Score: 2222, lines:  69 '2793   /home/reader/Mail/ding2/'
   4: Score: 2222, lines: 160 '2   /home/reader/Mail/bbdb/'
   5: Score: 2222, lines:  76 '3   /home/reader/Mail/bbdb/'
   6: Score: 2222, lines: 174 '11   /home/reader/Mail/bbdb/'

From search:
waissearch -d mail from=Ronan
 Search Response:
  NumberOfRecordsReturned: 1
   1: Score:    0, lines:3457 'Search produced no result. Here's the Catalog for database: mail'

However, by changing SOUNDEX BOTH to TEXT BOTH I find that the from
search then works:

From search with edited *.fmt file:
waissearch -d mail from=Ronan

 Search Response:
  NumberOfRecordsReturned: 4
   1: Score: 2403, lines: 153 '2791   /home/reader/Mail/ding2/'
   2: Score: 2403, lines:  69 '2793   /home/reader/Mail/ding2/'
   3: Score: 2403, lines: 160 '2   /home/reader/Mail/bbdb/'
   4: Score: 2403, lines:  76 '3   /home/reader/Mail/bbdb/'

At first I thought BOTH meant LOCAL and GLOBAL, but now I think it doesn't,
because I found that if I set the field-specific parts to LOCAL, a free
text search fails and reports that it cannot find the dataindex.

The last two lines of indexing output, with the *.fmt fields set to TEXT
LOCAL, tell the story:

1731: 3481: Jul 16 08:37:55 2000: 100: Total word count for dictionary is: 0
1731: 3482: Jul 16 08:37:55 2000: -1: error finding total_word_count in dictionary ./mail

No dictionary is built.

*NOW THE GOOD PART!*

Poring through the info file, I found this passage that finally might
be an explanation of what is needed.  And to corroborate this I tried
setting `To' and `From' to SOUNDEX LOCAL TEXT BOTH like this:

 to "To and Cc headers" SOUNDEX LOCAL TEXT  BOTH

Telling the indexer to "put the word in the default and the  'to'[ed -hp]
category and its soundex code only in the `to'[ed -hp] category."

So BOTH here means .. the default (which is LOCAL) and the current category.

And all types of searches now work.... Whoopee

The output of `dictionary mail_field_from mail' is now more than twice its
previous size and contains both SOUNDEX codes and TEXT terms.

From the INFO file:

   Consider the following example:
   
        region: /^AU: /
                au "author names" SOUNDEX LOCAL TEXT BOTH
        end: /^[A-Z][A-Z]:/
   
   To the indexer this means:
   
   For all words starting with `AU: ' at the beginning of a line up to a
   line which starts with two capital letters followed by a colon and a
   blank, put the word in the default and the `au' category and its soundex
   code only in the `au' category.
   
   Thus an author name can be found in the created database in the default
   category or the `au' category if the exact spelling is known. If the
   name is misspelled, it might be found using the query `au=(soundex
   MISSPELLED-NAME)'. *Note Sample Format::, *Note Query Syntax::.
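
To make that passage concrete, here is a toy Python model of where a term
lands for each scope keyword.  It follows the wording of the passage for
LOCAL and BOTH; treating GLOBAL as "default category only" is my
assumption, based on the keyword name.  The names DEFAULT, targets,
index_word and fake_soundex are made up for the sketch, and fake_soundex
is a stand-in rather than a real encoder.

```python
from collections import defaultdict

DEFAULT = "<default>"   # hypothetical label for the default category


def targets(scope, field):
    """Categories a term is stored in for a given scope keyword."""
    if scope == "LOCAL":
        return [field]              # field category only
    if scope == "GLOBAL":
        return [DEFAULT]            # assumption: default category only
    if scope == "BOTH":
        return [DEFAULT, field]     # default and field category
    raise ValueError("unknown scope: " + scope)


def index_word(index, word, field, spec, encode):
    """spec maps an index type ("TEXT" or "SOUNDEX") to a scope keyword."""
    for kind, scope in spec.items():
        term = encode(word) if kind == "SOUNDEX" else word.lower()
        for category in targets(scope, field):
            index[category].add(term)


def fake_soundex(word):
    """Stand-in encoder: just tags the first letter."""
    return "S:" + word[0].lower()


index = defaultdict(set)
index_word(index, "Ronan", "from",
           {"SOUNDEX": "LOCAL", "TEXT": "BOTH"}, fake_soundex)

print(sorted(index["from"]))    # ['S:r', 'ronan']
print(sorted(index[DEFAULT]))   # ['ronan']
```

In this model the literal word is reachable both by a free text query and
by a field query, while the soundex code lives only in the field category.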


Included working *.fmt file:

 # Harry's rendition of Kai's  format file for freeWAIS-sf for indexing mails.
 # Each mail is in a file, much like the MH format.
                                         
 # Document separator should never match -- each file is a document.
 record-sep: /^@this regex should never match@$/
                                         
 # Searchable fields specification.      
                                         
 region: /^[sS]ubject:/ /^[sS]ubject: */ 
         subject "Subject header" stemming TEXT BOTH
 end: /^[^ \t]/                          
                                         
 region: /^([tT][oO]|[cC][cC]):/ /^([tT][oO]|[cC][cC]): */
         to "To and Cc headers" SOUNDEX LOCAL TEXT  BOTH
 end: /^[^ \t]/                          
                                         
 region: /^[fF][rR][oO][mM]:/ /^[fF][rR][oO][mM]: */
         from "From header" SOUNDEX LOCAL TEXT BOTH
 end: /^[^ \t]/                          
                                         
 region: /^$/                            
         stemming TEXT GLOBAL           
 end: /^@this regex should never match@$/
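
A rough way to see what the `from' region above would capture is to replay
its three regexes with Python's re module.  This is only a sketch under
assumptions I have not verified against freeWAIS-sf: that the second regex
marks a prefix to skip, and that the end regex is tried only on lines after
the one where the region starts.  The folded header below is a made-up
sample, and from_region is a hypothetical helper.

```python
import re

START = re.compile(r"^[fF][rR][oO][mM]:")    # first regex of the region
SKIP = re.compile(r"^[fF][rR][oO][mM]: *")   # second regex: assumed skipped prefix
END = re.compile(r"^[^ \t]")                 # end regex: next unindented line


def from_region(lines):
    """Return the text of the first From region, or None if there is none."""
    collected = None
    for line in lines:
        if collected is None:
            if START.search(line):
                collected = [SKIP.sub("", line, count=1)]
        elif END.search(line):
            break                    # a new header starts, region is over
        else:
            collected.append(line.strip())   # folded continuation line
    return " ".join(collected) if collected is not None else None


header = [
    "Subject: Re: nnir/freeWAIS-sf",
    "From: Harry Putnam",
    "\t<reader@newsguy.com>",
    "Date: 16 Jul 2000 09:17:41 -0700",
]
print(from_region(header))   # Harry Putnam <reader@newsguy.com>
```

The point of the /^[^ \t]/ end pattern is that folded continuation lines,
which start with a blank or a tab, stay inside the region.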


