Gnus development mailing list
* Notes regarding wais *.fmt files
@ 2000-07-22 15:27 Harry Putnam
  2000-08-09 14:26 ` Janne Rinta-Manty
  0 siblings, 1 reply; 2+ messages in thread
From: Harry Putnam @ 2000-07-22 15:27 UTC



Observations and questions about the (notorious) *.fmt file.

Indexing command:
   waisindex -r -d mail -stem -t fields ~/Mail

Experiments here have shown that the example format file in nnir.el
does *not* work as it stands.  Some aspects do work when the query is
presented correctly.

An example is the SOUNDEX query on the fields `from' and `to', which
works when submitted in the right form.

As Kai has pointed out, it needs to look like this:
Query: from=(soundex Francisco)  

[  30: Paul Franklin       ] [2525: bbdb/20] Re: Postal codes
[  14: Karl Eichwalder     ] [2253: ding2/1987] Re: The <word>.<word>..
[  88: Francisco Solsona   ] [2086: ding2/2830] Re: nnir/freeWAIS-sf
[  66: Francisco Solsona   ] [2086: ding2/2810] Re: nnir/freeWAIS-sf

  (NOTE: The closest thing I could find to `Francisco' in the
  Eichwalder hit was: Received: from rachael.franken.de
  (rachael.franken.de [193.175.24.38]))

Now making the same query without SOUNDEX:

Query: from=Francisco
(Fails)

Subject queries work when formulated as below:
Query: subject=WORD 
       subject=(WORD and WORD2)
       subject=(nnimap and agent) and (from=soundex paul) 

 
Any query that is a `free text' query fails.

Query: WORD
(Fails)

Analyzing this behavior seems to point to a problem with the regexp at
the end of the example file.

 region: /^$/                            
         stemming TEXT GLOBAL            
 end: /^@this regex should never match@$/

/^$/, as discovered by Francisco Solsona, seems to be the problem.  I am
not sure why, since this is a standard way of finding the first blank
line after the headers.  Francisco has changed that to /^Xref:/, which
seems to work well in nnml groups.

My own experiments indicate that changing that regexp to:
/^\n/ also causes the GLOBAL section to start working.

With the region defined like this:

 region: /^\n/                            
         stemming TEXT GLOBAL            
 end: /^@this regex should never match@$/

Free text queries work:

Query: (agent and function)
[  47: Simon Josefsson     ] [180: ding2/2800] Re: (provide 'nnmaildir)
[  41: -> ding@gnus.org    ] [165: ding2/2114] generate nov data from.,.
[ 187: Simon Josefsson     ] [135: ding2/2591] [patch] gnus agent
[ 169: Shenghuo ZHU        ] [116: ding2/2565] Re: MIME Security with PGP (RFC2015)
[ 624: Lars Magne Ingebrigt] [93: ding2/2594] Gnus v5.8.7 is released

Questions:

In a region defined as above, with no match for the end regexp, where
and when does the regexp stop searching?  Does it just scan the full
input, that is, the entire incoming data stream?  Or does it restart on
every file?

Can the freeWAIS code be hacked somehow to allow an identifier for the
end of file to be recognized and specified in the `format' file?

It must already `know' about the end of each file, or it could not
supply file names to each `document'.  Is there some built in
mechanism that keeps track of and notices when a file has ended?

Finally, with a non-matching beginning regexp, how does freeWAIS know
about the beginning of a document?  Again, there must be a mechanism in
place that keeps track of that.  If so, can it be identified somehow in
the `record-sep:' entry of the format file?





* Re: Notes regarding wais *.fmt files
  2000-07-22 15:27 Notes regarding wais *.fmt files Harry Putnam
@ 2000-08-09 14:26 ` Janne Rinta-Manty
  0 siblings, 0 replies; 2+ messages in thread
From: Janne Rinta-Manty @ 2000-08-09 14:26 UTC


Harry Putnam 2000-07-22T15:29:59Z:
HP> Observations and questions about the (notorious) *.fmt file.

Even the freeWAIS-sf manual describes *.fmt files as 'horrible'.

HP> My own experiments indicate that changing that regexp [/^$/] to:
HP> /^\n/ also causes the GLOBAL section to start working.

I had the same results. It's strange, but seems to work.

HP> With the region defined like this:

HP> region: /^\n/                            
HP>          stemming TEXT GLOBAL            
HP> end: /^@this regex should never match@$/

HP> Free text queries work:
HP> Query: (agent and function)

Yes, in the body. A query like from=francisco still doesn't work with
the example format file; the fix is to add TEXT BOTH to the to and
from fields.

I was going to suggest changing the end regexp, too, because it looked
as if a message that happened - for some weird reason :) - to contain
the line

@this regex should never match@

would have everything after that line left out of the index. However,
this seems not to be the case. The regexp doesn't match the above line
unless either the $ is dropped or \n is inserted before $.

Based on the freeWAIS-sf manual and experiments I think that

 1) any regexp that doesn't match any line matches the end of file
    (for example, /\n\n/, /^$/),

 2) an empty regexp (//) matches the beginning of file,

 3) /\n/, /^\n/, /\n$/, and /^\n$/ match an empty line, and

 4) the regexps don't work quite as one would expect.
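
One way to make sense of points 1-3 and of the end-regexp behavior
above is a small model, sketched here in Python. It is only a guess that
happens to fit the observations in this thread; it has not been checked
against the freeWAIS-sf source. The assumptions are: each regexp is
tried at the start of one line at a time, the line still carries its
trailing newline, and `$' means the absolute end of that buffer. The
name fmt_matches and the sample lines are made up for illustration.

```python
import re


def fmt_matches(pattern: str, line: str) -> bool:
    """Hypothetical model of how a *.fmt regexp is applied to one line.

    Assumptions (not verified against freeWAIS-sf): the match is anchored
    at the start of the line, the line keeps its trailing newline, and
    `$` only matches at the absolute end of the buffer.
    """
    # Python's `$` also matches just before a trailing newline, so a
    # trailing unescaped `$` is swapped for `\Z` (absolute end of string).
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1] + r"\Z"
    # re.match only matches at the beginning of the string.
    return re.match(pattern, line) is not None


EMPTY = "\n"                                   # a blank line
TEXT = "some body text\n"                      # an ordinary line
MARKER = "@this regex should never match@\n"   # the "impossible" line

# Point 1: /\n\n/ and /^$/ match no line at all under this model.
for pat in (r"\n\n", r"^$"):
    assert not fmt_matches(pat, EMPTY)
    assert not fmt_matches(pat, TEXT)

# Point 2: an empty regexp matches any line, hence the very first one.
assert fmt_matches("", TEXT)

# Point 3: these four match a blank line and nothing else.
for pat in (r"\n", r"^\n", r"\n$", r"^\n$"):
    assert fmt_matches(pat, EMPTY)
    assert not fmt_matches(pat, TEXT)

# The end regexp misses the marker line as written, but matches once
# the `$` is dropped or `\n` is inserted before it.
assert not fmt_matches(r"^@this regex should never match@$", MARKER)
assert fmt_matches(r"^@this regex should never match@", MARKER)
assert fmt_matches(r"^@this regex should never match@\n$", MARKER)
```

If the model holds, /^$/ fails simply because the newline sits between
`^' and `$', which would also explain why /^\n/ works for the GLOBAL
region.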

-- 
Janne Rinta-Mänty



