Gnus development mailing list
* Searching a news server
@ 2002-01-06 19:24 Lars Magne Ingebrigtsen
  2002-01-06 23:39 ` Russ Allbery
  0 siblings, 1 reply; 5+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-01-06 19:24 UTC


(I think the ding mailing list must be backlogged or something,
because it's five hours since the previous message to the list was
delivered here...)

I've now written scripts for swishing over the spool and searching the
indexes.  I've done it in a fairly straightforward manner -- the
indexing script goes over the spool and indexes any group that is
newer than the index for that group.  (The index files are kept in a
separate spool that apes the main spool.)
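
A minimal sketch of the "newer than the index" check described above, in Python.  The directory layout (gnu.emacs.gnus stored under gnu/emacs/gnus in both spools) and the function names are assumptions for illustration, not the actual scripts.

```python
import os


def newest_mtime(directory):
    """Newest modification time among the files directly in `directory`.

    Returns 0.0 when the directory is missing or holds no files, so a
    group without an index always counts as out of date.
    """
    try:
        entries = list(os.scandir(directory))
    except FileNotFoundError:
        return 0.0
    return max((e.stat().st_mtime for e in entries if e.is_file()), default=0.0)


def groups_needing_index(spool_root, index_root, groups):
    """Yield each group whose spool directory is newer than its index.

    Assumes (hypothetically) that the index spool mirrors the main spool,
    with dots in the group name mapped to directory separators.
    """
    for group in groups:
        rel = group.replace(".", os.sep)
        spool_dir = os.path.join(spool_root, rel)
        index_dir = os.path.join(index_root, rel)
        if newest_mtime(spool_dir) > newest_mtime(index_dir):
            yield group
```

Only files directly inside a group's directory are considered, so a new article in gnu.emacs.gnus does not mark gnu.emacs as stale.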

The searching function takes a group name, the requested number of
matches, and the search string in this format:

gnu.emacs.gnus:100:emacs sex mpeg

(which I think will be a pretty representative search string.)
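
For illustration, a request in this format could be split as follows.  This is a Python sketch; the names SearchRequest and parse_search_request are made up here and are not part of the scripts.

```python
from typing import NamedTuple


class SearchRequest(NamedTuple):
    group: str        # newsgroup to search
    max_matches: int  # requested number of matches
    query: str        # free-form search string


def parse_search_request(line: str) -> SearchRequest:
    """Split 'group:count:query' on the first two colons only.

    Limiting the split keeps any colons inside the query intact.
    Raises ValueError if a field is missing or the count is not a number.
    """
    group, count, query = line.split(":", 2)
    return SearchRequest(group, int(count), query)
```

With the example above, parse_search_request("gnu.emacs.gnus:100:emacs sex mpeg") gives the group "gnu.emacs.gnus", a limit of 100 and the query "emacs sex mpeg".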

Now, if someone would hack inn to take an XSEARCH command, then I
could have Gnus offer a command to use this...

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen




* Re: Searching a news server
  2002-01-06 19:24 Searching a news server Lars Magne Ingebrigtsen
@ 2002-01-06 23:39 ` Russ Allbery
  2002-01-19 20:46   ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 5+ messages in thread
From: Russ Allbery @ 2002-01-06 23:39 UTC


Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Now, if someone would hack inn to take an XSEARCH command, then I could
> have Gnus offer a command to use this...

I have a tree that contains a hack like this, but I haven't looked to see
how ugly it was.  The following expired I-D may be of interest.


INTERNET-DRAFT                                     N. Ballou (Microsoft)
Expires: December 1, 1997               B. Hernacki & B. Polk (Netscape)
<draft-ballou-nntpsrch-03.txt>                               May 1, 1997



                   NNTP Full-text Search Extension 



1.  Status of this Memo 

This  document is an Internet-Draft.   Internet-Drafts are working docu-
ments of the Internet Engineering Task Force (IETF),  its areas, and its
working groups.  Note that   other groups  may also  distribute  working
documents as Internet-Drafts. 

Internet-Drafts are draft documents valid   for a maximum of six  months
and may be updated,  replaced, or obsoleted   by other documents  at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.'' 

To  learn the current status   of any  Internet-Draft, please check  the
``1id-abstracts.txt''  listing  contained in  the Internet-Drafts Shadow
Directories on ds.internic.net  (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim). 

2.  Abstract 

This  document describes  a   set of enhancements  to the   Network News
Transport  Protocol [NNTP-977] that  allows  full-text searching of news
articles in multiple newsgroups.   The proposed SEARCH command  supports
functionality similar to the [IMAP4] SEARCH command, minus user specific
search keys (i.e., ANSWERED,  DRAFT, FLAGGED, KEYWORD, NEW, OLD, RECENT,
SEEN) and minus search keys based  on headers that  do not exist in news
(i.e., CC, BCC, TO).

The availability of the extensions described  here will be advertised by
the  server using  the extension negotiation-mechanism  described in the
new NNTP protocol specification currently being developed [NNTP-NEW]. 

3.  Introduction 

The NNTP SEARCH command is sent from the client to the server to specify
and initiate a full-text  search on articles  in one or more newsgroups.
The NNTP SEARCH command is a subset of the  [IMAP4] SEARCH command, with
user property  and mail-specific header search  keys not present in NNTP
SEARCH.  The result of an NNTP SEARCH is OVER data as specified in
[NNTP-NEW] for each article that satisfies the search criteria. 

In addition, the XPAT command is extended so that  it  can  be  used  to
full-text  search  articles within a single newsgroup.  Both the headers
and the body of the articles are searched.

3.1.  New and Enhanced NNTP Commands

There is one new NNTP command, two new options to the existing LIST
command, and an enhancement to one existing command.

*    SEARCH

*    LIST SRCHFIELDS

*    LIST SEARCHABLE

*    XPAT

The SEARCH command runs a one-time search, returning overview-like data.

The LIST SRCHFIELDS command returns the fields that the server allows in
full-text searches.

The LIST SEARCHABLE command allows the client to determine  which  news-
groups are full-text searchable.

The XPAT command allows the pseudo-header  ":TEXT".   This  specifies  a
full-text  (headers  and  body) search of the articles in a single news-
group.

4.  Use of NNTP Extension Mechanism

The NNTP extension mechanism allows a server to describe  its  capabili-
ties.   The  following  extensions are used to describe the capabilities
described in this document.

4.1.  SEARCH Extension

The SEARCH extension means that the server supports the  following  com-
mands: SEARCH, LIST SEARCHABLE, LIST SRCHFIELDS.

4.2.  XPATTEXT Extension

The XPATTEXT extension means that the server supports the  :TEXT  header
in the XPAT command, as described by this document.

5.  Command Descriptions

5.1. SEARCH Command

Arguments: optional character set specification 
           optional newsgroup specification
           searching criteria (one or more) 

Responses: 224 overview information follows
           412 no news group selected
           462 error performing search
           501 command syntax error
           502 no permission

The SEARCH  command searches the newsgroup for  articles that  match the
given searching criteria.   Searching  criteria consist of one   or more
search keys.  If there are articles that  match the search criteria, the
server responds with  code 224 and returns  OVER data for each  matching
article in a format similar to that described in [NNTP-NEW].  The one change
from  [NNTP-NEW]  OVER  format  is  to  change the article number  field 
to a format that supports searches over multiple newsgroups. The article 
ID  field  for  SEARCH  OVER  data  will use the format newsgroup:art-ID 
rather than just an article number as defined in [NNTP-NEW]. 

A response of 421 indicates  that there are  no articles that match  the
search  criteria.  A  response  of 501 indicates a   syntax error in the
search  criteria.  A response  of 502 indicates   that the user does not
have permission to search one  or more of the  specified newsgroups.  If
the search criteria did not specify a newsgroup, and there is no current
newsgroup  (i.e.,  set using the NNTP   GROUP command), then  the server
returns  the error  code 412,   indicating that  no newsgroup has   been
specified.   A response of 462  indicates that the server encountered an
error when processing the search.

When multiple keys  are specified, the result  is the  intersection (AND
function) of all the  messages that match  those keys.  For example, the
criteria FROM "SMITH" SINCE 1-Feb-1994 refers to all articles from Smith
that were placed in the newsgroup since February 1,  1994.  A search key
may also be a parenthesized list of one  or more search  keys (e.g.  for
use with the OR and NOT keys). 

Server  implementations  MAY exclude [MIME-1]  body  parts with terminal
content  types other than TEXT and  MESSAGE from consideration in SEARCH
matching. 

The optional character set  specification consists of the word "CHARSET"
followed by a registered MIME character set.  It indicates the character
set of the strings that appear in the search criteria.  [MIME-2] strings
that   appear in  RFC 822/MIME  message   headers, and [MIME-1]  content
transfer  encodings,  MUST be decoded     before matching.  Except   for
US-ASCII, it    is not required  that  any  particular character  set be
supported.  If the server does  not support the specified character set,
a 462 error code is returned.

The optional newsgroup specification consists of the word "IN"  followed
by  either  a  wildcard  character  "*"  -  indicating a search over all 
newsgroups  - or a list  of  newsgroup  names  separated  by commas.  A 
newsgroup name can end with the wildcard string ".*" indicating a search 
over  a  sub-hierarchy  of  the  newsgroup name  space.  If no newsgroup 
specification  is  given,  the search is over the current newsgroup.  If 
there is no current newsgroup, the server returns the 412 error code.

In all search  keys that use strings,  a message matches  the key if the
string is a substring of the field.  The matching is case-insensitive. 

The ON, BEFORE, and SINCE search criteria use the same  date as  used in
the NNTP NEWNEWS command - the date the article arrived  on  the server.
A server indicates support for the ON, BEFORE, and SINCE search criteria
by listing :Date in the LIST SRCHFIELDS response.

The defined   search keys are as  follows.   Refer to the  Formal Syntax
section for the precise syntactic definitions of the arguments. 

      <message range> Articles with article numbers corresponding to the 
                      specified range. 

      ALL             All Articles in the current newsgroup; the default 
                      initial key for ANDing.

      BEFORE <date>   Articles whose server arrival date is earlier than 
                      the specified date.

      BODY <string>   Articles that contain the specified string in the
                      body of the message.

      FROM <string>   Articles that contain the specified string in the
                      article structure's FROM field.

      HEADER <field-name> <string>
                      Articles that have a header with the specified
                      field-name (as defined in [RFC-822]) and that
                      contains the specified string in the [RFC-822]
                      field-body.

      LARGER <n>     Articles with a size larger than the specified 
                     number of octets.

      NOT <search-key>
                     Articles that do not match the specified search
                     key.

      ON <date>      Articles whose server arrival date is within the
                     specified date.

      OR <search-key1> <search-key2>
                     Articles that match either search key.

      SENTBEFORE <date>
                     Articles whose [RFC-822] Date: header is earlier
                     than the specified date.

      SENTON <date>  Articles whose [RFC-822] Date: header is within the
                     specified date.

      SENTSINCE <date>
                     Articles whose [RFC-822] Date: header is within or
                     later than the specified date.

      SINCE <date>   Articles whose server arrival date is within or 
                     later than the specified date.

      SMALLER <n>    Articles with a size smaller than the specified 
                     number of octets.

      SUBJECT <string>
                     Articles that contain the specified string in the
                     envelope structure's SUBJECT field.

      TEXT <string>  Articles that contain the specified string in the
                     header or body of the message.

   Example: C: SEARCH FROM "Smith" SINCE 1-Feb-1994
            S: 224 overview information follows
            S: comp.object:573 \t RE: object-oriented langs \t \
               "John Smith" <JSmith@xyz.com> \t Sun, 03 Nov 1996 \
               14:25:05 -0800 \t <01cbc9d5f3c70$eab9a2cd@xyz.com> \
               \t 4080 \t 33 
            S: .

   Note: each field in OVER response is separated by a tab - shown as a
         \t in the example above.
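
As a sketch of how a client might read one line of this response (Python; the function name is hypothetical), the first tab-separated field is split into newsgroup and article number and the remaining fields are returned in order.  No meaning is assigned to the remaining fields beyond their position, since the draft defers to [NNTP-NEW] for the OVER format.

```python
def parse_search_overview(line):
    """Parse one SEARCH response line into (newsgroup, article_number, fields).

    Fields are separated by tabs.  The first field has the form
    newsgroup:art-ID; the rest are returned as a list of strings.
    """
    fields = [f.strip() for f in line.split("\t")]
    # Split on the last colon; newsgroup names themselves contain only dots.
    group, _, number = fields[0].rpartition(":")
    if not group or not number.isdigit():
        raise ValueError("expected newsgroup:art-ID, got %r" % fields[0])
    return group, int(number), fields[1:]
```

Applied to the example line above, this would yield "comp.object", 573, and six further fields starting with the subject.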

5.1.1.  Search Formal Syntax

The search query syntax is derived from the search  syntax  defined  for
the  IMAP4 protocol.  It is somewhat different because of the way inter-
national character sets need to be encoded.  

The following syntax specification  uses the augmented Backus-Naur  Form
(BNF) notation as specified in [RFC-822].

Except as   noted otherwise,  all    alphabetic characters   are   case-
insensitive.  The use of upper or  lower case characters to define token
strings is  for editorial  clarity  only.  Implementations   MUST accept
these strings in a case-insensitive fashion.

   astring         ::= atom / string

   atom            ::= 1*ATOM_CHAR

   ATOM_CHAR       ::= <any CHAR except atom_specials>

   atom_specials   ::= "(" / ")" / SPACE / CTL / "*" / quoted_specials

   CHAR            ::= <any 7-bit US-ASCII character except NUL,
                        0x01 - 0x7f>

   CTL             ::= <any ASCII control character and DEL,
                        0x00 - 0x1f, 0x7f>

   date            ::= date_text / <"> date_text <">

   date_day        ::= 1*2digit
                       ;; Day of month

   date_month      ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
                       "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

   date_text       ::= date_day "-" date_month "-" date_year

   date_year       ::= 4digit

   digit           ::= "0" / digit_nz

   digit_nz        ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
                       "9"

   header_fld_name ::= sstring

   mstring         ::= A MIME-2 encoded string surrounded by double
                       quotes

   newsgroup       ::= atom [ ".*"]

   newsgroups      ::= "*" / newsgroup_list

   newsgroup_list  ::= newsgroup [ ","  newsgroup_list]

   number          ::= 1*digit
                       ;; Unsigned 32-bit integer
                       ;; (0 <= n < 4,294,967,296)

   nz_number       ::= digit_nz *digit
                       ;; Non-zero unsigned 32-bit integer
                       ;; (0 < n < 4,294,967,296)

   QUOTED_CHAR     ::= <any TEXT_CHAR except quoted_specials> /
                       "\" quoted_specials

   quoted_specials ::= <"> / "\"

   range           ::= nz_number / nz_number "-" [ nz_number ]
                       ;; Identifies a range of Articles.

   search          ::= "SEARCH" SPACE ["CHARSET" SPACE astring SPACE]
                       ["IN" SPACE newsgroups SPACE]
                       1#search_key
                       ;; [CHARSET] MUST be registered with IANA

   search_key      ::= "ALL" / "BODY" SPACE sstring / 
                       "FROM" SPACE sstring / "ON" SPACE date / 
                       "SINCE" SPACE date / "BEFORE" SPACE date / 
                       "SUBJECT" SPACE sstring / "TEXT" SPACE sstring / 
                       "HEADER" SPACE header_fld_name SPACE sstring /
                       "LARGER" SPACE number / "NOT" SPACE search_key /
                       "OR" SPACE search_key SPACE search_key /
                       "SENTBEFORE" SPACE date / "SENTON" SPACE date /
                       "SENTSINCE" SPACE date / "SMALLER" SPACE number /
                       range / "(" 1#search_key ")"

   SPACE           ::= 1*<ASCII SP, space, 0x20>

   sstring         ::= astring / mstring

   string          ::= <"> *QUOTED_CHAR <">

   TEXT_CHAR       ::= <any CHAR except CR and LF>
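
To make the grammar concrete, here is a small Python sketch that assembles a SEARCH command line.  The helper names are invented for illustration.  It assumes that search keys are separated by single spaces, as in the example in section 5.1, and it always emits strings in the quoted "string" form rather than as atoms.

```python
import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def quote(text):
    """Render text as a quoted string, escaping backslash and double quote."""
    if "\r" in text or "\n" in text:
        raise ValueError("CR and LF are not allowed in a quoted string")
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def format_date(day):
    """Render a date as date_day "-" date_month "-" date_year."""
    return "%d-%s-%04d" % (day.day, MONTHS[day.month - 1], day.year)


def build_search(keys, groups=None, charset=None):
    """Join preformatted search keys into one SEARCH command line.

    `groups` is an optional list of newsgroup names or patterns for the
    IN clause; `charset` is an optional registered character set name.
    """
    if not keys:
        raise ValueError("at least one search key is required")
    parts = ["SEARCH"]
    if charset:
        parts += ["CHARSET", charset]
    if groups:
        parts += ["IN", ",".join(groups)]
    parts += keys
    return " ".join(parts)
```

For instance, build_search(["FROM " + quote("Smith"), "SINCE " + format_date(datetime.date(1994, 2, 1))]) reproduces the command from the earlier example.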

5.2.  LIST SRCHFIELDS Command

Arguments: none

Responses: 224 data follows

The  LIST  SRCHFIELDS  command  returns  a  list of the fields that can be  
specified  in  full-text  search queries on the server.  The response is
a  list  of  searchable  fields,  one  per  line.  A "." on its own line  
terminates  the  list.   The  fields  are  either  newsgroup headers, or
non-header fields supported by the query syntax.

The three currently defined non-header fields are ":Body", ":Text",  and
":Date".  ":Text"  means  all  the  searchable  text in the article, and 
indicates  that  the  "text"  keyword  is  supported in the search query 
language.  ":Body" means the body of the article, excluding the headers, 
and  indicates  that the "body" keyword is supported in the search query 
language.  ":Date"  means  the  date  at  which  an article arrived on a 
server  -  similar  to  the  date used in the NNTP NEWNEWS command - and
indicates that the "ON", "SINCE", and "BEFORE" keywords are supported in
the search query language.

The "date", "text" and "body" search query fields are optional, but  the  
server  must  indicate  whether  they  are  supported or not in the LIST 
SRCHFIELDS response.

   Example: C: LIST SRCHFIELDS
            S: 224 Data follows.
            S: From
            S: Date
            S: Subject
            S: :Text
            S: .
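
A client could turn this response into the set of search keys it may use.  The following Python sketch takes the lines after the 224 status line; the function name is hypothetical, and treating the pseudo-field names case-insensitively is an assumption, not something the draft states.

```python
# Search keys enabled by each non-header field, per the text above.
PSEUDO_FIELD_KEYS = {
    ":text": {"TEXT"},
    ":body": {"BODY"},
    ":date": {"ON", "SINCE", "BEFORE"},
}


def parse_srchfields(lines):
    """Return (header_fields, enabled_keys) from a LIST SRCHFIELDS body.

    `lines` holds the response lines after the status line.  Parsing
    stops at the terminating "." line.
    """
    headers = []
    keys = set()
    for line in lines:
        field = line.strip()
        if field == ".":
            break
        if not field:
            continue
        if field.startswith(":"):
            # Unknown pseudo-fields are ignored rather than rejected.
            keys |= PSEUDO_FIELD_KEYS.get(field.lower(), set())
        else:
            headers.append(field)
    return headers, keys
```

For the example response, the searchable headers are From, Date and Subject, and only the TEXT key is enabled; ON, SINCE and BEFORE are not, because ":Date" is absent.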

5.3.  LIST SEARCHABLE Command

Arguments: none

Responses: 224 Data Follows

The LIST SEARCHABLE command returns a list of strings that define which 
newsgroups are being indexed by the  news server and are thus available 
for  searching.  In  addition, the character sets allowed for each group 
are returned.

When newsgroups are indexed, the server returns 224,  followed  by  each
portion  of the tree that is indexed.  If all groups are indexed, a line
with "*" is returned.  If only some parts of the newsgroup hierarchy are
indexed, they are identified in the form <indexed-hierarchy>.*.  Clients
should not assume that these will always be top  level  hierarchies.   A
"." on its own line terminates the list.

The character sets allowed in full-text searches for each entry are also
returned.   The  character sets are identified by the name as defined in
[MIME-1].

   Example: C: LIST SEARCHABLE
            S: 224 Data follows.
            S: alt.* US-ASCII
            S: comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2
            S: mcom.* ISO-8859-1
            S: .
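
A sketch, in Python, of how a client might use this response to decide whether a given group is searchable and in which character sets.  The function names are hypothetical.  It assumes that the first matching entry wins and that an entry such as comp.lang.* covers groups below comp.lang but not a group named comp.lang itself; the draft does not settle either point.

```python
def parse_searchable(lines):
    """Return a list of (pattern, charsets) from a LIST SEARCHABLE body.

    `lines` holds the response lines after the 224 status line.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if line == ".":
            break
        if not line:
            continue
        pattern, *charsets = line.split()
        entries.append((pattern, charsets))
    return entries


def charsets_for(group, entries):
    """Return the charsets of the first entry covering `group`, else None."""
    for pattern, charsets in entries:
        if pattern == "*" or pattern == group:
            return charsets
        # "comp.lang.*" covers every name starting with "comp.lang."
        if pattern.endswith(".*") and group.startswith(pattern[:-1]):
            return charsets
    return None
```

With the example response, comp.lang.lisp is searchable in US-ASCII, ISO-8859-1 and ISO-8859-2, while a group outside alt, comp.lang and mcom is not searchable at all.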

5.4.  XPAT Command Enhancement

Arguments: header range|<message-id> pat [pat...]

Responses: <same as XPAT - see [NNTP-NEW]>

The XPAT command is enhanced in a simple way: The new value ":TEXT" will
be  supported  as  a header when invoking the command.  The :TEXT header
requests a full-text search of the body and all headers of the specified
articles.

When :TEXT is specified for the header, only a single "pat" is  allowed,
and  it  must  be  a  word or quoted string to search for, rather than a 
wildmat pattern as allowed otherwise.

If :TEXT isn't specified as the header, the response is the same  as  it
always  has  been for XPAT, with each result line containing the article
number and the value of the header that matched the pattern.

If the :TEXT header is specified, the constant string "TEXT" is returned
in place of the value of the header that matched the pattern.

  Example: C: XPAT :TEXT 1000-2000 searchtext
           S: 221 Header follows
           S: 1021 TEXT
           S: 1024 TEXT
           S: .
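
Since the second column carries no information when :TEXT is used, a client only needs the article numbers.  A minimal Python sketch, with a hypothetical function name, taking the lines after the 221 status line:

```python
def parse_xpat_text(lines):
    """Return the matching article numbers from an XPAT :TEXT response body."""
    numbers = []
    for line in lines:
        line = line.strip()
        if line == ".":
            break
        if not line:
            continue
        # Each line is "<article number> TEXT"; only the number is kept.
        number, _, _value = line.partition(" ")
        numbers.append(int(number))
    return numbers
```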

6.  Security Considerations

The search commands must be implemented in a way  that  does  not  allow
access  to  articles in newsgroups that a client is otherwise restricted
from reading due to access control rules.

7.  Bibliography 

[NNTP-977] 
     Network News Transfer Protocol.  B. Kantor, Phil Lapsley, Request 
     for Comment (RFC) 977, February 1986. 

[NNTP-NEW] 
     Network News Transfer Protocol.  S.  Barber INTERNET DRAFT, Sep- 
     tember 1996. 

[IMAP4] 
     IMAP4 INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4.  M Crispin, 
     Request for Comment (RFC) 1730, December 1994 


[MIME-1] 
     Borenstein N., and N.  Freed, MIME (Multipurpose Internet Mail 
     Extensions) Part One: Mechanisms for Specifying and Describing the 
     Format of Internet Message Bodies, RFC 1521, Bellcore, Innosoft, 
     September 1993. 

[MIME-2] 
     Moore, K., MIME (Multipurpose Internet Mail Extensions) Part Two: 
     Message Header Extensions for Non-ASCII Text, RFC 1522, University 
     of Tennessee, September 1993. 


8.  Author's Address 

   Nat Ballou 
   Microsoft 
   One Microsoft Way 
   Redmond, WA 98052 
   USA 

   Phone: +1 206-703-0574 
   Email: natba@microsoft.com 


                  This Internet Draft expires April xx, 1997. 

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>




* Re: Searching a news server
  2002-01-06 23:39 ` Russ Allbery
@ 2002-01-19 20:46   ` Lars Magne Ingebrigtsen
  2002-02-16 21:16     ` Russ Allbery
  0 siblings, 1 reply; 5+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-01-19 20:46 UTC


Russ Allbery <rra@stanford.edu> writes:

> I have a tree that contains a hack like this, but I haven't looked to see
> how ugly it was.  The following expired I-D may be of interest.

[...]

>                    NNTP Full-text Search Extension 

I think this sounds quite interesting.  If you could pull the hack out
of the tree, I could try running it on Quimby, and then implement the
commands in Gnus.  Since most people who read news from Quimby use
Gnus, that would give us a good test-bed for seeing whether this is
the right way to implement this...

-- 
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen




* Re: Searching a news server
  2002-01-19 20:46   ` Lars Magne Ingebrigtsen
@ 2002-02-16 21:16     ` Russ Allbery
  2002-02-17 10:03       ` Kai Großjohann
  0 siblings, 1 reply; 5+ messages in thread
From: Russ Allbery @ 2002-02-16 21:16 UTC


Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> Russ Allbery <rra@stanford.edu> writes:

>> I have a tree that contains a hack like this, but I haven't looked to see
>> how ugly it was.  The following expired I-D may be of interest.

> [...]

>>                    NNTP Full-text Search Extension 

> I think this sounds quite interesting.  If you could pull the hack out
> of the tree, I could try running it on Quimby, and then implement the
> commands in Gnus.  Since most people who read news from Quimby use
> Gnus, that would give us a good test-bed for seeing whether this is the
> right way to implement this...

I went back and looked at the tree that I actually had, and unfortunately
it's considerably more complex (it does stuff like set up search profiles
and then refile articles into various special groups, and it's very ugly
and not clearly the right way to do things).  So it looks like someone
trying to implement this would have to start pretty much from scratch.  :/

I'll note from experience with things like this that the hard part in
making full-text searching acceptably fast will be incremental indexing of
each article as it comes in.  Many of the existing search engines suck at
incremental indexing.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>




* Re: Searching a news server
  2002-02-16 21:16     ` Russ Allbery
@ 2002-02-17 10:03       ` Kai Großjohann
  0 siblings, 0 replies; 5+ messages in thread
From: Kai Großjohann @ 2002-02-17 10:03 UTC
  Cc: ding

Russ Allbery <rra@stanford.edu> writes:

> I'll note from experience with things like this that the hard part in
> making full-text searching acceptably fast will be incremental indexing of
> each article as it comes in.  Many of the existing search engines suck at
> incremental indexing.

Most of the weighting functions used seem to use normalization of
some kind, so the indexing weight for a given term in a given
document depends on the complete set of documents.  So adding a
document means that, strictly speaking, you have to reindex the whole
collection.  Hmpf.
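
A tiny illustration of the effect, in Python.  It uses plain inverse document frequency as the normalization; that choice is an assumption made for the example, and real engines use a variety of weighting schemes.

```python
import math


def idf(term, docs):
    """Inverse document frequency: log(N / df), or 0.0 if the term is unseen.

    `docs` is a list of sets of terms, one set per document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


docs = [{"emacs", "gnus"}, {"emacs", "mpeg"}]
before = idf("gnus", docs)   # log(2 / 1)

# Add a document that does not mention "gnus" at all.
docs.append({"inn", "nntp"})
after = idf("gnus", docs)    # log(3 / 1)
```

The weight of "gnus" in the first document changes even though that document was never touched, which is why a strictly correct index has to be recomputed.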

There is a hack for freeWAIS-sf which allows you to add N documents
incrementally, with skew of weights.  After more than N documents
have been added, it reindexes the whole collection.  Maybe that's a
suitable workaround.
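
The policy can be sketched like this in Python.  The class is purely illustrative: it is not how freeWAIS-sf is implemented, and the actual indexing work is left out.

```python
class BatchedIndex:
    """Add documents incrementally; rebuild once more than N are pending."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # the N from the description above
        self.documents = []
        self.pending = 0              # documents added since the last rebuild
        self.full_reindexes = 0

    def add(self, doc):
        # Cheap path: append with weights that may be slightly skewed.
        self.documents.append(doc)
        self.pending += 1
        if self.pending > self.batch_size:
            self._reindex_all()

    def _reindex_all(self):
        # Placeholder for recomputing the weights over self.documents.
        self.full_reindexes += 1
        self.pending = 0
```

With a batch size of 3, the first three additions take the cheap path and the fourth triggers one full rebuild.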

freeWAIS-sf seems to be a bear to build...

kai
-- 
~/.signature is: umop 3p!sdn    (Frank Nobis)



