* Searching a news server
@ 2002-01-06 19:24 Lars Magne Ingebrigtsen
2002-01-06 23:39 ` Russ Allbery
0 siblings, 1 reply; 5+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-01-06 19:24 UTC (permalink / raw)
(I think the ding mailing list must be backlogged or something,
because it's five hours since the previous message to the list was
delivered here...)
I've now written scripts for swishing over the spool and searching the
indexes. I've done it in a fairly straightforward manner -- the
indexing script goes over the spool and indexes any group that is
newer than the index for that group. (The index files are kept in a
separate spool that apes the main spool.)
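The "newer than the index" check described above could look roughly like the sketch below. This is not the actual script: it assumes a flat layout with one directory per group under the spool root and one index file of the same name under the index root, and it uses directory modification times as the freshness signal. The function name and both path arguments are hypothetical.

```python
import os

def groups_needing_reindex(spool_root, index_root):
    """Return the groups whose spool directory is newer than their index.

    Assumed layout (hypothetical): spool_root/<group>/ holds the articles,
    index_root/<group> is the index file for that group.
    """
    stale = []
    for name in sorted(os.listdir(spool_root)):
        group_dir = os.path.join(spool_root, name)
        if not os.path.isdir(group_dir):
            continue
        index_path = os.path.join(index_root, name)
        # A group with no index at all always needs indexing.
        if not os.path.exists(index_path):
            stale.append(name)
        elif os.path.getmtime(group_dir) > os.path.getmtime(index_path):
            stale.append(name)
    return stale
```

A directory's modification time changes when files are added or removed, so this only works if each article is stored as its own file in the group directory.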
The searching function takes a group name, the requested number of
matches, and the search string in this format:
gnu.emacs.gnus:100:emacs sex mpeg
(which I think will be a pretty representative search string.)
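For illustration, a request in this group:count:query format could be split as in the sketch below. The names SearchRequest and parse_search_request are made up for this example, and it assumes that only the first two colons are separators, so the search string itself may contain colons.

```python
from typing import NamedTuple

class SearchRequest(NamedTuple):
    group: str        # newsgroup to search
    max_matches: int  # requested number of matches
    query: str        # free-text search string

def parse_search_request(line: str) -> SearchRequest:
    # Split on the first two colons only; the rest is the query.
    group, count, query = line.split(":", 2)
    return SearchRequest(group, int(count), query)

request = parse_search_request("gnu.emacs.gnus:100:emacs sex mpeg")
```

A line with fewer than two colons, or a non-numeric count, raises ValueError.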
Now, if someone would hack inn to take an XSEARCH command, then I
could have Gnus offer a command to use this...
--
(domestic pets only, the antidote for overdose, milk.)
larsi@gnus.org * Lars Magne Ingebrigtsen
* Re: Searching a news server
2002-01-06 19:24 Searching a news server Lars Magne Ingebrigtsen
@ 2002-01-06 23:39 ` Russ Allbery
2002-01-19 20:46 ` Lars Magne Ingebrigtsen
0 siblings, 1 reply; 5+ messages in thread
From: Russ Allbery @ 2002-01-06 23:39 UTC (permalink / raw)
Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> Now, if someone would hack inn to take an XSEARCH command, then I could
> have Gnus offer a command to use this...
I have a tree that contains a hack like this, but I haven't looked to see
how ugly it was. The following expired I-D may be of interest.
INTERNET-DRAFT N. Ballou (Microsoft)
Expires: December 1, 1997 B. Hernacki & B. Polk (Netscape)
<draft-ballou-nntpsrch-03.txt> May 1, 1997
NNTP Full-text Search Extension
1. Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and its
working groups. Note that other groups may also distribute working
documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).
2. Abstract
This document describes a set of enhancements to the Network News
Transport Protocol [NNTP-977] that allows full-text searching of news
articles in multiple newsgroups. The proposed SEARCH command supports
functionality similar to the [IMAP4] SEARCH command, minus user specific
search keys (i.e., ANSWERED, DRAFT, FLAGGED, KEYWORD, NEW, OLD, RECENT,
SEEN) and minus search keys based on headers that do not exist in news
(i.e., CC, BCC, TO).
The availability of the extensions described here will be advertised by
the server using the extension negotiation-mechanism described in the
new NNTP protocol specification currently being developed [NNTP-NEW].
3. Introduction
The NNTP SEARCH command is sent from the client to the server to specify
and initiate a full-text search on articles in one or more newsgroups.
The NNTP SEARCH command is a subset of the [IMAP4] SEARCH command, with
user property and mail-specific header search keys not present in NNTP
SEARCH. The result of an NNTP SEARCH is OVER data as specified in
[NNTP-NEW] for each article that satisfies the search criteria.
In addition, the XPAT command is extended so that it can be used to
full-text search articles within a single newsgroup. Both the headers
and the body of the articles are searched.
3.1. New and Enhanced NNTP Commands
There are four new NNTP commands, three new options to the existing LIST
command, and enhancements to one existing command.
* SEARCH
* LIST SRCHFIELDS
* LIST SEARCHABLE
* XPAT
The SEARCH command runs a one-time search, returning overview-like data.
The LIST SRCHFIELDS command returns the fields that the server allows in
full-text searches.
The LIST SEARCHABLE command allows the client to determine which
newsgroups are full-text searchable.
The XPAT command allows the pseudo-header ":TEXT". This specifies a
full-text (headers and body) search of the articles in a single
newsgroup.
4. Use of NNTP Extension Mechanism
The NNTP extension mechanism allows a server to describe its
capabilities. The following extensions are used to describe the capabilities
described in this document.
4.1. SEARCH Extension
The SEARCH extension means that the server supports the following
commands: SEARCH, LIST SEARCHABLE, LIST SRCHFIELDS.
4.2. XPATTEXT Extension
The XPATTEXT extension means that the server supports the :TEXT header
in the XPAT command, as described by this document.
5. Command Descriptions
5.1. SEARCH Command
Arguments: optional character set specification
optional newsgroup specification
searching criteria (one or more)
Responses: 224 overview information follows
412 no news group selected
462 error performing search
501 command syntax error
502 no permission
The SEARCH command searches the newsgroup for articles that match the
given searching criteria. Searching criteria consist of one or more
search keys. If there are articles that match the search criteria, the
server responds with code 224 and returns OVER data for each matching
article in a similar format as described in [NNTP-NEW]. The one change
from [NNTP-NEW] OVER format is to change the article number field
to a format that supports searches over multiple newsgroups. The article
ID field for SEARCH OVER data will use the format newsgroup:art-ID
rather than just an article number as defined in [NNTP-NEW].
A response of 421 indicates that there are no articles that match the
search criteria. A response of 501 indicates a syntax error in the
search criteria. A response of 502 indicates that the user does not
have permission to search one or more of the specified newsgroups. If
the search criteria did not specify a newsgroup, and there is no current
newsgroup (i.e., set using the NNTP GROUP command), then the server
returns the error code 412, indicating that no newsgroup has been
specified. A response of 462 indicates that the server encountered an
error when processing the search.
When multiple keys are specified, the result is the intersection (AND
function) of all the messages that match those keys. For example, the
criteria FROM "SMITH" SINCE 1-Feb-1994 refers to all articles from Smith
that were placed in the newsgroup since February 1, 1994. A search key
may also be a parenthesized list of one or more search keys (e.g. for
use with the OR and NOT keys).
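One way to picture the AND, OR and NOT semantics is to treat each search key as a predicate over an article and to combine the predicates. The sketch below does that for the FROM "SMITH" SINCE 1-Feb-1994 example. It is illustrative only: the article dictionaries, their "from" and "arrived" fields, and all function names are assumptions made for this example, not part of the draft.

```python
from datetime import date

def from_contains(text):
    # FROM <string>: case-insensitive substring match on the From field.
    needle = text.lower()
    return lambda art: needle in art["from"].lower()

def since(day):
    # SINCE <date>: arrival date within or later than the given date.
    return lambda art: art["arrived"] >= day

def all_of(*keys):
    # Several keys in a row: intersection (AND) of the matches.
    return lambda art: all(key(art) for key in keys)

def either(key1, key2):
    # OR <search-key1> <search-key2>
    return lambda art: key1(art) or key2(art)

def negate(key):
    # NOT <search-key>
    return lambda art: not key(art)

articles = [
    {"from": '"John Smith" <JSmith@xyz.com>', "arrived": date(1996, 11, 3)},
    {"from": '"John Smith" <JSmith@xyz.com>', "arrived": date(1993, 12, 31)},
    {"from": "someone@example.org", "arrived": date(1996, 11, 3)},
]

criteria = all_of(from_contains("SMITH"), since(date(1994, 2, 1)))
matches = [art for art in articles if criteria(art)]
```

Only the first article satisfies both keys; the second is too old and the third is from a different author.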
Server implementations MAY exclude [MIME-1] body parts with terminal
content types other than TEXT and MESSAGE from consideration in SEARCH
matching.
The optional character set specification consists of the word "CHARSET"
followed by a registered MIME character set. It indicates the character
set of the strings that appear in the search criteria. [MIME-2] strings
that appear in RFC 822/MIME message headers, and [MIME-1] content
transfer encodings, MUST be decoded before matching. Except for
US-ASCII, it is not required that any particular character set be
supported. If the server does not support the specified character set,
a 462 error code is returned.
The optional newsgroup specification consists of the word "IN" followed
by either a wildcard character "*" - indicating a search over all
newsgroups - or a list of newsgroup names separated by a comma. A
newsgroup name can end with the wildcard string ".*" indicating a search
over a sub-hierarchy of the newsgroup name space. If no newsgroup
specification is given, the search is over the current newsgroup. If
there is no current newsgroup, the server returns the 412 error code.
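A server or client could test a group name against such an IN specification roughly as follows. This is a sketch, not text from the draft: the function name is made up, and it assumes that "comp.lang.*" covers groups below comp.lang but not a group named exactly "comp.lang", which the draft does not spell out.

```python
def group_matches(spec: str, group: str) -> bool:
    """Check a group name against a comma-separated IN specification."""
    for pattern in spec.split(","):
        if pattern == "*":
            return True  # search over all newsgroups
        if pattern.endswith(".*"):
            # Keep the trailing dot so "comp.lang.*" does not match "comp.langx".
            if group.startswith(pattern[:-1]):
                return True
        elif pattern == group:
            return True
    return False
```

For example, group_matches("comp.lang.*,alt.test", "comp.lang.python") is true, while the same specification does not match "comp.langx".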
In all search keys that use strings, a message matches the key if the
string is a substring of the field. The matching is case-insensitive.
The ON, BEFORE, and SINCE search criteria use the same date as used in
the NNTP NEWNEWS command - the date the article arrived on the server.
A server indicates support for the ON, BEFORE, and SINCE search criteria
by listing :Date in the LIST SRCHFIELDS response.
The defined search keys are as follows. Refer to the Formal Syntax
section for the precise syntactic definitions of the arguments.
<message range> Articles with article numbers corresponding to the
specified range.
ALL All Articles in the current newsgroup; the default
initial key for ANDing.
BEFORE <date> Articles whose server arrival date is earlier than
the specified date.
BODY <string> Articles that contain the specified string in the
body of the message.
FROM <string> Articles that contain the specified string in the
article structure's FROM field.
HEADER <field-name> <string>
Articles that have a header with the specified
field-name (as defined in [RFC-822]) and that
contains the specified string in the [RFC-822]
field-body.
LARGER <n> Articles with a size larger than the specified
number of octets.
NOT <search-key>
Articles that do not match the specified search
key.
ON <date> Articles whose server arrival date is within the
specified date.
OR <search-key1> <search-key2>
Articles that match either search key.
SENTBEFORE <date>
Articles whose [RFC-822] Date: header is earlier
than the specified date.
SENTON <date> Articles whose [RFC-822] Date: header is within the
specified date.
SENTSINCE <date>
Articles whose [RFC-822] Date: header is within or
later than the specified date.
SINCE <date> Articles whose server arrival date is within or
later than the specified date.
SMALLER <n> Articles with a size smaller than the specified
number of octets.
SUBJECT <string>
Articles that contain the specified string in the
envelope structure's SUBJECT field.
TEXT <string> Articles that contain the specified string in the
header or body of the message.
Example: C: SEARCH FROM "Smith" SINCE 1-Feb-1994
S: 224 overview information follows
S: comp.object:573 \t RE: object-oriented langs \t \
"John Smith" <JSmith@xyz.com> \t Sun, 03 Nov 1996 \
14:25:05 -0800 \t <01cbc9d5f3c70$eab9a2cd@xyz.com> \
\t 4080 \t 33
S: .
Note: each field in OVER response is separated by a tab - shown as a
\t in the example above.
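A client reading such a response has to split the first field into newsgroup and article number. The sketch below shows one way to do it; the function name is hypothetical, and the remaining tab-separated fields are returned as a plain list rather than named, since the example above shows fewer fields than a regular OVER line.

```python
def parse_search_over_line(line: str):
    """Split one SEARCH result line into (newsgroup, article number, other fields)."""
    fields = [field.strip() for field in line.split("\t")]
    # The first field has the form newsgroup:art-ID; split at the last colon.
    group, _, number = fields[0].rpartition(":")
    if not group:
        raise ValueError("expected newsgroup:art-ID, got %r" % fields[0])
    return group, int(number), fields[1:]

line = ('comp.object:573\tRE: object-oriented langs\t'
        '"John Smith" <JSmith@xyz.com>\tSun, 03 Nov 1996 14:25:05 -0800\t'
        '<01cbc9d5f3c70$eab9a2cd@xyz.com>\t4080\t33')
group, number, rest = parse_search_over_line(line)
```

Here group is "comp.object", number is 573, and rest holds the subject, author, date, message ID and the two numeric fields.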
5.1.1. Search Formal Syntax
The search query syntax is derived from the search syntax defined for
the IMAP4 protocol. It is somewhat different because of the way
international character sets need to be encoded.
The following syntax specification uses the augmented Backus-Naur Form
(BNF) notation as specified in [RFC-822].
Except as noted otherwise, all alphabetic characters are case-
insensitive. The use of upper or lower case characters to define token
strings is for editorial clarity only. Implementations MUST accept
these strings in a case-insensitive fashion.
astring ::= atom / string
atom ::= 1*ATOM_CHAR
ATOM_CHAR ::= <any CHAR except atom_specials>
atom_specials ::= "(" / ")" / SPACE / CTL / "*" / quoted_specials
CHAR ::= <any 7-bit US-ASCII character except NUL,
0x01 - 0x7f>
CTL ::= <any ASCII control character and DEL,
0x00 - 0x1f, 0x7f>
date ::= date_text / <"> date_text <">
date_day ::= 1*2digit
;; Day of month
date_month ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
"Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"
date_text ::= date_day "-" date_month "-" date_year
date_year ::= 4digit
digit ::= "0" / digit_nz
digit_nz ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
"9"
header_fld_name ::= sstring
mstring ::= A MIME-2 encoded string surrounded by double
quotes
newsgroup ::= atom [ ".*"]
newsgroups ::= "*" / newsgroup_list
newsgroup_list ::= newsgroup [ "," newsgroup_list]
number ::= 1*digit
;; Unsigned 32-bit integer
;; (0 <= n < 4,294,967,296)
nz_number ::= digit_nz *digit
;; Non-zero unsigned 32-bit integer
;; (0 < n < 4,294,967,296)
QUOTED_CHAR ::= <any TEXT_CHAR except quoted_specials> /
"\" quoted_specials
quoted_specials ::= <"> / "\"
range ::= nz_number / nz_number "-" [ nz_number ]
;; Identifies a range of Articles.
search ::= "SEARCH" SPACE ["CHARSET" SPACE astring SPACE]
["IN" SPACE newsgroups SPACE]
1#search_key
;; [CHARSET] MUST be registered with IANA
search_key ::= "ALL" / "BODY" SPACE sstring /
"FROM" SPACE sstring / "ON" SPACE date /
"SINCE" SPACE date / "BEFORE" SPACE date /
"SUBJECT" SPACE sstring / "TEXT" SPACE sstring /
"HEADER" SPACE header_fld_name SPACE sstring /
"LARGER" SPACE number / "NOT" SPACE search_key /
"OR" SPACE search_key SPACE search_key /
"SENTBEFORE" SPACE date / "SENTON" SPACE date /
"SENTSINCE" SPACE date / "SMALLER" SPACE number /
range / "(" 1#search_key ")"
SPACE ::= 1*<ASCII SP, space, 0x20>
sstring ::= astring / mstring
string ::= <"> *QUOTED_CHAR <">
TEXT_CHAR ::= <any CHAR except CR and LF>
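As an illustration of the grammar, the date production (an optionally quoted date_day "-" date_month "-" date_year) could be parsed as in the sketch below. The function name is made up, and month names are matched case-insensitively in line with the note above the grammar.

```python
import re
from datetime import date

_MONTHS = {name: number for number, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}

# date_text ::= date_day "-" date_month "-" date_year
_DATE_TEXT = re.compile(r"^([0-9]{1,2})-([A-Za-z]{3})-([0-9]{4})$")

def parse_search_date(token: str) -> date:
    # date ::= date_text / <"> date_text <">
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        token = token[1:-1]
    match = _DATE_TEXT.match(token)
    if match is None:
        raise ValueError("not a valid search date: %r" % token)
    day, month_name, year = match.groups()
    month = _MONTHS.get(month_name.lower())
    if month is None:
        raise ValueError("unknown month: %r" % month_name)
    # date() itself rejects impossible days such as 31-Feb.
    return date(int(year), month, int(day))
```

Both parse_search_date("1-Feb-1994") and parse_search_date('"1-Feb-1994"') give the same date object.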
5.2. LIST SRCHFIELDS Command
Arguments: none
Responses: 224 data follows
The LIST SRCHFIELDS command returns a list of which fields can be
specified in full-text search queries on the server. The response is
a list of searchable fields, one per line. A "." on its own line
terminates the list. The fields are either newsgroup headers, or
non-header fields supported by the query syntax.
The three currently defined non-header fields are ":Body", ":Text", and
":Date". ":Text" means all the searchable text in the article, and
indicates that the "text" keyword is supported in the search query
language. ":Body" means the body of the article, excluding the headers,
and indicates that the "body" keyword is supported in the search query
language. ":Date" means the date at which an article arrived on a
server - similar to the date used in the NNTP NEWNEWS command - and
indicates that the "ON", "SINCE", and "BEFORE" keywords are supported in
the search query language.
The "date", "text" and "body" search query fields are optional, but the
server must indicate whether they are supported or not in the LIST
SRCHFIELDS response.
Example: C: LIST SRCHFIELDS
S: 224 Data follows.
S: From
S: Date
S: Subject
S: :Text
S: .
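A client could use this response to decide which search keys to offer. The sketch below takes the lines that follow the 224 status line and derives the optional keys from the non-header fields. The function name is hypothetical, and comparing field names case-insensitively is an assumption.

```python
def supported_search_keys(lines):
    """Return (fields, keys) from the body of a LIST SRCHFIELDS response."""
    fields = set()
    for line in lines:
        line = line.strip()
        if line == ".":  # a "." on its own line terminates the list
            break
        if line:
            fields.add(line.lower())
    keys = set()
    if ":text" in fields:
        keys.add("TEXT")
    if ":body" in fields:
        keys.add("BODY")
    if ":date" in fields:
        keys.update({"ON", "SINCE", "BEFORE"})
    return fields, keys

fields, keys = supported_search_keys(["From", "Date", "Subject", ":Text", "."])
```

For the example response only TEXT is available: the plain "Date" entry is the header, not the ":Date" pseudo-field, so ON, SINCE and BEFORE are not advertised.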
5.3. LIST SEARCHABLE Command
Arguments: none
Responses: 224 Data Follows
The LIST SEARCHABLE command returns a list of strings that define which
newsgroups are being indexed by the news server and are thus available
for searching. In addition, the character sets allowed for each group
are returned.
When there are newsgroups indexed it will return 224, followed by each
portion of the tree that is indexed. If all groups are indexed, a line
with "*" is returned. If only some parts of the newsgroup hierarchy are
indexed, they are identified in the form <indexed-hierarchy>.*. Clients
should not assume that these will always be top level hierarchies. A
"." on its own line terminates the list.
The character sets allowed in full-text searches for each entry are also
returned. The character sets are identified by the name as defined in
[MIME-1].
Example: C: LIST SEARCHABLE
S: 224 Data follows.
S: alt.* US-ASCII
S: comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2
S: mcom.* ISO-8859-1
S: .
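The response body could be turned into a mapping from hierarchy pattern to allowed character sets, as in the sketch below. The function name is made up, and it assumes the pattern and the character sets on each line are separated by whitespace, as in the example.

```python
def parse_searchable(lines):
    """Map each indexed hierarchy pattern to its list of character sets."""
    result = {}
    for line in lines:
        line = line.strip()
        if line == ".":  # end of list
            break
        if not line:
            continue
        pattern, *charsets = line.split()
        result[pattern] = charsets
    return result

searchable = parse_searchable([
    "alt.* US-ASCII",
    "comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2",
    "mcom.* ISO-8859-1",
    ".",
])
```

With this result, searchable["comp.lang.*"] lists the three character sets a client may name in a CHARSET specification for that hierarchy.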
5.4. XPAT Command Enhancement
Arguments: header range|<message-id> pat [pat...]
Responses: <same as XPAT - see [NNTP-NEW]>
The XPAT command is enhanced in a simple way: The new value ":TEXT" will
be supported as a header when invoking the command. The :TEXT header
requests a full-text search of the body and all headers of the specified
articles.
When :TEXT is specified for the header, only a single "pat" is allowed,
and it must be a word or quoted string to search for, rather than a
wildmat pattern as allowed otherwise.
If :TEXT isn't specified as the header, the response is the same as it
always has been for XPAT, with each result line containing the article
number and the value of the header that matched the pattern.
If the :TEXT header is specified, the constant string "TEXT" is returned
in place of the value of the header that matched the pattern.
Example: C: XPAT :TEXT 1000-2000 searchtext
S: 221 Header follows
S: 1021 TEXT
S: 1024 TEXT
S: .
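Since the value column is always the constant "TEXT" for a :TEXT search, a client only needs the article numbers. A minimal sketch, with a hypothetical function name, that reads the lines following the 221 status line:

```python
def parse_xpat_text_response(lines):
    """Return the article numbers from the body of an XPAT :TEXT response."""
    numbers = []
    for line in lines:
        line = line.strip()
        if line == ".":  # end of list
            break
        if not line:
            continue
        # Each line is "<article number> TEXT"; only the number matters.
        number, _, _value = line.partition(" ")
        numbers.append(int(number))
    return numbers

hits = parse_xpat_text_response(["1021 TEXT", "1024 TEXT", "."])
```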
6. Security Considerations
The search commands must be implemented in a way that does not allow
access to articles in newsgroups that a client is otherwise restricted
from reading due to access control rules.
7. Bibliography
[NNTP-977]
Network News Transfer Protocol. B. Kantor, Phil Lapsley, Request
for Comment (RFC) 977, February 1986.
[NNTP-NEW]
Network News Transfer Protocol. S. Barber, INTERNET DRAFT,
September 1996.
[IMAP4]
IMAP4 INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4. M. Crispin,
Request for Comment (RFC) 1730, December 1994.
[MIME-1]
Borenstein N., and N. Freed, MIME (Multipurpose Internet Mail
Extensions) Part One: Mechanisms for Specifying and Describing the
Format of Internet Message Bodies, RFC 1521, Bellcore, Innosoft,
September 1993.
[MIME-2]
Moore, K., MIME (Multipurpose Internet Mail Extensions) Part Two:
Message Header Extensions for Non-ASCII Text, RFC 1522, University
of Tennessee, September 1993.
8. Author's Address
Nat Ballou
Microsoft
One Microsoft Way
Redmond, WA 98052
USA
Phone: +1 206-703-0574
Email: natba@microsoft.com
This Internet Draft expires April xx, 1997.
--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
* Re: Searching a news server
2002-01-06 23:39 ` Russ Allbery
@ 2002-01-19 20:46 ` Lars Magne Ingebrigtsen
2002-02-16 21:16 ` Russ Allbery
0 siblings, 1 reply; 5+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-01-19 20:46 UTC (permalink / raw)
Russ Allbery <rra@stanford.edu> writes:
> I have a tree that contains a hack like this, but I haven't looked to see
> how ugly it was. The following expired I-D may be of interest.
[...]
> NNTP Full-text Search Extension
I think this sounds quite interesting. If you could pull the hack out
of the tree, I could try running it on Quimby, and then implement the
commands in Gnus. Since most people who read news from Quimby use
Gnus, that would give us a good test-bed for seeing whether this is
the right way to implement this...
--
(domestic pets only, the antidote for overdose, milk.)
larsi@gnus.org * Lars Magne Ingebrigtsen
* Re: Searching a news server
2002-01-19 20:46 ` Lars Magne Ingebrigtsen
@ 2002-02-16 21:16 ` Russ Allbery
2002-02-17 10:03 ` Kai Großjohann
0 siblings, 1 reply; 5+ messages in thread
From: Russ Allbery @ 2002-02-16 21:16 UTC (permalink / raw)
Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> Russ Allbery <rra@stanford.edu> writes:
>> I have a tree that contains a hack like this, but I haven't looked to see
>> how ugly it was. The following expired I-D may be of interest.
> [...]
>> NNTP Full-text Search Extension
> I think this sounds quite interesting. If you could pull the hack out
> of the tree, I could try running it on Quimby, and then implement the
> commands in Gnus. Since most people who read news from Quimby use
> Gnus, that would give us a good test-bed for seeing whether this is the
> right way to implement this...
I went back and looked at the tree that I actually had, and unfortunately
it's considerably more complex (it does stuff like set up search profiles
and then refile articles into various special groups, and it's very ugly
and not clearly the right way to do things). So it looks like someone
trying to implement this would have to start pretty much from scratch. :/
I'll note from experience with things like this that the hard part in
making full-text searching acceptably fast will be incremental indexing of
each article as it comes in. Many of the existing search engines suck at
incremental indexing.
--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
* Re: Searching a news server
2002-02-16 21:16 ` Russ Allbery
@ 2002-02-17 10:03 ` Kai Großjohann
0 siblings, 0 replies; 5+ messages in thread
From: Kai Großjohann @ 2002-02-17 10:03 UTC (permalink / raw)
Cc: ding
Russ Allbery <rra@stanford.edu> writes:
> I'll note from experience with things like this that the hard part in
> making full-text searching acceptably fast will be incremental indexing of
> each article as it comes in. Many of the existing search engines suck at
> incremental indexing.
Most of the weighting functions used seem to use normalization of
some kind, so the indexing weight for a given term in a given
document depends on the complete set of documents. So adding a
document means that, strictly speaking, you have to reindex the whole
collection. Hmpf.
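To make the dependency concrete, here is a small sketch using plain TF-IDF as a stand-in for "normalization of some kind". That is an assumption; the engines discussed here may use different weighting functions. Documents are lists of tokens, and the function names are made up.

```python
import math

def idf(term, docs):
    # Inverse document frequency: depends on the whole collection.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def weight(term, doc, docs):
    # Term frequency in this document times the collection-wide IDF.
    return doc.count(term) * idf(term, docs)

docs = [["emacs", "gnus"], ["emacs", "mail"]]
before = weight("gnus", docs[0], docs)   # log(2/1)

docs.append(["gnus", "news"])            # a new article arrives
after = weight("gnus", docs[0], docs)    # log(3/2)
```

The first document did not change, yet its weight for "gnus" dropped from log 2 to log 1.5 once a new document was added, which is why a strictly correct index needs recomputing.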
There is a hack for freeWAIS-sf which allows you to add N documents
incrementally, with skew of weights. After more than N documents
have been added, it reindexes the whole collection. Maybe that's a
suitable workaround.
freeWAIS-sf seems to be a bear to build...
kai
--
~/.signature is: umop 3p!sdn (Frank Nobis)