adaptive word scoring

Gnus development mailing list

* adaptive word scoring
@ 1996-11-29  5:25 Felix Lee
  1996-11-29  8:09 ` Kai Grossjohann
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Felix Lee @ 1996-11-29  5:25 UTC (permalink / raw)


so after using adaptive word scoring for a while, I've
decided that it's mostly useless.

say you're an avid fan of alt.sex.pictures.emacs.  the word
"gif" is fairly common and mostly neutral: you can't tell if
an article is interesting based on the word "gif".

however, adaptive scoring treats "gif" as significant in an
odd way.  if you kill a massive series of "vi pinup gif"s,
then adaptive scoring is going to reduce the score of "gif"
by an amount proportional to the number of articles you've
killed.  this significantly affects the score of those
really sexy emacs gifs.

ok, you could add "gif" to the ignored-word list, but this
is just one instance of a more general problem.

my current thoughts are:

- adaptive scoring should try to discover _useful_
  discriminants by comparing interesting v. uninteresting
  articles.  the ignored-word list should be unnecessary.

- rather than adjusting score by N for every article marked,
  marked articles should be assigned a score target, and
  adaptive-scoring elements should be adjusted to try to hit
  the target.

comments?  I'm not sure how to implement this, yet.
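One possible reading of the score-target idea, as a rough sketch only (this is not how Gnus works; the function name, the rate of 0.5 and the target of -50 are all made up for illustration): each marked article pulls the sum of its word scores a fraction of the way toward the target, so repeated kills converge on the target instead of growing without bound.

```python
def nudge_toward_target(word_scores, words, target, rate=0.5):
    """Move the scores of WORDS so that their sum gets closer to TARGET.

    word_scores is modified in place and returned.
    """
    words = list(dict.fromkeys(words))  # distinct words, order preserved
    if not words:
        return word_scores
    current = sum(word_scores.get(w, 0.0) for w in words)
    # Spread a fraction of the remaining error evenly over the words.
    step = rate * (target - current) / len(words)
    for w in words:
        word_scores[w] = word_scores.get(w, 0.0) + step
    return word_scores


scores = {}
# Killing the same kind of subject 100 times does not push the scores
# any further than the target asks for.
for _ in range(100):
    nudge_toward_target(scores, ["vi", "pinup", "gif"], target=-50)
```

With this rule the total for "vi pinup gif" settles near -50 however many such articles are killed, which is the bounded behaviour the second bullet asks for.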
--



* Re: adaptive word scoring
  1996-11-29  5:25 adaptive word scoring Felix Lee
@ 1996-11-29  8:09 ` Kai Grossjohann
  1996-11-29 22:48   ` Felix Lee
  1996-11-29 15:45 ` Jan Vroonhof
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Kai Grossjohann @ 1996-11-29  8:09 UTC (permalink / raw)
  Cc: ding

>>>>> Felix Lee writes:

  Felix> say you're an avid fan of alt.sex.pictures.emacs.  the word
  Felix> "gif" is fairly common and mostly neutral: you can't tell if
  Felix> an article is interesting based on the word "gif".

I've never looked at word scoring, but from your description it seems
that it counts the number of occurrences.  Information Retrieval
people call this "tf" -- term frequency.

Now, the word "gif" is non-discriminative because it occurs in so many
articles -- "df" -- document frequency.

IR people deal with this problem by using tf*idf (i == inverse);
actually it's log(n/N) where n is the term frequency and N the
document frequency.

I have no idea, though, how to estimate the idf for a newsgroup.  Any
ideas, anyone?  Maybe one could look at the last 100 articles and use
the percentage of articles which have the term?
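A minimal sketch of that estimate, assuming the common textbook form idf = log(N / df), where N is the number of subject lines in the sample and df is the number of them that contain the word (the symbols are used differently in the description above). The sample subjects and function names are invented for illustration.

```python
import math
from collections import Counter


def document_frequencies(subjects):
    """Count, for each word, how many subject lines contain it at least once."""
    df = Counter()
    for subject in subjects:
        df.update(set(subject.lower().split()))
    return df


def idf(word, df, n_docs):
    """Inverse document frequency log(N / df); unseen words are treated as df = 1."""
    return math.log(n_docs / max(df.get(word, 0), 1))


# Stand-in for "the last 100 articles": a handful of recent subject lines.
recent = ["vi pinup gif", "emacs pinup gif", "vi macro gif", "emacs lisp gif"]
df = document_frequencies(recent)
weights = {w: idf(w, df, len(recent)) for w in df}
```

A word that appears in every subject, like "gif" here, gets a weight of zero, so it stops influencing the score without needing an entry in the ignored-word list.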

Sorry, couldn't resist...
kai
-- 
I wonder why nobody don't like me,
or is it de fact dat I'm ugly? -- Harry Belafonte


* Re: adaptive word scoring
  1996-11-29  8:09 ` Kai Grossjohann
@ 1996-11-29 22:48   ` Felix Lee
  1996-11-30 13:18     ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 30+ messages in thread
From: Felix Lee @ 1996-11-29 22:48 UTC (permalink / raw)


Kai Grossjohann:
> IR people deal with this problem by using tf*idf (i == inverse);
> actually it's log(n/N) where n is the term frequency and N the
> document frequency.

hmm.  something like that might help.  lemme think about it.

but offhand, it's still going to overemphasize null words
(like "version") that are medium-low frequency.

and there's still the problem that adaptive scoring in
general tends to let scores grow without bound in a
meaningless way.

> I have no idea, though, how to estimate the idf for a newsgroup.  Any

just count frequency over subjects being scored.  it's
accurate enough for this purpose.
--



* Re: adaptive word scoring
  1996-11-29 22:48   ` Felix Lee
@ 1996-11-30 13:18     ` Lars Magne Ingebrigtsen
  1996-12-01  8:39       ` Felix Lee
  0 siblings, 1 reply; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-11-30 13:18 UTC (permalink / raw)

Felix Lee <flee@teleport.com> writes:

> Kai Grossjohann:
> > IR people deal with this problem by using tf*idf (i == inverse);
> > actually it's log(n/N) where n is the term frequency and N the
> > document frequency.
> 
> hmm.  something like that might help.  lemme think about it.

Yes, it sounds interesting.  Mail me a patch and I'll apply it.  :-)
(Gee, aren't I nice.)

> and there's still the problem that adaptive scoring in
> general tends to let scores grow without bound in a
> meaningless way.

You could start to decay the scores.  (Now that I think about it -- I
can't recall ever getting any bug reports on score decays, which means
that nobody's using it.  :-)  Set `gnus-decay-scores' to t and see
what happens.)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen


* Re: adaptive word scoring
  1996-11-30 13:18     ` Lars Magne Ingebrigtsen
@ 1996-12-01  8:39       ` Felix Lee
  0 siblings, 0 replies; 30+ messages in thread
From: Felix Lee @ 1996-12-01  8:39 UTC (permalink / raw)


Lars:
> You could start to decay the scores.  (Now that I think about it -- I

I do decay scores.  it doesn't really help much.
--



* Re: adaptive word scoring
  1996-11-29  5:25 adaptive word scoring Felix Lee
  1996-11-29  8:09 ` Kai Grossjohann
@ 1996-11-29 15:45 ` Jan Vroonhof
  1996-11-30  2:28   ` Felix Lee
  1996-12-02  9:37   ` Steinar Bang
  1996-12-02  9:40 ` Wesley.Hardaker
  1996-12-02 11:46 ` Hans de Graaff
  3 siblings, 2 replies; 30+ messages in thread
From: Jan Vroonhof @ 1996-11-29 15:45 UTC (permalink / raw)


Felix Lee <flee@teleport.com> writes:

> so after using adaptive word scoring for a while, I've
> decided that it's mostly useless.

I agree. I have turned it off. I still have to figure out a way to
remove all the word rules it produced while I was still reading it.

> say you're an avid fan of alt.sex.pictures.emacs.  the word
> "gif" is fairly common and mostly neutral: you can't tell if
> an article is interesting based on the word "gif".

Same for Linux in any linux group/list. TeX and LaTeX in c.t.t. etc.

Jan



* Re: adaptive word scoring
  1996-11-29 15:45 ` Jan Vroonhof
@ 1996-11-30  2:28   ` Felix Lee
  1996-12-02  9:37   ` Steinar Bang
  1 sibling, 0 replies; 30+ messages in thread
From: Felix Lee @ 1996-11-30  2:28 UTC (permalink / raw)


Jan Vroonhof:
> I agree. I have turned it off. I still have to figure out a way to
> remove all the word rules it produced while I was still reading it.

write a function :)

tested with red gnus 0.52; looks like 0.71 will work.
rampantly revises your ADAPT files.  use at own risk.


(require 'cl)
(require 'gnus-score)

(defun gnus-score-forget-words ()
  "Remove all adaptive word scores."
  (gnus-score-map-adaptive-files 'gnus-score-forget-words-in-file))

(defun gnus-score-forget-words-in-file (file)
  "Remove adaptive word scores in FILE."
  (gnus-score-load-file file)
  (let ((touched nil)
        ;; PREV starts at the ("subject" ...) cell itself, so (cdr prev)
        ;; is always the next entry to examine.
        (prev (assoc "subject" gnus-score-alist))
        next)
    (while (setq next (cdr prev))
      (if (not (eq (nth 3 (car next)) 'w))
          ;; not a word match: keep it and move on
          (setq prev next)
        ;; word match (type `w'): splice it out of the list
        (setcdr prev (cdr next))
        (setq touched t)))
    (if touched
        ;; flag the score alist as modified
        (gnus-score-set 'touched '(t))))
  (gnus-score-save)
  (message "done %s" file))

(defun gnus-score-map-adaptive-files (func)
  "Apply FUNC to all adaptive score files."
  ;; load the score-file list
  (gnus-score-score-files "")
  (let ((files (cdr gnus-score-file-list))
        ;; match only file names ending in the adaptive suffix
        (suffix (concat (regexp-quote gnus-adaptive-file-suffix) "$"))
        file)
    (while files
      (setq file (pop files))
      (if (string-match suffix file)
          (funcall func file)))))



* Re: adaptive word scoring
  1996-11-29 15:45 ` Jan Vroonhof
  1996-11-30  2:28   ` Felix Lee
@ 1996-12-02  9:37   ` Steinar Bang
  1 sibling, 0 replies; 30+ messages in thread
From: Steinar Bang @ 1996-12-02  9:37 UTC (permalink / raw)


>>>>> Jan Vroonhof <vroonhof@math.ethz.ch>:

> Felix Lee <flee@teleport.com> writes:
>> so after using adaptive word scoring for a while, I've
>> decided that it's mostly useless.

> I agree. I have turned it off. I still have to figure out a way to
> remove all the word rules it produced while I was still reading it.

Won't they expire over time?



* Re: adaptive word scoring
  1996-11-29  5:25 adaptive word scoring Felix Lee
  1996-11-29  8:09 ` Kai Grossjohann
  1996-11-29 15:45 ` Jan Vroonhof
@ 1996-12-02  9:40 ` Wesley.Hardaker
  1996-12-05 18:49   ` Lars Magne Ingebrigtsen
  1996-12-02 11:46 ` Hans de Graaff
  3 siblings, 1 reply; 30+ messages in thread
From: Wesley.Hardaker @ 1996-12-02  9:40 UTC (permalink / raw)

Felix Lee <flee@teleport.com> writes:

> so after using adaptive word scoring for a while, I've
> decided that it's mostly useless.

Yeah, I've found similar results.  The hard thing is that I think it
would be *really* nice to have it work if we could figure out a proper
method of scoring...

Maybe we could turn 'V w' (show word scoring) into an edit buffer where
clicking with the middle button would increase a word score and
shift-middle would lower it.  That way you could go in and muck with
the list on a per-group basis.  It would also be nice to add a
'state-toggle' button as well:  active, locked (i.e., don't adjust
again), incr-only, decr-only, ignore, etc???  (questionable usage).
That way you could fix the gif problem by simply 'V w', click state
button till it gets to 'ignore', C-c C-c and off you go?  No mucking
with the global word score list.  Heck!  Why not have the word list
show you statistics as well ('gif': -300, #read = 2, #catchup = 105,
etc...)  It should also display an icon when you have mail waiting.
Additionally, it should remember your birthday and sing to you when
the date was right....  uh...  oops....  Was that out loud?

Of course, this would require a bit more coding (heh heh)...

Wes


* Re: adaptive word scoring
  1996-12-02  9:40 ` Wesley.Hardaker
@ 1996-12-05 18:49   ` Lars Magne Ingebrigtsen
  1996-12-06  8:18     ` Wesley.Hardaker
  0 siblings, 1 reply; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-12-05 18:49 UTC (permalink / raw)


Wesley.Hardaker@sphys.unil.ch writes:

> Maybe we could turn 'V w' (show word scoring) into an edit buffer where
> clicking with the middle button would increase a word score and
> shift-middle would lower it.  That way you could go in and muck with
> the list on a per-group basis.  It would also be nice to add a
> 'state-toggle' button as well:  active, locked (i.e., don't adjust
> again), incr-only, decr-only, ignore, etc???  (questionable usage).
> That way you could fix the gif problem by simply 'V w', click state
> button till it gets to 'ignore', C-c C-c and off you go?  No mucking
> with the global word score list.  Heck!  Why not have the word list
> show you statistics as well ('gif': -300, #read = 2, #catchup = 105,
> etc...)  It should also display an icon when you have mail waiting.
> Additionally, it should remember your birthday and sing to you when
> the date was right....  uh...  oops....  Was that out loud?

Neat!  :-)

> Of course, this would require a bit more coding (heh heh)...

Nice of you to volunteer.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: adaptive word scoring
  1996-12-05 18:49   ` Lars Magne Ingebrigtsen
@ 1996-12-06  8:18     ` Wesley.Hardaker
  0 siblings, 0 replies; 30+ messages in thread
From: Wesley.Hardaker @ 1996-12-06  8:18 UTC (permalink / raw)


>>>>> "Lars" == Lars Magne Ingebrigtsen <larsi@ifi.uio.no> writes:

Wes> Wesley.Hardaker@sphys.unil.ch writes: Maybe we could turn 'V w'
Wes> (show word scoring) into an edit buffer where clicking with the
Wes> middle button would increase a word score and shift-middle would
Wes> lower it.

Lars> Neat!  :-)

Wes> Of course, this would require a bit more coding (heh heh)...

Lars> Nice of you to volunteer.  :-)

Uh, yeah right...  I'm afraid I'm a bit swamped with other projects
right now (hence no smiley docs yet).  However, I have thought about
it but it wouldn't be until 2Q97 probably...  The problem is that it's
most likely over my head as well, because putting in new tags for
'ignore' and similar would require a better knowledge of the scoring
code than I have...  Lots of reading!

Wes



* Re: adaptive word scoring
  1996-11-29  5:25 adaptive word scoring Felix Lee
                   ` (2 preceding siblings ...)
  1996-12-02  9:40 ` Wesley.Hardaker
@ 1996-12-02 11:46 ` Hans de Graaff
  1996-12-02 15:08   ` Robert Bihlmeyer
       [not found]   ` <vcn2vvixpz.fsf@totally-fudged-out-message-id>
  3 siblings, 2 replies; 30+ messages in thread
From: Hans de Graaff @ 1996-12-02 11:46 UTC (permalink / raw)

Felix Lee <flee@teleport.com> writes:

> so after using adaptive word scoring for a while, I've decided that
> it's mostly useless.

One thing that struck me is the defaults for word scoring. They cause
scores to increase or decrease at an incredible rate, making all
scores useless. I've changed the scores to -1 and 1, and this seems to
work much better. I also think that line scoring is more important
than word scoring (i.e. word scoring to extrapolate to subjects which
have not been scored yet), so setting the word scores to some small
number makes much more sense than the current defaults.

> say you're an avid fan of alt.sex.pictures.emacs.  the word "gif" is
> fairly common and mostly neutral: you can't tell if an article is
> interesting based on the word "gif".

Yes, this doesn't work well. I've noted this also in other groups
(e.g. comp.fonts), and have decided that in those cases it's no big
deal, because it means that more descriptive subjects get better
ratings.

Hans


* Re: adaptive word scoring
  1996-12-02 11:46 ` Hans de Graaff
@ 1996-12-02 15:08   ` Robert Bihlmeyer
  1996-12-05 18:50     ` Lars Magne Ingebrigtsen
       [not found]   ` <vcn2vvixpz.fsf@totally-fudged-out-message-id>
  1 sibling, 1 reply; 30+ messages in thread
From: Robert Bihlmeyer @ 1996-12-02 15:08 UTC (permalink / raw)

Hi,

>>>>> On 02 Dec 1996 12:46:00 +0100
>>>>> Hans de Graaff <J.J.deGraaff@twi.tudelft.nl> said:

 Hans> Felix Lee <flee@teleport.com> writes:
 >> so after using adaptive word scoring for a while, I've decided
 >> that it's mostly useless.

 Hans> One thing that struck me is the defaults for word scoring. They
 Hans> cause scores to increase or decrease at an incredible rate,
 Hans> making all scores useless. I've changed the scores to -1 and 1,
 Hans> and this seems to work much better. I also think that line
 Hans> scoring is more important than word scoring (i.e. word scoring
 Hans> to extrapolate to subjects which have not been scored yet), so
 Hans> setting the word scores to some small number makes much more
 Hans> sense than the current defaults.

(I think I've already suggested this once:) IMHO word-scoring should
derive its scores from the normal adaptive (line-)scores (by dividing
the line-score by the number of words in the subject). So if killed
articles would get -50 for their subject, words from the subject
should be scored with -50/(number of words) each (ignored words are
not included in this count). In this way, word-scoring mimics the
effects of line-scoring for single subjects, but has the intended
side-effects on others. Furthermore, the scores would be an order of
magnitude smaller than the line-scores, giving the latter priority.
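The proposal amounts to something like the sketch below. It assumes that repeated words in a subject are counted once and that words are split on whitespace; neither detail is spelled out above, and the function name is made up.

```python
def word_scores_from_line(subject, line_score, ignored=frozenset()):
    """Split line_score evenly across the non-ignored words of subject."""
    # dict.fromkeys keeps the first occurrence of each word, in order.
    words = [w for w in dict.fromkeys(subject.lower().split())
             if w not in ignored]
    if not words:
        return {}
    share = line_score / len(words)
    return {w: share for w in words}


# A killed subject worth -50, with "gif" on the ignored-word list.
example = word_scores_from_line("vi pinup gif", -50, ignored={"gif"})
```

Here `example` gives "vi" and "pinup" -25 each, so an identical subject still totals -50, while a subject sharing only one of the words gets half of that.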

	Robbe


* Re: adaptive word scoring
  1996-12-02 15:08   ` Robert Bihlmeyer
@ 1996-12-05 18:50     ` Lars Magne Ingebrigtsen
  1996-12-05 21:21       ` Sean Lynch
  0 siblings, 1 reply; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-12-05 18:50 UTC (permalink / raw)


Robert Bihlmeyer <e9426626@student.tuwien.ac.at> writes:

> (I think I've already suggested this once:) IMHO word-scoring should
> derive its scores from the normal adaptive (line-)scores (by dividing
> the line-score by the number of words in the subject). So if killed
> articles would get -50 for their subject, words from the subject
> should be scored with -50/(number of words) each (ignored words are
> not included in this count). In this way, word-scoring mimics the
> effects of line-scoring for single subjects, but has the intended
> side-effects on others. Furthermore, the scores would be an order of
> magnitude smaller than the line-scores, giving the latter priority.

I think that sounds reasonable...  Anybody have any thoughts on this?

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: adaptive word scoring
  1996-12-05 18:50     ` Lars Magne Ingebrigtsen
@ 1996-12-05 21:21       ` Sean Lynch
  1996-12-06 10:39         ` Lars Magne Ingebrigtsen
  1996-12-06 21:02         ` Janne Sinkkonen
  0 siblings, 2 replies; 30+ messages in thread
From: Sean Lynch @ 1996-12-05 21:21 UTC (permalink / raw)
  Cc: ding

[-- Attachment #1: Type: text/plain, Size: 2484 bytes --]

I remember reading earlier in this thread about the possibility of
rating words based on interestingness, and I think this is probably
the way to go.  The fundamental theorem of information theory tells us
that the value of any piece of information is inversely proportional
to its probability of occurrence.  Therefore, we should keep some sort
of history of the number of occurrences of each word in the adaptive
scoring criteria (i.e. the subject lines) and estimate the probability
of each word's occurrence, weighting the effect of each word on the
final score by the inverse of the probability.

Multiple occurrences of the same word in a given line probably should
not be counted, because I can't think of any situation where we'd want
to score an article differently just because a word occurred more than
once in the subject line.  However, I guess maybe people would want to
score an article lower if it had the word "free" more than once in the
subject line.

The final score given to a word would be, using the example Robert
gave, -50*(probability of the occurrence of this word)/(sum of
probabilities of all distinct words in this line)

The score of the word in the database would be adjusted by adding
(old score - new score)/c to it, where c is the speed of light.  C
could decrease over time so that scores would stabilize, though this
would cause scores to stop adapting eventually.  

Obviously, there would be some sort of thresholding function to drop
words with a large probability of occurrence.

>>>>> Sometime around 05 Dec 1996 19:50:56 +0100,
>>>>> in article <m2ral4rfjj.fsf@proletcult.slip.ifi.uio.no>,
>>>>> someone posing as Lars Magne Ingebrigtsen <larsi@ifi.uio.no> wrote:

Lars> Robert Bihlmeyer <e9426626@student.tuwien.ac.at> writes:
>> (I think I've already suggested this once:) IMHO word-scoring
>> should derive its scores from the normal adaptive (line-)scores
>> (by dividing the line-score by the number of words in the
>> subject). So if killed articles would get -50 for their subject,
>> words from the subject should be scored with -50/(number of words)
>> each (ignored words are not included in this count). In this way,
>> word-scoring mimics the effects of line-scoring for single
>> subjects, but has the intended side-effects on others. Furthermore,
>> the scores would be an order of magnitude smaller than the
>> line-scores, giving the latter priority.

Lars> I think that sounds reasonable...  Anybody have any thoughts on
Lars> this?

[-- Attachment #2: Type: text/plain, Size: 228 bytes --]

Sean Lynch,  Internex Network Operations                 <noc@internex.net>
Voice: +1 408 327 2200  Fax: +1 408 496 5484  <URL:http://www.internex.net>
Technical support: <support@internex.net> <URL:http://support.internex.net>


* Re: adaptive word scoring
  1996-12-05 21:21       ` Sean Lynch
@ 1996-12-06 10:39         ` Lars Magne Ingebrigtsen
  1996-12-08 22:19           ` Sean Lynch
  1996-12-06 21:02         ` Janne Sinkkonen
  1 sibling, 1 reply; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-12-06 10:39 UTC (permalink / raw)


Sean Lynch <seanl@Internex.NET> writes:

> I remember reading earlier in this thread about the possibility of
> rating words based on interestingness, and I think this is probably
> the way to go.  The fundamental theorem of information theory tells us
> that the value of any piece of information is inversely proportional
> to its probability of occurrence.  Therefore, we should keep some sort
> of history of the number of occurrences of each word in the adaptive
> scoring criteria (i.e. the subject lines) and estimate the probability
> of each word's occurrence, weighting the effect of each word on the
> final score by the inverse of the probability.

Would it suffice to calculate this on the fly (from the articles
currently in the summary buffer), or does this have to be stored in a
database? 

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: adaptive word scoring
  1996-12-06 10:39         ` Lars Magne Ingebrigtsen
@ 1996-12-08 22:19           ` Sean Lynch
  1996-12-11  0:44             ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Lynch @ 1996-12-08 22:19 UTC (permalink / raw)
  Cc: ding

[-- Attachment #1: Type: text/plain, Size: 1336 bytes --]

>>>>> Sometime around 06 Dec 1996 11:39:51 +0100,
>>>>> in article <m2ral40xe0.fsf@proletcult.slip.ifi.uio.no>,
>>>>> someone posing as Lars Magne Ingebrigtsen <larsi@ifi.uio.no> wrote:

Lars> Sean Lynch <seanl@Internex.NET> writes:
>> I remember reading earlier in this thread about the possibility of
>> rating words based on interestingness, and I think this is probably
>> the way to go.  The fundamental theorem of information theory tells
>> us that the value of any piece of information is inversely
>> proportional to its probability of occurrence.  Therefore, we
>> should keep some sort of history of the number of occurrences of
>> each word in the adaptive scoring criteria (i.e. the subject lines)
>> and estimate the probability of each word's occurrence, weighting
>> the effect of each word on the final score by the inverse of the
>> probability.

Lars> Would it suffice to calculate this on the fly (from the articles
Lars> currently in the summary buffer), or does this have to be stored
Lars> in a database?

If we only use the articles in the current summary buffer, our
estimates of word probabilities would vary wildly depending on the
number of articles, day of the week, direction of the wind, etc.  So I
think that we should probably keep the counts in some sort of
database, unless someone has a better idea.

[-- Attachment #2: Type: text/plain, Size: 228 bytes --]

Sean Lynch,  Internex Network Operations                 <noc@internex.net>
Voice: +1 408 327 2200  Fax: +1 408 496 5484  <URL:http://www.internex.net>
Technical support: <support@internex.net> <URL:http://support.internex.net>


* Re: adaptive word scoring
  1996-12-08 22:19           ` Sean Lynch
@ 1996-12-11  0:44             ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-12-11  0:44 UTC (permalink / raw)


Sean Lynch <seanl@Internex.NET> writes:

> If we only use the articles in the current summary buffer, our
> estimates of word probabilities would vary wildly depending on the
> number of articles, day of the week, direction of the wind, etc.  So I
> think that we should probably keep the counts in some sort of
> database, unless someone has a better idea.

Right.  Well, this is something that'll have to wait until Mamey
Sapote Gnus in any case, so instead of doing anything about it myself,
I'll just wait for somebody who knows something about probabilities
and that, like, *math* stuff to program this.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: adaptive word scoring
  1996-12-05 21:21       ` Sean Lynch
  1996-12-06 10:39         ` Lars Magne Ingebrigtsen
@ 1996-12-06 21:02         ` Janne Sinkkonen
  1996-12-08 22:48           ` Sean Lynch
  1 sibling, 1 reply; 30+ messages in thread
From: Janne Sinkkonen @ 1996-12-06 21:02 UTC (permalink / raw)
  Cc: Lars Magne Ingebrigtsen, ding

Sean Lynch <seanl@Internex.NET> writes:

> the way to go.  The fundamental theorem of information theory tells us
> that the value of any piece of information is inversely proportional
> to its probability of occurrence.

To the logarithm of the probability, actually. This holds as long as
the events are independent. The occurrence of words depend on the
context, but we get an approximation anyway.
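To see how much the logarithm changes the weighting, here is a small comparison. The probabilities 0.01 and 0.05 are example values only, and the function name is made up.

```python
import math


def surprisal_bits(p):
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)


rare, common = 0.01, 0.05

# Weighting by 1/p: the rare word counts five times as much.
inverse_ratio = (1 / rare) / (1 / common)

# Weighting by -log2(p): the rare word counts only about 1.5 times as much.
log_ratio = surprisal_bits(rare) / surprisal_bits(common)
```

The logarithmic weighting is much flatter, so a rare word no longer dominates a subject line the way it does under plain inverse-probability weighting.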

> The score of the word in the database would be adjusted by adding
> (old score - new score)/c to it, where c is the speed of light.

This makes sense (given c is in appropriate units).

> C could decrease over time so that scores would stabilize, though
> this would cause scores to stop adapting eventually.

I vote against decreasing C. Instead, it should be a constant small
value, say something between 0.05 and 0.001. Reading pattern changes
etc. - the scores should adapt all the time. 
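With a constant small c used as a multiplier, the update is an exponential moving average. The sketch below assumes the intent is to move the stored score toward the new one, so it takes the difference as new minus old (the formula quoted above writes it the other way round); the name `smooth` and the sample numbers are illustrative only.

```python
def smooth(old_score, new_score, c=0.05):
    """Move old_score a fraction c of the way toward new_score."""
    return old_score + c * (new_score - old_score)


# Feeding the same observation repeatedly: the stored score approaches
# it and then stays there, without growing past it.
score = 0.0
for _ in range(200):
    score = smooth(score, -10.0)
```

Because c never shrinks, a change in reading habits shows up in the scores after roughly 1/c observations, whether it happens in the first week or the fiftieth.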

> Obviously, there would be some sort of thresholding function to drop
> words with a large probability of occurrence.

And words with small probabilities should not be in the calculations
because the probability estimates are unstable.

-- 
Janne Sinkkonen      <janne@iki.fi>      <URL: http://www.iki.fi/~janne/ >


* Re: adaptive word scoring
  1996-12-06 21:02         ` Janne Sinkkonen
@ 1996-12-08 22:48           ` Sean Lynch
  1996-12-10 22:25             ` nnspool virtual server shows funny numbers of articles C. R. Oldham
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Lynch @ 1996-12-08 22:48 UTC (permalink / raw)
  Cc: Sean Lynch, Lars Magne Ingebrigtsen, ding

[-- Attachment #1: Type: text/plain, Size: 2866 bytes --]

>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article <oak9qvgze6.fsf@avocado.pc.helsinki.fi>,
>>>>> someone posing as Janne Sinkkonen <janne@avocado.pc.helsinki.fi> wrote:

Janne> Sean Lynch <seanl@Internex.NET> writes:
>> The score of the word in the database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written (old score -
new score) * c, but then changed my mind and changed it to /, but I
left the following paragraph the same.  It should have been "increase
c over time" if you're dividing by c.  Either way works, though, and I
think I agree that c should remain constant (large if we divide, small
if we multiply).

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used because they're the most "interesting" by
the definition we're using here.  The question then becomes how to
keep the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

-50*(probability of the occurrence of this word)/(sum of probabilities
of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of occurrence,
not vice-versa.  So the formula needs to be tweaked a bit, as follows,
given a line with a score of s, and distinct words w1 w2 w3 w4 w5 with
probabilities p1 p2 p3 p4 p5, the score allocated to each word would
be:

s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.

So, if word 1 had a 1% chance of occurring in a randomly selected
line, and each of the rest of the words each had a 5% chance, and the
score given to the current line is 50, the formula would come out to:

s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))

s_1 = 50 / (.01 * (100 + 20 + 20 + 20 + 20))

s_1 = 50 / (.01 * 180)

s_1 = 50 / 1.8

s_1 = 27.8

s_2 = 50 / (.05 * 180) = 5.6

27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew.

That's what I wanted to get the first time, but somehow it just didn't
come out that way.
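The allocation formula and the worked example above can be checked with a few lines. This is a sketch; the function name `allocate` is made up, and the probabilities are the ones from the example.

```python
def allocate(line_score, probabilities):
    """Split line_score across words in proportion to 1/p for each word."""
    inverse_sum = sum(1.0 / p for p in probabilities)
    return [line_score / (p * inverse_sum) for p in probabilities]


# One word with a 1% chance of occurring, four words with 5% each.
shares = allocate(50, [0.01, 0.05, 0.05, 0.05, 0.05])
rounded = [round(s, 1) for s in shares]  # [27.8, 5.6, 5.6, 5.6, 5.6]
```

Before rounding, the shares add up to the line score, so the 50.2 obtained by adding the rounded figures is purely a rounding effect.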

[-- Attachment #2: Type: text/plain, Size: 228 bytes --]

Sean Lynch,  Internex Network Operations                 <noc@internex.net>
Voice: +1 408 327 2200  Fax: +1 408 496 5484  <URL:http://www.internex.net>
Technical support: <support@internex.net> <URL:http://support.internex.net>


* nnspool virtual server shows funny numbers of articles.
  1996-12-08 22:48           ` Sean Lynch
@ 1996-12-10 22:25             ` C. R. Oldham
  1996-12-11  0:42               ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 30+ messages in thread
From: C. R. Oldham @ 1996-12-10 22:25 UTC (permalink / raw)


Greetings,

rgnus 0.74.  I've created a virtual server for reading my article cache as
per the instructions in the manual.  I get funny numbers of articles in
some of the groups:

K   5326: alt.comp.periphs.mainboard.asus
K      1: comp.dcom.modems
K   4107: comp.infosystems.www.servers.unix
K     49: comp.lang.perl.announce
K   8123: comp.os.ms-windows.nt.admin.misc
K      6: comp.os.ms-windows.nt.announce
K      1: comp.os.ms-windows.nt.pre-release
K      1: comp.os.ms-windows.nt.setup.hardware
K      1: comp.os.ms-windows.nt.software.compatibility
K      2: comp.security.announce
K      1: comp.security.unix
K   6722: comp.sys.ibm.pc.games.action
K   1132: gnu.emacs.gnus
K      1: gnu.emacs.sources

All of these groups have at most 5 articles in the cache.

Is this a bug or a feature?

Also, does it matter that all the groups seem to be marked as "Killed"?  I
can't set the level of the groups (nnspool doesn't support it?)

Is there a better backend to use for reading my cache?


--
| Charles R. (C. R.) Oldham     |         NCA Commission on Schools  |
| cro@nca.asu.edu               |  Arizona St. Univ., PO Box 873011  |
| V:602/965-8700 F:602/965-9423 |________      Tempe, AZ 85287-3011_ |
| "I like it!"--Citizen G'Kar, Babylon 5 | #include <disclaimer.h>X_>|



* Re: nnspool virtual server shows funny numbers of articles.
  1996-12-10 22:25             ` nnspool virtual server shows funny numbers of articles C. R. Oldham
@ 1996-12-11  0:42               ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-12-11  0:42 UTC (permalink / raw)


"C. R. Oldham" <cro@nca.asu.edu> writes:

> All of these groups have at most 5 articles in the cache.
> 
> Is this a bug or a feature?

The numbers displayed are MAX-NUM less MIN-NUM plus 1.  It's neither a
bug nor a feature -- it's just the way it is.
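
As a rough illustration of the arithmetic above, here is a minimal Emacs Lisp sketch. The function name `my-estimated-article-count' and the sample article numbers are made up for the example; only the formula (highest number minus lowest number plus one) comes from the reply.

```elisp
(defun my-estimated-article-count (article-numbers)
  "Return highest minus lowest plus one for ARTICLE-NUMBERS.
This mirrors the MAX-NUM - MIN-NUM + 1 estimate described above; it
does not count the articles that are actually present."
  (1+ (- (apply #'max article-numbers)
         (apply #'min article-numbers))))

;; Hypothetical cache directory holding only five articles whose
;; numbers are spread widely apart.
(my-estimated-article-count '(1 812 2040 4999 5326))
```

With these sample numbers the estimate is 5326 although the list has only five entries, which is the kind of gap reported in the question.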

> Also, does it matter that all the groups seem to be marked as
> "Killed"?

Nope.

>  I can't set the level of the groups (nnspool doesn't support it?)

You're not in a group buffer -- you're in a browse buffer.  It looks
similar, but it's something else.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



[parent not found: <vcn2vvixpz.fsf@totally-fudged-out-message-id>]

* Re: adaptive word scoring
       [not found]   ` <vcn2vvixpz.fsf@totally-fudged-out-message-id>
@ 1996-12-03 13:51     ` Holger Franz
  0 siblings, 0 replies; 30+ messages in thread
From: Holger Franz @ 1996-12-03 13:51 UTC (permalink / raw)



Just to toss in my two cents: How about adding the exponent feature
from the procmail scoring mechanism? It provides a flexible method to
keep scores from frequent matches low.

From `man procmailsc`:
---8<---
Weighted regular expression conditions
       The  first  time  the regular expression is found, it will
       add w to the score.  The second time it is found, w*x will
       be  added.   The  third  time  it  is found, w*x*x will be
       added.  The fourth time w*x*x*x will  be  added.   And  so
       forth.

       This can be described by the following concise formula:

                                   n
                   n   k-1        x - 1
              w * Sum x    = w * -------
                  k=1             x - 1

       It  represents the total added score for this condition if
       n matches are found.

       Note that the following case distinctions can be made:

       x=0     Only the first match  will  contribute  w  to  the
               score.  Any subsequent matches are ignored.

       x=1     Every  match  will  contribute  the  same w to the
               score.  The score grows linearly with  the  number
               of matches found.

       0<x<1   Every match will contribute less to the score than
               the previous one.  The score  will  asymptotically
               approach  a  certain  value (see the NOTES section
               below).

       1<x     Every match will contribute more to the score than
               the  previous  one.   The score will grow
               exponentially.

       x<0     Can be utilised to favour odd or  even  number  of
               matches.
---8<---

I think that could be implemented easily if one added to each rule a
counter that reflects how often the rule was already 'adapted'. There
is one major problem though: with an exponent 0<x<1 the first adaption
is likely to dominate over later adaptions. So maybe there will have
to be separate counters for positive and negative adaptions. If
positive and negative adaptions were tuned in such a way that they had
roughly the same asymptotic value, 'meaningless' words that are raised
and lowered randomly would gain a comparatively low net score.
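
To make the quoted formula concrete, here is a minimal Emacs Lisp sketch of the weighted sum. The function name `my-weighted-score' is hypothetical and nothing like it is claimed to exist in Gnus or procmail; it only adds up the per-match contributions w, w*x, w*x*x, ... described in the man page excerpt.

```elisp
(defun my-weighted-score (w x n)
  "Return the total score after N matches with weight W and exponent X.
The k-th match contributes W * X^(k-1), as in the procmailsc excerpt."
  (let ((total 0.0)
        (term (float w)))
    (dotimes (_ n)
      (setq total (+ total term)   ; add this match's contribution
            term (* term x)))      ; the next match is scaled by X
    total))

;; x = 1: every match counts the same, so the score grows linearly.
(my-weighted-score 10 1 3)
;; x = 0: only the first match counts.
(my-weighted-score 10 0 5)
;; 0 < x < 1: each match counts less than the previous one.
(my-weighted-score 10 0.5 3)
```

For 0<x<1 the sum stays below w/(1-x), so with w=10 and x=0.5 no number of adaptations pushes the total past 20. Keeping one such counter for positive and one for negative adaptations, as suggested above, would bound both directions.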

All this sounds very alpha, but may be worth some more thought.

Holger 

-- 
Holger Franz <hfranz@physik.rwth-aachen.de>                             

Caution: feeding Gnus to your XEmacs will make it fat.



* Adaptive word scoring
@ 1996-10-31  1:34 Sten Drescher
  1996-11-05 15:51 ` Robert Bihlmeyer
  1996-11-05 21:25 ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 30+ messages in thread
From: Sten Drescher @ 1996-10-31  1:34 UTC (permalink / raw)



	I've been using adaptive word scoring, and I've noticed two
problems:

	1) You can't do a setq or defvar of the
gnus-default-adaptive-word-score-alist in .gnus, using the mark names
as shown on the info page, because they haven't been defined yet.  You
can use the numeric values resulting from setting the variable after
starting Gnus, but I'm not really comfortable doing that.

	2) All other adaptive scoring stops when you do word scoring.
Yes, I want the words adapted, but I still want the authors and
followups adapted as well.  Is there any way this can be done?


-- 
+----------------------  Tivoli Customer Support  ----------------------+
|   Sten Drescher                     Tivoli Systems, Inc               |
|   email: sten.drescher@tivoli.com   9442 Capital of Texas Hwy North   |
|   phone: (512) 794-9070             Arboretum Plaza One, Suite 500    |
|   fax  : (512) 345-2784             Austin, Texas 78759               |
+-----------------------------------------------------------------------+



* Re: Adaptive word scoring
  1996-10-31  1:34 Adaptive " Sten Drescher
@ 1996-11-05 15:51 ` Robert Bihlmeyer
  1996-11-05 17:16   ` Per Abrahamsen
  1996-11-05 21:24   ` Lars Magne Ingebrigtsen
  1996-11-05 21:25 ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 30+ messages in thread
From: Robert Bihlmeyer @ 1996-11-05 15:51 UTC (permalink / raw)
  Cc: ding

Hi,

>>>>> On 30 Oct 1996 19:34:20 -0600
>>>>> Sten Drescher <sten.drescher@tivoli.com> said:

 Sten> 	I've been using adaptive word scoring, and I've noticed two
 Sten> problems:

 Sten> 	1) You can't do a setq or defvar of the
 Sten> gnus-default-adaptive-word-score-alist in .gnus, using the mark
 Sten> names as shown on the info page, because they haven't been
 Sten> defined yet.

You can do a "require" to load the file that has the marks defined,
but this is still a kludge.

Sometimes I wish that there were a load hook for every feature.

 Sten> 	2) All other adaptive scoring stops when you do word scoring.
 Sten> Yes, I want the words adapted, but I still want the authors and
 Sten> followups adapted as well. Is there any way this can be done?

(setq gnus-use-adaptive-scoring '(word line))

	Robbe


* Re: Adaptive word scoring
  1996-11-05 15:51 ` Robert Bihlmeyer
@ 1996-11-05 17:16   ` Per Abrahamsen
  1996-11-05 21:24   ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 30+ messages in thread
From: Per Abrahamsen @ 1996-11-05 17:16 UTC (permalink / raw)


Robert Bihlmeyer <e9426626@student.tuwien.ac.at> writes:

> Sometimes I wish that there were a load hook for every feature.

There is for every file.  It is called `eval-after-load'.  

It is possible to confuse it though, so XEmacs has not enabled it by
default.
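
A minimal sketch of how `eval-after-load' could be used for the problem discussed in this thread. It assumes the mark variables are defined in a file named gnus-sum, which the thread does not state, and the score values are placeholders rather than recommended settings.

```elisp
;; Assumption: the file that defines `gnus-read-mark' and
;; `gnus-killed-mark' is gnus-sum; adjust the name if your Gnus
;; version keeps them elsewhere.
(eval-after-load "gnus-sum"
  '(setq gnus-default-adaptive-word-score-alist
         `((,gnus-read-mark . 30)       ; placeholder score
           (,gnus-killed-mark . -20)))) ; placeholder score
```

The quoted form is evaluated only once the named file has been loaded, so the mark variables are bound by the time the backquote is expanded.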



* Re: Adaptive word scoring
  1996-11-05 15:51 ` Robert Bihlmeyer
  1996-11-05 17:16   ` Per Abrahamsen
@ 1996-11-05 21:24   ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-11-05 21:24 UTC (permalink / raw)


Robert Bihlmeyer <e9426626@student.tuwien.ac.at> writes:

> Sometimes I wish that there were a load hook for every feature.

I wish that if a variable was mentioned (say, `gnus-mark-whatever'),
then the package the variable was defined in would be autoloaded.  

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: Adaptive word scoring
  1996-10-31  1:34 Adaptive " Sten Drescher
  1996-11-05 15:51 ` Robert Bihlmeyer
@ 1996-11-05 21:25 ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-11-05 21:25 UTC (permalink / raw)


Sten Drescher <sten.drescher@tivoli.com> writes:

> 	1) You can't do a setq or defvar of the
> gnus-default-adaptive-word-score-alist in .gnus, using the mark names
> as shown on the info page, because they haven't been defined yet.  You
> can use the numeric values resulting from setting the variable after
> starting Gnus, but I'm not really comfortable doing that.

Put some `(require 'whatever)'s into the .gnus.el file.
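
For example, the following sketch shows the shape such a .gnus.el entry might take. The feature name `gnus-sum' is an assumption about where the mark variables live (the reply only says `whatever'), and the score values are placeholders.

```elisp
;; Assumption: `gnus-sum' is the feature that defines the mark
;; variables used below.
(require 'gnus-sum)

(setq gnus-default-adaptive-word-score-alist
      `((,gnus-read-mark . 30)      ; placeholder score
        (,gnus-killed-mark . -20))) ; placeholder score
```

Because the `require' runs first, the mark names are already defined when the backquoted alist is built, so there is no need to hard-code their numeric values.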

> 	2) All other adaptive scoring stops when you do word scoring.
> Yes, I want the words adapted, but I still want the authors and
> followups adapted as well.  Is there any way this can be done?

This has been fixed in 0.54, I believe.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Adaptive word scoring
@ 1996-08-04  2:57 Lars Magne Ingebrigtsen
  1996-08-04 17:19 ` François Pinard
  0 siblings, 1 reply; 30+ messages in thread
From: Lars Magne Ingebrigtsen @ 1996-08-04  2:57 UTC (permalink / raw)



I want to implement adaptive scoring on words.  Since this will
generate a *lot* of score rules, I think I have to write a new match
method -- `w'.  One could use string search (too inaccurate), regexp
search with "\bword\b" (too slow), split all subjects into words and
put them in a list and use `member' (too slow), intern them and use
`memq' (better, but too slow), split into words and put in a buffer
and use search for "\nword\n" (slowish).  So I think I'll split into
words and use a hash table, which seems to be the fastest way.  This
means that the words'll have to be downcased, so one can't then have
case-sensitive word matches.
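
The hash-table approach described above can be sketched roughly as follows. This is not the Gnus implementation: the function names are hypothetical, and the code uses `make-hash-table' and the three-argument form of `split-string' from later Emacs versions. It only shows the idea of downcasing, splitting into words, and doing one table lookup per word.

```elisp
(defun my-word-score-table (rules)
  "Build a hash table from RULES, an alist of (WORD . SCORE).
Words are downcased, so matching is case-insensitive."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (rule rules)
      (puthash (downcase (car rule)) (cdr rule) table))
    table))

(defun my-score-subject (subject table)
  "Return the sum of the scores in TABLE for every word in SUBJECT."
  (let ((score 0))
    ;; Split on runs of non-alphanumeric characters, dropping empties.
    (dolist (word (split-string (downcase subject) "[^[:alnum:]]+" t))
      (setq score (+ score (gethash word table 0))))
    score))

;; Usage with two made-up rules.
(my-score-subject "Sexy Emacs GIF (emacs)"
                  (my-word-score-table '(("gif" . -5) ("Emacs" . 10))))
```

Each word costs a single hash lookup no matter how many rules exist, which is the property that makes this attractive when adaptation generates a lot of score rules.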

Yes.  Does anybody have a list of common English "small words" --
"and", "the", etc., that should be excluded from adaptation?  This
should be configurable, of course.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Ingebrigtsen



* Re: Adaptive word scoring
  1996-08-04  2:57 Lars Magne Ingebrigtsen
@ 1996-08-04 17:19 ` François Pinard
  0 siblings, 0 replies; 30+ messages in thread
From: François Pinard @ 1996-08-04 17:19 UTC (permalink / raw)
  Cc: ding


Lars Magne Ingebrigtsen <larsi@ifi.uio.no> writes:
   
   Yes.  Does anybody have a list of common English "small words" --
   "and", "the", etc., that should be excluded from adaptation?  This
   should be configurable, of course.

GNU ptx distributes the following list of ignorable words, yet it neither
installs nor uses it by default.


[-- Attachment #2: eign --]
[-- Type: application/octet-stream, Size: 775 bytes --]

a
about
after
against
all
also
an
and
another
any
are
as
at
back
be
because
been
before
being
between
both
but
by
came
can
come
could
current
day
did
do
down
each
end
even
first
for
from
get
go
good
great
had
has
have
he
her
here
him
his
how
i
if
in
into
is
it
its
just
know
last
life
like
little
long
made
make
man
many
may
me
men
might
more
most
mr
much
must
my
name
never
new
no
not
now
of
off
old
on
one
only
or
other
our
out
over
own
part
people
point
right
said
same
say
see
she
should
since
so
some
start
state
still
such
take
than
that
the
their
them
then
there
these
they
this
those
three
through
time
to
too
true
try
two
under
up
us
use
used
value
very
was
way
we
well
were
what
when
where
which
while
who
why
will
with
without
work
world
would
year
years
you
your



-- 
François Pinard         ``Vivement GNU!''        pinard@iro.umontreal.ca
Support Programming Freedom, join our League!  Ask lpf@lpf.org for info!


end of thread

Thread overview: 30+ messages
1996-11-29  5:25 adaptive word scoring Felix Lee
1996-11-29  8:09 ` Kai Grossjohann
1996-11-29 22:48   ` Felix Lee
1996-11-30 13:18     ` Lars Magne Ingebrigtsen
1996-12-01  8:39       ` Felix Lee
1996-11-29 15:45 ` Jan Vroonhof
1996-11-30  2:28   ` Felix Lee
1996-12-02  9:37   ` Steinar Bang
1996-12-02  9:40 ` Wesley.Hardaker
1996-12-05 18:49   ` Lars Magne Ingebrigtsen
1996-12-06  8:18     ` Wesley.Hardaker
1996-12-02 11:46 ` Hans de Graaff
1996-12-02 15:08   ` Robert Bihlmeyer
1996-12-05 18:50     ` Lars Magne Ingebrigtsen
1996-12-05 21:21       ` Sean Lynch
1996-12-06 10:39         ` Lars Magne Ingebrigtsen
1996-12-08 22:19           ` Sean Lynch
1996-12-11  0:44             ` Lars Magne Ingebrigtsen
1996-12-06 21:02         ` Janne Sinkkonen
1996-12-08 22:48           ` Sean Lynch
1996-12-10 22:25             ` nnspool virtual server shows funny numbers of articles C. R. Oldham
1996-12-11  0:42               ` Lars Magne Ingebrigtsen
     [not found]   ` <vcn2vvixpz.fsf@totally-fudged-out-message-id>
1996-12-03 13:51     ` adaptive word scoring Holger Franz
  -- strict thread matches above, loose matches on Subject: below --
1996-10-31  1:34 Adaptive " Sten Drescher
1996-11-05 15:51 ` Robert Bihlmeyer
1996-11-05 17:16   ` Per Abrahamsen
1996-11-05 21:24   ` Lars Magne Ingebrigtsen
1996-11-05 21:25 ` Lars Magne Ingebrigtsen
1996-08-04  2:57 Lars Magne Ingebrigtsen
1996-08-04 17:19 ` François Pinard
