Re: adaptive word scoring

Gnus development mailing list

From: Sean Lynch <seanl@Internex.NET>
Cc: Sean Lynch <seanl@Internex.NET>,
	Lars Magne Ingebrigtsen <larsi@ifi.uio.no>,
	ding@ifi.uio.no
Subject: Re: adaptive word scoring
Date: 08 Dec 1996 14:48:38 -0800
Message-ID: <rhsenh0r6t5.fsf@internex.net>
In-Reply-To: Janne Sinkkonen's message of 06 Dec 1996 23:02:09 +0200

>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article <oak9qvgze6.fsf@avocado.pc.helsinki.fi>,
>>>>> someone posing as Janne Sinkkonen <janne@avocado.pc.helsinki.fi> wrote:

Janne> Sean Lynch <seanl@Internex.NET> writes:
>> The score of the word in the database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written (old score -
new score) * c, but then changed my mind and changed it to /, but I
left the following paragraph the same.  It should have been "increase
c over time" if you're dividing by c.  Either way works, though, and I
think I agree that c should remain constant (large if we divide, small
if we multiply).
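
A minimal sketch of the constant-c update, in Python. The function names
are made up for illustration, and the sign convention is an assumption:
here the stored score moves a fixed fraction of the way toward the newly
observed score. It only shows that multiplying by a small c and dividing
by a large c are the same adjustment.

```python
def update_score_mul(stored: float, observed: float, c: float = 0.05) -> float:
    """Multiply form: c is a small constant (hypothetical default 0.05)."""
    return stored + (observed - stored) * c


def update_score_div(stored: float, observed: float, c: float = 20.0) -> float:
    """Divide form: c is a large constant; 1/20 is the same step as 0.05."""
    return stored + (observed - stored) / c


# With c held constant the score never freezes: it keeps tracking
# whatever the reader's recent scores have been.
score = 0.0
for _ in range(500):
    score = update_score_mul(score, 50.0)
```

Because c never shrinks, a later change in reading habits still pulls the
stored score toward the new value at the same rate.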

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used because they're the most "interesting" by
the definition we're using here.  The question then becomes how to
keep the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

-50*(probability of the occurrence of this word)/(sum of probabilities
of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of occurrence,
not vice-versa.  So the formula needs to be tweaked a bit, as follows.
Given a line with a score of s and distinct words w1 w2 w3 w4 w5 with
probabilities p1 p2 p3 p4 p5, the score allocated to each word would
be:

s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.

So, if word 1 had a 1% chance of occurring in a randomly selected
line, and each of the other four words had a 5% chance, and the
score given to the current line is 50, the formula would come out to:

s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))
    = 50 / (.01 * (100 + 20 + 20 + 20 + 20))
    = 50 / (.01 * 180)
    = 50 / 1.8
    = 27.8

s_2 = 50 / (.05 * 180) = 5.6

27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew.

That's what I wanted to get the first time, but somehow it just didn't
come out that way.
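
The allocation above can be sketched in a few lines of Python.  This is
only an illustration of the formula, not code from Gnus; the name
allocate_score is hypothetical, and it assumes every probability is
strictly greater than zero.

```python
def allocate_score(line_score: float, probabilities: list[float]) -> list[float]:
    """Split a line's score among its distinct words, giving each word a
    share inversely proportional to its probability of occurrence."""
    # Sum of 1/p over all distinct words in the line.
    inverse_sum = sum(1.0 / p for p in probabilities)
    # s_i = s / (p_i * sum_j(1/p_j)); the shares add up to line_score.
    return [line_score / (p * inverse_sum) for p in probabilities]


# The worked example: one 1% word and four 5% words, line score 50.
shares = allocate_score(50, [0.01, 0.05, 0.05, 0.05, 0.05])
print([round(x, 1) for x in shares])  # [27.8, 5.6, 5.6, 5.6, 5.6]
```

Without the rounding, the shares sum to the line score (up to
floating-point error), so the 50.2 total above really is just round-off.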

Sean Lynch,  Internex Network Operations                 <noc@internex.net>
Voice: +1 408 327 2200  Fax: +1 408 496 5484  <URL:http://www.internex.net>
Technical support: <support@internex.net> <URL:http://support.internex.net>
