From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/9138
Path: main.gmane.org!not-for-mail
From: Sean Lynch <seanl@Internex.NET>
Newsgroups: gmane.emacs.gnus.general
Subject: Re: adaptive word scoring
Date: 08 Dec 1996 14:48:38 -0800
Sender: seanl@Internex.NET
Message-ID: <rhsenh0r6t5.fsf@internex.net>
References: <199611290525.VAA00464@kim.teleport.com>  <yahlobh2mpz.fsf@twi.tudelft.nl> <9612021508.AA23722@stud2.tuwien.ac.at> <m2ral4rfjj.fsf@proletcult.slip.ifi.uio.no> <rhsafrsn0uu.fsf@internex.net> <oak9qvgze6.fsf@avocado.pc.helsinki.fi>
NNTP-Posting-Host: coloc-standby.netfonds.no
Mime-Version: 1.0 (generated by tm-edit 7.88)
Content-Type: multipart/mixed;
 boundary="Multipart_Sun_Dec__8_14:48:38_1996-1"
Content-Transfer-Encoding: 7bit
X-Trace: main.gmane.org 1035149207 16662 80.91.224.250 (20 Oct 2002 21:26:47 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Sun, 20 Oct 2002 21:26:47 +0000 (UTC)
Cc: Sean Lynch <seanl@Internex.NET>,
        Lars Magne Ingebrigtsen <larsi@ifi.uio.no>, ding@ifi.uio.no
Return-Path: <ding-request@ifi.uio.no>
Original-Received: from ifi.uio.no (0@ifi.uio.no [129.240.64.2])
          by deanna.miranova.com (8.8.4/8.8.4) with SMTP
	  id PAA09274 for <steve@miranova.com>; Sun, 8 Dec 1996 15:01:00 -0800
Original-Received: from starfleet.Internex.NET (starfleet.internex.net [199.2.14.11]) by ifi.uio.no with ESMTP (8.6.11/ifi2.4) 
	id <XAA21980@ifi.uio.no> ; Sun, 8 Dec 1996 23:48:49 +0100
Original-Received: from Earth.Internex.NET (earth.internex.net [199.2.13.16]) by starfleet.Internex.NET (8.8.2/8.8.0) with ESMTP id OAA06230; Sun, 8 Dec 1996 14:48:39 -0800 (PST)
Original-Received: (from seanl@localhost) by Earth.Internex.NET (8.8.2/8.8.2) id OAA07819; Sun, 8 Dec 1996 14:48:39 -0800 (PST)
Original-To: Janne Sinkkonen <janne@avocado.pc.helsinki.fi>
In-Reply-To: Janne Sinkkonen's message of 06 Dec 1996 23:02:09 +0200
Original-Lines: 92
X-Mailer: Red Gnus v0.73/XEmacs 19.14
Xref: main.gmane.org gmane.emacs.gnus.general:9138
X-Report-Spam: http://spam.gmane.org/gmane.emacs.gnus.general:9138

--Multipart_Sun_Dec__8_14:48:38_1996-1
Content-Type: text/plain; charset=US-ASCII

>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article <oak9qvgze6.fsf@avocado.pc.helsinki.fi>,
>>>>> someone posing as Janne Sinkkonen <janne@avocado.pc.helsinki.fi> wrote:

Janne> Sean Lynch <seanl@Internex.NET> writes:
>> The score of the word in the database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written (old score -
new score) * c, then changed my mind and switched to dividing by c, but
I left the following paragraph as it was.  If you're dividing by c, it
should have said "increase c over time".  Either way works, though, and
I think I agree that c should remain constant (large if we divide,
small if we multiply).
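
For illustration only, here is a minimal Python sketch of the
constant-c update in its multiply form.  The function name
update_score and the default c = 0.05 (the top of Janne's suggested
range) are hypothetical.  It also assumes the adjustment is meant to
pull the stored score toward the new one, so the difference is taken
as (new score - old score); nothing here is actual Gnus code.

```python
def update_score(old_score, new_score, c=0.05):
    """Move the stored score a fraction c of the way toward new_score.

    Assumption: the difference is (new - old), so the stored score
    drifts toward the newly observed one.  Dividing by a large constant
    instead is the same thing with c replaced by 1/c.
    """
    return old_score + (new_score - old_score) * c


# Feed the same new score repeatedly: with a constant c the stored
# score keeps adapting and approaches the new score without ever
# freezing in place.
score = 0.0
for _ in range(200):
    score = update_score(score, 50.0)
```

With a small constant c a single article barely moves a word's score,
while a sustained change in reading habits still shows up eventually.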

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used because they're the most "interesting" by
the definition we're using here.  The question then becomes how to
keep the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

-50*(probability of the occurrence of this word)/(sum of probabilities
of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of occurrence,
not vice versa.  So the formula needs to be tweaked a bit.  Given a
line with a score of s, and distinct words w_1 w_2 w_3 w_4 w_5 with
probabilities p_1 p_2 p_3 p_4 p_5, the score s_i allocated to word w_i
would be:

s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.

So, if word 1 had a 1% chance of occurring in a randomly selected
line, each of the other four words had a 5% chance, and the score
given to the current line is 50, the formula would come out to:

s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))

s_1 = 50 / (.01 * (100 + 20 + 20 + 20 + 20))

s_1 = 50 / (.01 * 180)

s_1 = 50 / 1.8

s_1 = 27.8

s_2 = 50 / (.05 * 180) = 5.6

27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew.

That's what I wanted to get the first time, but somehow it just didn't
come out that way.
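
The allocation above can be sketched in a few lines of Python as a
sanity check on the arithmetic.  The function name allocate_score is
made up for this example, and it assumes every probability is greater
than zero; it is not taken from Gnus.

```python
def allocate_score(line_score, probabilities):
    """Split line_score among words in inverse proportion to probability.

    probabilities holds one occurrence probability per distinct word in
    the line; each must be greater than zero (an assumption of this
    sketch).
    """
    # Sum of 1/p over all distinct words in the line.
    inverse_sum = sum(1.0 / p for p in probabilities)
    # s_i = s / (p_i * inverse_sum): rarer words get a larger share.
    return [line_score / (p * inverse_sum) for p in probabilities]


# The worked example: one word at 1%, four words at 5%, line score 50.
shares = allocate_score(50, [0.01, 0.05, 0.05, 0.05, 0.05])
```

Unrounded, the shares add up to the line score exactly, so the ~50 in
the hand calculation really is just rounding.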

--Multipart_Sun_Dec__8_14:48:38_1996-1
Content-Type: text/plain; charset=US-ASCII

Sean Lynch,  Internex Network Operations                 <noc@internex.net>
Voice: +1 408 327 2200  Fax: +1 408 496 5484  <URL:http://www.internex.net>
Technical support: <support@internex.net> <URL:http://support.internex.net>

--Multipart_Sun_Dec__8_14:48:38_1996-1--