From: Sean Lynch
Sender: seanl@Internex.NET
Newsgroups: gmane.emacs.gnus.general
Subject: Re: adaptive word scoring
Date: 08 Dec 1996 14:48:38 -0800
References: <199611290525.VAA00464@kim.teleport.com> <9612021508.AA23722@stud2.tuwien.ac.at>
Original-To: Janne Sinkkonen
Cc: Sean Lynch, Lars Magne Ingebrigtsen, ding@ifi.uio.no
In-Reply-To: Janne Sinkkonen's message of 06 Dec 1996 23:02:09 +0200
Mime-Version: 1.0 (generated by tm-edit 7.88)
Content-Type: multipart/mixed; boundary="Multipart_Sun_Dec__8_14:48:38_1996-1"
Content-Transfer-Encoding: 7bit
X-Mailer: Red Gnus v0.73/XEmacs 19.14
Xref: main.gmane.org gmane.emacs.gnus.general:9138

--Multipart_Sun_Dec__8_14:48:38_1996-1
Content-Type: text/plain; charset=US-ASCII

>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article ,
>>>>> someone posing as Janne Sinkkonen wrote:

Janne> Sean Lynch writes:
>> The score of the word in the
>> database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written
(old score - new score) * c, then changed my mind and switched it to /,
but left the following paragraph as it was. It should have said
"increase c over time" if you're dividing by c. Either way works,
though, and I think I agree that c should remain constant (large if we
divide, small if we multiply).

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used, because they're the most "interesting" by
the definition we're using here. The question then becomes how to keep
the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

    -50 * (probability of the occurrence of this word) /
          (sum of probabilities of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of occurrence,
not vice versa. So the formula needs to be tweaked a bit. Given a line
with a score of s and distinct words w_1 w_2 w_3 w_4 w_5 with
probabilities p_1 p_2 p_3 p_4 p_5, the score allocated to word i would
be:

    s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.
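
For anyone who wants to play with the numbers, here is a minimal Python
sketch of that allocation rule. It is only an illustration, not Gnus
code: the function name allocate_score and its arguments are
hypothetical, and it assumes every probability is strictly positive.
Each word's share is proportional to 1/p_i, so rarer words receive more
of the line's score and the shares add back up to s.

```python
def allocate_score(line_score, probabilities):
    """Split line_score across words, weighting each word by 1/p.

    Hypothetical helper for illustration only. `probabilities` holds the
    estimated probability of occurrence for each distinct word in the line.
    """
    if not probabilities:
        return []
    if any(p <= 0 for p in probabilities):
        raise ValueError("probabilities must be strictly positive")

    # Sum of 1/p_j over all distinct words in the line.
    inverse_sum = sum(1.0 / p for p in probabilities)

    # s_i = s / (p_i * sum_j 1/p_j)
    return [line_score / (p * inverse_sum) for p in probabilities]


# Two words, one four times rarer than the other: the rarer word
# ends up with four times the share of the line's score.
shares = allocate_score(10, [0.1, 0.4])
```

Because the shares are normalized by the sum of the inverses, the total
handed out always equals the line's score (up to floating-point
round-off), whatever the individual probabilities are.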
So, if word 1 had a 1% chance of occurring in a randomly selected
line, each of the other words had a 5% chance, and the score given to
the current line were 50, the formula would come out to:

    s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))
    s_1 = 50 / (.01 * (100 + 20 + 20 + 20 + 20))
    s_1 = 50 / (.01 * 180)
    s_1 = 50 / 1.8
    s_1 = 27.8

    s_2 = 50 / (.05 * 180) = 5.6

    27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew. That's what I wanted to get the first time, but somehow it just
didn't come out that way.

--Multipart_Sun_Dec__8_14:48:38_1996-1
Content-Type: text/plain; charset=US-ASCII

Sean Lynch, Internex Network Operations
Voice: +1 408 327 2200
Fax: +1 408 496 5484
Technical support:
--Multipart_Sun_Dec__8_14:48:38_1996-1--