From: Janne Sinkkonen
Newsgroups: gmane.emacs.gnus.general
Subject: Re: adaptive word scoring
Date: 06 Dec 1996 23:02:09 +0200
References: <199611290525.VAA00464@kim.teleport.com> <9612021508.AA23722@stud2.tuwien.ac.at>
Cc: Lars Magne Ingebrigtsen, ding@ifi.uio.no
Original-To: Sean Lynch
In-Reply-To: Sean Lynch's message of 05 Dec 1996 13:21:45 -0800
X-Mailer: Red Gnus v0.62/XEmacs 19.14

Sean Lynch writes:

> the way to go. The fundamental theorem of information theory tells us
> that the value of any piece of information is inversely proportional
> to its probability of occurrence.

Proportional to the negative logarithm of the probability, actually.
This holds as long as the events are independent. The occurrence of
words depends on the context, but we get an approximation anyway.

> The score of the word in the database would be adjusted by adding
> (old score - new score)/c to it, where c is the speed of light.

This makes sense (given c is in appropriate units).

> C could decrease over time so that scores would stabilize, though
> this would cause scores to stop adapting eventually.

I vote against decreasing C. Instead, it should be a small constant
value, say something between 0.05 and 0.001. Reading patterns change
over time, etc., so the scores should adapt all the time.

> Obviously, there would be some sort of thresholding function to drop
> words with a large probability of occurrence.

And words with small probabilities should not be in the calculations
either, because their probability estimates are unstable.

--
Janne Sinkkonen
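
P.S. In case the arithmetic is not obvious, here is a minimal sketch of
the update rule I am arguing for, reading C as a multiplicative
adaptation rate (so a small C means slow adaptation). The names are
made up for illustration; this is not actual Gnus code:

(defvar word-score-table (make-hash-table :test 'equal)
  "Maps words (strings) to adaptive scores (floats).")

(defvar word-score-rate 0.01
  "The constant adaptation rate C; something between 0.001 and 0.05.")

(defun word-score-update (word article-score)
  "Move WORD's stored score a fraction C of the way toward ARTICLE-SCORE.
ARTICLE-SCORE could be, say, +1.0 when the user reads or ticks an
article containing WORD and -1.0 when she kills it.  Old evidence
decays exponentially, so the scores keep adapting to changes in
reading patterns instead of freezing."
  (let ((old (gethash word word-score-table 0.0)))
    (puthash word
             (+ old (* word-score-rate (- article-score old)))
             word-score-table)))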