From: Janne Sinkkonen
Newsgroups: gmane.emacs.gnus.general
Subject: Re: adaptive word scoring
Date: 06 Dec 1996 23:02:09 +0200
References: <199611290525.VAA00464@kim.teleport.com> <9612021508.AA23722@stud2.tuwien.ac.at>
Cc: Lars Magne Ingebrigtsen, ding@ifi.uio.no
Original-To: Sean Lynch
In-Reply-To: Sean Lynch's message of 05 Dec 1996 13:21:45 -0800
X-Mailer: Red Gnus v0.62/XEmacs 19.14

Sean Lynch writes:

> the way to go. The fundamental theorem of information theory tells us
> that the value of any piece of information is inversely proportional
> to its probability of occurrence.

Proportional to the negative logarithm of the probability, actually.
This holds as long as the events are independent. The occurrence of
words depends on the context, but we get an approximation anyway.

> The score of the word in the database would be adjusted by adding
> (old score - new score)/c to it, where c is the speed of light.

This makes sense (given c is in appropriate units).

> C could decrease over time so that scores would stabilize, though
> this would cause scores to stop adapting eventually.

I vote against decreasing C. Instead, it should be a small constant
value, say something between 0.05 and 0.001. Reading patterns change
over time, etc., so the scores should adapt all the time.

> Obviously, there would be some sort of thresholding function to drop
> words with a large probability of occurrence.

And words with small probabilities should not be in the calculations
either, because their probability estimates are unstable.

--
Janne Sinkkonen
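
P.S. In case the arithmetic is not obvious, here is a minimal sketch of
the update rule I am arguing for, reading C as a multiplicative
adaptation rate (so a small C means slow adaptation). The names are
made up for illustration; this is not actual Gnus code:

(defvar word-score-table (make-hash-table :test 'equal)
  "Maps words (strings) to adaptive scores (floats).")

(defvar word-score-rate 0.01
  "The constant adaptation rate C; something between 0.001 and 0.05.")

(defun word-score-update (word article-score)
  "Move WORD's stored score a fraction C of the way toward ARTICLE-SCORE.
ARTICLE-SCORE could be, say, +1.0 when the user reads or ticks an
article containing WORD and -1.0 when she kills it.  Old evidence
decays exponentially, so the scores keep adapting to changes in
reading patterns instead of freezing."
  (let ((old (gethash word word-score-table 0.0)))
    (puthash word
             (+ old (* word-score-rate (- article-score old)))
             word-score-table)))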