>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article <oak9qvgze6.fsf@avocado.pc.helsinki.fi>,
>>>>> someone posing as Janne Sinkkonen <janne@avocado.pc.helsinki.fi> wrote:

Janne> Sean Lynch <seanl@Internex.NET> writes:
>> The score of the word in the database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written (old score -
new score) * c, but then changed my mind and changed it to /, but I
left the following paragraph the same.  It should have been "increase
c over time" if you're dividing by c.  Either way works, though, and I
think I agree that c should remain constant (large if we divide, small
if we multiply).
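
To make the multiply form concrete, here is a minimal sketch in Python.
The name update_score and the default c = 0.05 are picked purely for
illustration (0.05 is the top of the range Janne suggested), and the
sketch assumes the intent is to nudge the stored score toward the new
one, so it takes the difference as (new - old):

```python
def update_score(old_score, new_score, c=0.05):
    """Move the stored score a fraction c of the way toward new_score.

    Assumption: c is a small constant in (0, 1].  Dividing by a large
    constant k instead is the same rule with c = 1/k.
    """
    return old_score + (new_score - old_score) * c


# With a constant c, repeated updates keep tracking the new score
# instead of freezing, which is the point of not shrinking c over time.
score = 0.0
for _ in range(3):
    score = update_score(score, 100.0, c=0.05)
```

A smaller c gives steadier scores that adapt more slowly; a larger c
does the reverse.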

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used because they're the most "interesting" by
the definition we're using here.  The question then becomes how to
keep the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

-50*(probability of the occurrence of this word)/(sum of probabilities
of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of occurrence,
not vice-versa.  So the formula needs to be tweaked a bit.  Given a
line with a score of s and distinct words w1 w2 w3 w4 w5 with
probabilities p1 p2 p3 p4 p5, the score allocated to each word would
be:

s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.

So, if word 1 had a 1% chance of occurring in a randomly selected
line, each of the other words had a 5% chance, and the score given to
the current line were 50, the formula would come out to:

s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))
    = 50 / (.01 * (100 + 20 + 20 + 20 + 20))
    = 50 / (.01 * 180)
    = 50 / 1.8
    = 27.8

s_2 = 50 / (.05 * 180) = 5.6

27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew.

That's what I wanted to get the first time, but somehow it just didn't
come out that way.
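
The allocation above can be sketched in Python as follows.  The
function name allocate_score is made up for this example, and the
sketch assumes every probability is strictly positive (a zero would
divide by zero):

```python
def allocate_score(line_score, probabilities):
    """Split line_score among words in inverse proportion to probability.

    probabilities holds the occurrence probability of each distinct
    word in the line; each value is assumed to be greater than zero.
    """
    # Sum of reciprocals: 1/p_1 + 1/p_2 + ... + 1/p_n
    inverse_sum = sum(1.0 / p for p in probabilities)
    # Each word gets s / (p_i * inverse_sum), so rarer words get more.
    return [line_score / (p * inverse_sum) for p in probabilities]


# The worked example: one 1% word, four 5% words, line score of 50.
shares = allocate_score(50, [0.01, 0.05, 0.05, 0.05, 0.05])
print([round(share, 1) for share in shares])
print(round(sum(shares), 6))
```

Since each share is s * (1/p_i) / (sum of all 1/p_j), the unrounded
shares add back up to the line score; the 50.2 in the hand calculation
comes only from rounding each term to one decimal place.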