>>>>> Sometime around 06 Dec 1996 23:02:09 +0200,
>>>>> in article ,
>>>>> someone posing as Janne Sinkkonen wrote:

Janne> Sean Lynch writes:

>> The score of the word in the database would be adjusted by adding
>> (old score - new score)/c to it, where c is the speed of light.

Janne> This makes sense (given c is in appropriate units).

>> C could decrease over time so that scores would stabilize, though
>> this would cause scores to stop adapting eventually.

Janne> I vote against decreasing C. Instead, it should be a constant
Janne> small value, say something between 0.05 and 0.001. Reading
Janne> pattern changes etc. - the scores should adapt all the time.

Whoops, I made a booboo here... I had originally written
(old score - new score) * c, but then changed my mind and changed it
to /, but I left the following paragraph the same. It should have
been "increase c over time" if you're dividing by c. Either way
works, though, and I think I agree that c should remain constant
(large if we divide, small if we multiply).

>> Obviously, there would be some sort of thresholding function to
>> drop words with a large probability of occurrence.

Janne> And words with small probabilities should not be in the
Janne> calculations because the probability estimates are unstable.

It seems to me that the words with the smallest probability of
occurrence should be used because they're the most "interesting" by
the definition we're using here. The question then becomes how to
keep the scores of such improbable words from varying wildly.

I noticed that the formula I proposed:

    -50*(probability of the occurrence of this word)/(sum of
    probabilities of all distinct words in this line)

does exactly the opposite of what I wanted, which was to allocate
*more* of the score to words with a *lower* probability of
occurrence, not vice-versa.
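To make the problem concrete, here is a minimal Python sketch of the
proportional allocation. It is only an illustration: the function
name proportional_shares is hypothetical, and the inputs (a line
score of 50 rather than the -50 above, one word at 1% and four words
at 5%) are assumed values chosen to keep the numbers easy to read.

```python
def proportional_shares(score, probs):
    """Split a line's score among its words in proportion to each
    word's probability of occurrence (the allocation that turned out
    to be backwards)."""
    total = sum(probs)
    return [score * p / total for p in probs]


# Assumed example: one rare word (1%) and four common words (5% each).
probs = [0.01, 0.05, 0.05, 0.05, 0.05]
shares = proportional_shares(50, probs)

# The rare word ends up with the smallest share, not the largest.
print([round(x, 1) for x in shares])  # [2.4, 11.9, 11.9, 11.9, 11.9]
```

The shares still add up to the line score, but the common words take
almost all of it.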
So the formula needs to be tweaked a bit, as follows. Given a line
with a score of s, and distinct words w_1 w_2 w_3 w_4 w_5 with
probabilities p_1 p_2 p_3 p_4 p_5, the score allocated to each word
would be:

    s_i = s / (p_i * (1/p_1 + 1/p_2 + 1/p_3 + 1/p_4 + 1/p_5))

That formula looks familiar for some strange reason.

So, if word 1 had a 1% chance of occurring in a randomly selected
line, and the rest of the words each had a 5% chance, and the score
given to the current line is 50, the formula would come out to:

    s_1 = 50 / (.01 * (1/.01 + 1/.05 + 1/.05 + 1/.05 + 1/.05))
    s_1 = 50 / (.01 * (100 + 20 + 20 + 20 + 20))
    s_1 = 50 / (.01 * 180)
    s_1 = 50 / 1.8
    s_1 = 27.8

    s_2 = 50 / (.05 * 180) = 5.6

    27.8 + 5.6 + 5.6 + 5.6 + 5.6 = ~50 (round off error, you know)

Phew. That's what I wanted to get the first time, but somehow it just
didn't come out that way.
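The worked example can be checked with a short Python sketch of the
inverse-probability allocation. This is an illustration under stated
assumptions, not an existing implementation: the function name
inverse_shares is hypothetical, and it handles any number of distinct
words rather than exactly five.

```python
def inverse_shares(score, probs):
    """Split a line's score so that each word's share is proportional
    to 1/p, i.e. s_i = s / (p_i * sum_j(1/p_j))."""
    inv_total = sum(1.0 / p for p in probs)
    return [score / (p * inv_total) for p in probs]


# Same numbers as the example: one word at 1%, four words at 5%.
probs = [0.01, 0.05, 0.05, 0.05, 0.05]
shares = inverse_shares(50, probs)

print([round(x, 1) for x in shares])  # [27.8, 5.6, 5.6, 5.6, 5.6]
print(round(sum(shares), 6))          # 50.0
```

Without the rounding to one decimal place, the shares sum to the
line score (up to floating-point error), so the ~50 in the hand
calculation really is just round-off.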