Gnus development mailing list
* Multiple counts of words in adaptive scoring
From: Hans de Graaff @ 2000-09-17 13:59 UTC (permalink / raw)


Hi,

I am using adaptive word scoring. It seems that if the same scorable
word occurs more than once in a subject line, it is also counted more
than once. I don't think this is the desired effect; perhaps each
word should be counted only once per subject line, which seems more
intuitive.
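
To make the difference concrete, here is a minimal sketch of the two
counting strategies. It is not the Gnus implementation: the tokenizer,
the function names (subject_words, adapt_scores) and the score delta
of 5 are all assumptions made for illustration.

```python
import re


def subject_words(subject):
    """Split a subject line into lowercase words (simplified, hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", subject.lower())


def adapt_scores(scores, subject, delta, once_per_subject=True):
    """Add `delta` to the score of each word found in `subject`.

    With once_per_subject=True, a word repeated in the subject is
    adjusted only once; otherwise it is adjusted once per occurrence.
    """
    words = subject_words(subject)
    if once_per_subject:
        words = set(words)  # drop repeats within this one subject line
    for word in words:
        scores[word] = scores.get(word, 0) + delta
    return scores


subject = "Emacs tips: more Emacs, less emacs pain"
per_occurrence = adapt_scores({}, subject, 5, once_per_subject=False)
per_subject = adapt_scores({}, subject, 5, once_per_subject=True)
```

With this subject line, "emacs" ends up with a score of 15 when every
occurrence counts and 5 when each word counts once; words that occur
only once get 5 either way.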

Thoughts?

Hans






* Re: Multiple counts of words in adaptive scoring
From: Kai Großjohann @ 2000-09-18 22:37 UTC (permalink / raw)
  Cc: ding

On 17 Sep 2000, Hans de Graaff wrote:

> I am using adaptive word scoring. It seems that if the same scorable
> word occurs more than once in a subject line, it is also counted more
> than once. I don't think this is the desired effect; perhaps each
> word should be counted only once per subject line, which seems more
> intuitive.

Hm.  Hmmm...  Yes, maybe for such short documents taking term
frequency into account is not a good idea.  But document frequency
(percentage of documents containing the term) should be taken into
account.

The traditional way to do term weighting is tf*idf, where tf stands
for term frequency (how often does term A occur in document D1), and
idf stands for inverse document frequency (the inverse of the fraction
of documents that contain term A).  The actual formula has a logarithm
in it; the usual form is tf*log(N/df), where N is the total number of
documents and df is the number of documents containing term A.
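
As a rough illustration of the tf*idf weighting described above, here
is a small sketch that assumes the common tf*log(N/df) variant, with N
the total number of documents and df the number of documents
containing the term. The function name tf_idf and the sample subject
lines are made up for the example; nothing here is taken from Gnus.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    `documents` is a list of word lists, e.g. tokenized subject lines.
    """
    n_docs = len(documents)

    # df: in how many documents does each term appear at least once?
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))

    weights = []
    for words in documents:
        term_freq = Counter(words)  # tf: occurrences within this document
        weights.append(
            {term: tf * math.log(n_docs / doc_freq[term])
             for term, tf in term_freq.items()}
        )
    return weights


subjects = [
    "gnus scoring question".split(),
    "gnus gnus release".split(),
    "adaptive scoring in gnus".split(),
]
weights = tf_idf(subjects)
```

In this toy collection "gnus" appears in every subject, so its weight
is zero no matter how often it is repeated, while a word that appears
in only one subject gets the largest idf factor.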

tf*idf weighting is discussed in many standard Information Retrieval
textbooks. 

The big question is: what is the total number of documents?  I.e.,
which documents do we take into account?  All the postings shown in
the summary buffer?  All the postings available on the server?  In
that group only, in all groups, or in thematically related groups?

Would be interesting to know whether somebody has done some research
on this.

kai
-- 
I like BOTH kinds of music.


