Announcements and discussions for Gnus, the GNU Emacs Usenet newsreader
* Recognizing repeats in RSS feeds
@ 2009-01-16 18:12 Desmond Rivet
  2009-01-16 21:08 ` Robert D. Crawford
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Desmond Rivet @ 2009-01-16 18:12 UTC (permalink / raw)
  To: info-gnus-english

Hi all,

In addition to reading news and email, I use Gnus to keep track of
various RSS feeds.

For some of these feeds, certain articles will, over time, show up
repeatedly in my summary list.  I'm not sure why, but I assume it has
something to do with updates to the article itself.  Or maybe it happens
when someone posts a new comment on the article.  I don't know.

I have threading enabled on these RSS groups, so the repeated articles
at least get put under one thread, which is good.  However, they still
show up as "new" articles in my group.  I have to go in the group and
verify that the article is a repeat.  It's a bit of a pain.

Is there any way to score a repeated (updated) article down, so that
they wouldn't show up in my group unless I asked?  I have no idea where
to even start with this; a simple push in the right direction would be
appreciated.

Thanks in advance.

-- 
Desmond Rivet

Pain is weakness leaving the body.


* Re: Recognizing repeats in RSS feeds
  2009-01-16 18:12 Recognizing repeats in RSS feeds Desmond Rivet
@ 2009-01-16 21:08 ` Robert D. Crawford
  2009-01-16 22:05 ` Ted Zlatanov
  2009-01-22  3:15 ` Mark Plaksin
  2 siblings, 0 replies; 8+ messages in thread
From: Robert D. Crawford @ 2009-01-16 21:08 UTC (permalink / raw)
  To: info-gnus-english

Desmond Rivet <desmond_news@videotron.ca> writes:

> For some of these feeds, certain articles will, over time, show up
> repeatedly in my summary list.  I'm not sure why, but I assume it has
> something to do with updates to the article itself.  Or maybe it happens
> when someone posts a new comment on the article.  I don't know.

I'm not sure it is any of these, but it could be.  It happens to me on
_many_ feeds.

> Is there any way to score a repeated (updated) article down, so that
> they wouldn't show up in my group unless I asked?  I have no idea where
> to even start with this; a simple push in the right direction would be
> appreciated.

What I did was change my reading habits in these groups a bit.
Instead of just letting the article be marked as read, I kill the
article.  My SCORE file is set to lower the score whenever I kill an
article, and it has mark-and-expunge set to -1.

There might be a better or cleaner way to do this but it works.
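
A setup along these lines might look like the sketch below.  It is a
guess at the general shape, not Robert's actual configuration: the -5
value and the choice of scoring on the subject are assumptions, and
replacing gnus-default-adaptive-score-alist affects every group that
uses adaptive scoring.

```elisp
;; ~/.gnus.el
;; Let Gnus add score entries based on the marks left when exiting a group.
(setq gnus-use-adaptive-scoring t)

;; Hypothetical rule set: a killed article lowers the score of its subject,
;; so a later repeat with the same subject starts out with a negative score.
;; This replaces the default adaptive rules for all groups.
(setq gnus-default-adaptive-score-alist
      '((gnus-killed-mark (subject -5))))

;; The score file for the RSS group is a separate file; its content would
;; include an entry like the following (shown here as a comment):
;; ((mark-and-expunge -1))
```

With mark-and-expunge at -1, any article scoring below -1 is marked as
read and removed from the summary buffer when the group is entered.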

rdc
-- 
Robert D. Crawford                                      rdc1x@comcast.net


* Re: Recognizing repeats in RSS feeds
  2009-01-16 18:12 Recognizing repeats in RSS feeds Desmond Rivet
  2009-01-16 21:08 ` Robert D. Crawford
@ 2009-01-16 22:05 ` Ted Zlatanov
  2009-01-21  1:22   ` Desmond Rivet
  2009-01-22  3:15 ` Mark Plaksin
  2 siblings, 1 reply; 8+ messages in thread
From: Ted Zlatanov @ 2009-01-16 22:05 UTC (permalink / raw)
  To: info-gnus-english

On Fri, 16 Jan 2009 13:12:37 -0500 Desmond Rivet <desmond_news@videotron.ca> wrote: 

DR> In addition to reading news and email, I use Gnus to keep track of
DR> various RSS feeds.

DR> For some of these feeds, certain articles will, over time, show up
DR> repeatedly in my summary list.  I'm not sure why, but I assume it has
DR> something to do with updates to the article itself.  Or maybe it happens
DR> when someone posts a new comment on the article.  I don't know.
...
DR> Is there any way to score a repeated (updated) article down, so that
DR> they wouldn't show up in my group unless I asked?  I have no idea where
DR> to even start with this; a simple push in the right direction would be
DR> appreciated.

You want to ignore updates which only affect irrelevant fields.  Here's
how I do it:

(setq nnrss-ignore-article-fields '(description slash:comments slash:hit_parade))

This works for me to eliminate duplicates completely; "description"
changes very frequently on some sites for instance.  nnrss finds unique
articles by taking all their fields that are not ignored and hashing the
content.
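
The idea can be sketched as follows.  This is an illustration only, not
the nnrss source: my-rss-item-hash is a hypothetical helper, and it
assumes an Emacs recent enough to ship seq.el.

```elisp
;; Illustration only; this is not the nnrss implementation.
(require 'seq)

(defun my-rss-item-hash (item ignored-fields)
  "Return an MD5 hex digest of ITEM with IGNORED-FIELDS left out.
ITEM is a parsed XML node of the form (item ATTRIBUTES CHILDREN...)."
  (let ((kept (seq-remove
               (lambda (child)
                 ;; Drop child elements whose tag is in IGNORED-FIELDS;
                 ;; keep the tag symbol, attributes, and strings as they are.
                 (and (consp child)
                      (memq (car child) ignored-fields)))
               item)))
    (md5 (prin1-to-string kept))))

;; Two versions of one item that differ only in the comment count.
(let ((old '(item nil (title nil "Some story") (slash:comments nil "757")))
      (new '(item nil (title nil "Some story") (slash:comments nil "770"))))
  (list (equal (my-rss-item-hash old nil)
               (my-rss-item-hash new nil))
        (equal (my-rss-item-hash old '(slash:comments))
               (my-rss-item-hash new '(slash:comments)))))
;; => (nil t)
```

Without the ignore list the two versions hash differently and would be
treated as two articles; with slash:comments ignored they hash the same.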

To find out exactly what's happening, set gnus-verbose to 10 and refresh
an nnrss group.  You have to have a recent CVS Gnus to use this.  I added
it fairly recently.  In *Messages* you'll see a full dump of the RSS
segment that describes each article, and from that you can easily figure
out what's causing duplicates.

For example, here's one entry from the Dilbert Blog:

nnrss: Making hash index of (item nil "
" (title nil "From Blog to Reality: Three Interesting Things") "
" (link nil "http://dilbert.com/blog/entry/from_blog_to_reality_three_things/") "
" (description nil "...cut because it's too much text...") "
" (pubDate nil "Fri, 16 Jan 2009 01:00:01 PST") "
" (guid ((isPermaLink . "false")) "http://dilbert.com/blog/entry/203/") "
")

So the fields here are guid, pubDate, title, link, and description.

If you need more help, tell us what feeds specifically are causing the
problem and I can take a look.

Ted


* Re: Recognizing repeats in RSS feeds
  2009-01-16 22:05 ` Ted Zlatanov
@ 2009-01-21  1:22   ` Desmond Rivet
  2009-01-21  7:21     ` Adam Sjøgren
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Desmond Rivet @ 2009-01-21  1:22 UTC (permalink / raw)
  To: info-gnus-english

Ted Zlatanov <tzz@lifelogs.com> writes:

> On Fri, 16 Jan 2009 13:12:37 -0500 Desmond Rivet
> <desmond_news@videotron.ca> wrote:
>
> DR> In addition to reading news and email, I use Gnus to keep track of
> DR> various RSS feeds.
>
> DR> For some of these feeds, certain articles will, over time, show up
> DR> repeatedly in my summary list.  I'm not sure why, but I assume it has
> DR> something to do with updates to the article itself.  Or maybe it happens
> DR> when someone posts a new comment on the article.  I don't know.
> ...
> DR> Is there any way to score a repeated (updated) article down, so that
> DR> they wouldn't show up in my group unless I asked?  I have no idea where
> DR> to even start with this; a simple push in the right direction would be
> DR> appreciated.
>
> You want to ignore updates which only affect irrelevant fields.  Here's
> how I do it:
>
> (setq nnrss-ignore-article-fields '(description slash:comments slash:hit_parade))
>
> This works for me to eliminate duplicates completely; "description"
> changes very frequently on some sites for instance.  nnrss finds unique
> articles by taking all their fields that are not ignored and hashing the
> content.
>
> To find out exactly what's happening, set gnus-verbose to 10 and refresh
> an nnrss group.  You have to have a recent CVS Gnus to use this.  I added
> it fairly recently.  In *Messages* you'll see a full dump of the RSS
> segment that describes each article, and from that you can easily figure
> out what's causing duplicates.
>
> For example, here's one entry from the Dilbert Blog:
>
> nnrss: Making hash index of (item nil "
> " (title nil "From Blog to Reality: Three Interesting Things") "
> " (link nil "http://dilbert.com/blog/entry/from_blog_to_reality_three_things/") "
> " (description nil "...cut because it's too much text...") "
> " (pubDate nil "Fri, 16 Jan 2009 01:00:01 PST") "
> " (guid ((isPermaLink . "false")) "http://dilbert.com/blog/entry/203/") "
> ")
>
> So the fields here are guid, pubDate, title, link, and description.
>
> If you need more help, tell us what feeds specifically are causing the
> problem and I can take a look.

Thanks for the reply.  However, I'm somewhat confused (not by your
directions, but rather by what I'm seeing).

So, I've started examining my RSS feeds.  I'll use Slashdot as an example
since a lot of people read it.

What I did was the following:

1. Made a backup of the directory that stores my downloaded RSS feeds.

2. Waited until my Slashdot group was updated and I got a repeated item.

3. Compared a selected item from the saved backup Slashdot RSS file to a
selected item from the current Slashdot RSS file.  If I understand how
this works, there should be some sort of textual difference between the
old item and the new, yes?

(this is all very low tech, bear with me)
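
The same comparison could also be done on parsed items instead of raw
text.  A minimal sketch, assuming both versions of the item are
available as parsed XML lists of the form (item ATTRIBUTES CHILDREN...);
my-rss-changed-fields is a hypothetical helper, not part of Gnus:

```elisp
(defun my-rss-changed-fields (old new)
  "Return the tags of child elements in NEW that differ from those in OLD.
OLD and NEW are parsed XML nodes of the form (item ATTRIBUTES CHILDREN...).
A field that is present in NEW but missing from OLD also counts as changed."
  (let (changed)
    (dolist (child (cddr new) (nreverse changed))
      ;; Skip the whitespace strings that sit between elements.
      (when (and (consp child)
                 (not (equal child (assq (car child) old))))
        (push (car child) changed)))))

(my-rss-changed-fields
 '(item nil (title nil "Some story") (slash:comments nil "757"))
 '(item nil (title nil "Some story") (slash:comments nil "770")))
;; => (slash:comments)
```

Whatever this returns for a repeated article is a candidate for the
ignore list.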

So, I picked an item at random from the current RSS file, pasted the XML
fragment into a buffer, did the same with the saved RSS file, and did a
diff.  I get the following:

11,12c11,12
< <slash:comments>770</slash:comments>
< <slash:hit_parade>770,762,595,490,138,86,71</slash:hit_parade>
---
> <slash:comments>757</slash:comments>
> <slash:hit_parade>757,749,587,482,133,83,69</slash:hit_parade>

So far, so good.  This tells me that the slash:comments and
slash:hit_parade fields are the culprits, right? So I do this in my
.gnus.el:

(setq nnrss-ignore-article-fields '(slash:comments slash:hit_parade))

And restart Emacs.

However, I *still* get spurious updates of the same article in Slashdot.
So I take your advice and do this:

(setq gnus-verbose 10)

And hit M-g in Slashdot.  Picking another article at random, I see this:

nnrss: Making hash index of (item ((rdf:about . "http://it.slashdot.org/article.pl?sid=09/01/20/1930252&from=rss")) "
" (title nil "Largest Data Breach Disclosed During Inauguration") "
" (link nil "http://rss.slashdot.org/~r/Slashdot/slashdot/~3/iHBmFGKE504/article.pl") "
" (description nil "rmogull writes \"Brian Krebs over at <snip>") "
" (dc:creator nil "kdawson") "
" (dc:date nil "2009-01-20T19:44:00+00:00") "
" (dc:subject nil "security") "
" (slash:department nil "debit-cards-at-risk") "
" (slash:section nil "it") "
" (slash:comments nil "121") "
" (slash:hit_parade nil "121,117,99,80,24,16,13") "
" (feedburner:origLink nil "http://it.slashdot.org/article.pl?sid=09%2F01%2F20%2F1930252&from=rss"))

Note the presence of slash:comments and slash:hit_parade.  Am I to
understand that the slash:comments and slash:hit_parade fields are still
contributing to the hash?

I should mention I'm using GNU Emacs 23.0.60.1.

Thanks in advance for any insight!

-- 
Desmond Rivet

Pain is weakness leaving the body.


* Re: Recognizing repeats in RSS feeds
  2009-01-21  1:22   ` Desmond Rivet
@ 2009-01-21  7:21     ` Adam Sjøgren
  2009-01-21 19:38     ` Desmond Rivet
  2009-01-21 21:46     ` Ted Zlatanov
  2 siblings, 0 replies; 8+ messages in thread
From: Adam Sjøgren @ 2009-01-21  7:21 UTC (permalink / raw)
  To: info-gnus-english

On Tue, 20 Jan 2009 20:22:16 -0500, Desmond wrote:

> Note the presence of slash:comments and slash:hit_parade.  Am I to
> understand that the slash:comments and slash:hit_parade fields are still
> contributing to the hash?

What is shown is all fields, before the ignored fields are removed; see:

  http://article.gmane.org/gmane.emacs.gnus.general/67806/


  Best regards,

-- 
 "Remember, Robert, in life anything can happen."             Adam Sjøgren
                                                         asjo@koldfront.dk


* Re: Recognizing repeats in RSS feeds
  2009-01-21  1:22   ` Desmond Rivet
  2009-01-21  7:21     ` Adam Sjøgren
@ 2009-01-21 19:38     ` Desmond Rivet
  2009-01-21 21:46     ` Ted Zlatanov
  2 siblings, 0 replies; 8+ messages in thread
From: Desmond Rivet @ 2009-01-21 19:38 UTC (permalink / raw)
  To: info-gnus-english

Desmond Rivet <desmond_news@videotron.ca> writes:
>
> Thanks for the reply.  However, I'm somewhat confused (not by your
> directions, but rather by what I'm seeing).

Errr...I think I found the problem. I was doing this:

(setq nnrss-ignore-article-field '(slash:comments slash:hit_parade))

Note the missing 's'.  It should be this:

(setq nnrss-ignore-article-fields '(slash:comments slash:hit_parade))

I am very embarrassed.  Sorry for wasting everyone's time.  Things appear
to be working better now :)

-- 
Desmond Rivet

Pain is weakness leaving the body.


* Re: Recognizing repeats in RSS feeds
  2009-01-21  1:22   ` Desmond Rivet
  2009-01-21  7:21     ` Adam Sjøgren
  2009-01-21 19:38     ` Desmond Rivet
@ 2009-01-21 21:46     ` Ted Zlatanov
  2 siblings, 0 replies; 8+ messages in thread
From: Ted Zlatanov @ 2009-01-21 21:46 UTC (permalink / raw)
  To: info-gnus-english

On Tue, 20 Jan 2009 20:22:16 -0500 Desmond Rivet <desmond_news@videotron.ca> wrote: 

DR> Note the presence of slash:comments and slash:hit_parade.  Am I to
DR> understand that the slash:comments and slash:hit_parade fields are still
DR> contributing to the hash?

Adam answered, but I just wanted to explain the reasoning.

I debated this, but decided to show the article before removing those
fields.  They don't contribute to the hash, but could be important to
the user, especially if the user wants to know if they can be
re-enabled.

On Wed, 21 Jan 2009 14:38:29 -0500 Desmond Rivet <desmond_news@videotron.ca> wrote: 

DR> Errr...I think I found the problem. I was doing this:

DR> (setq nnrss-ignore-article-field '(slash:comments slash:hit_parade))

DR> Note the missing 's'.  It should be this:

DR> (setq nnrss-ignore-article-fields '(slash:comments slash:hit_parade))

DR> I am very embarrassed.  Sorry for wasting everyone's time.  Things appear
DR> to be working better now :)

I'm glad you found the issue.  This is why I always recommend using
Customize at first--you would have noticed there was no such variable.
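
A quick way to check a name without going through the Customize
interface is sketched below.  It assumes the installed Gnus is recent
enough to define the option; on an older Gnus the first form would
return nil as well.

```elisp
(require 'nnrss)

;; `custom-variable-p' is non-nil only for options declared with `defcustom',
;; so a misspelled name fails the test even after it has been set with setq.
(custom-variable-p 'nnrss-ignore-article-fields)
(custom-variable-p 'nnrss-ignore-article-field)   ; misspelled name
```

Interactively, M-x customize-variable RET nnrss-ignore-article-fields RET
opens the option, and completion only offers names that exist.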

More importantly, things work now :)  I know it's frustrating to have
duplicate RSS entries.

Ted


* Re: Recognizing repeats in RSS feeds
  2009-01-16 18:12 Recognizing repeats in RSS feeds Desmond Rivet
  2009-01-16 21:08 ` Robert D. Crawford
  2009-01-16 22:05 ` Ted Zlatanov
@ 2009-01-22  3:15 ` Mark Plaksin
  2 siblings, 0 replies; 8+ messages in thread
From: Mark Plaksin @ 2009-01-22  3:15 UTC (permalink / raw)
  To: info-gnus-english

Desmond Rivet <desmond_news@videotron.ca> writes:

> Hi all,
>
> In addition to reading news and email, I use Gnus to keep track of
> various RSS feeds.
>
> For some of these feeds, certain articles will, over time, show up
> repeatedly in my summary list.  I'm not sure why, but I assume it has
> something to do with updates to the article itself.  Or maybe it happens
> when someone posts a new comment on the article.  I don't know.

FWIW, I recently switched from nnrss to nnshimbun (part of emacs-w3m)
and this problem has essentially disappeared.  I'm very happy with the
switch.  Here's a blog entry by the guy who recently added
shimbun-use-local which allows you to fetch feeds (and other shimbuns)
via an external script:  http://www.randomsample.de/dru5/node/45

