Gnus development mailing list
* duplicate items in nnrss groups
@ 2006-09-10 20:25 Jochen Küpper
  2006-09-10 21:48 ` David Hansen
  0 siblings, 1 reply; 10+ messages in thread
From: Jochen Küpper @ 2006-09-10 20:25 UTC (permalink / raw)


I use Gnus to read some rss feeds using the configuration shown below.

I believe there is nothing special about the config, the only
parameter I see related to my problem is nnrss-use-local (= t).

With this, I repeatedly get the same entries listed as new. This seems
to happen every time I press "g" *and* the date/time of the xml-file
is newer than at the previous update, or something similar.

Has anybody else seen that? Is this a known problem? Is there any cure
for it?

If this is not known, how would one go about debugging this?

Thanks for any help/hints.

,----[from .gnus]
| (require 'nnrss)
| (setq nnrss-use-local t)
| (add-hook 'gnus-summary-mode-hook
|           (lambda ()
|             (if (string-match "nnrss:" gnus-newsgroup-name)
|                 (progn
|                   (make-local-variable 'gnus-show-threads)
|                   (make-local-variable 'gnus-article-sort-functions)
|                   (make-local-variable 'gnus-use-adaptive-scoring)
|                   (make-local-variable 'gnus-use-scoring)
|                   (make-local-variable 'gnus-score-find-score-files-function)
|                   (make-local-variable 'gnus-summary-line-format)
|                   (setq gnus-show-threads nil)
|                   (setq gnus-article-sort-functions 'gnus-article-sort-by-number)
|                   (setq gnus-use-adaptive-scoring nil)
|                   (setq gnus-use-scoring t)
|                   (setq gnus-score-find-score-files-function 'gnus-score-find-single)
|                   (setq gnus-summary-line-format "%U%R%z%[%d%] %I%( %s%)\n")))))
| (defun jk/browse-nnrss-url(arg)
|   (interactive "p")
|   (let ((url (assq nnrss-url-field
|                    (mail-header-extra
|                     (gnus-data-header
|                      (assq (gnus-summary-article-number)
|                            gnus-newsgroup-data))))))
|     (if url
|         (progn
|           (browse-url (cdr url))
|           (gnus-summary-mark-as-read-forward 1))
|       (gnus-summary-scroll-up arg))))
| (add-to-list 'nnmail-extra-headers nnrss-url-field)
| (add-hook 'gnus-summary-mode-hook
|           (lambda ()
|             (if (string-match "nnrss" gnus-newsgroup-name)
|                 (define-key gnus-summary-mode-map (kbd "<RET>") 'jk/browse-nnrss-url))))
`----

One feed where I always see that is the following:
,----
| <?xml version="1.0" encoding="ISO-8859-1"?>
| <?xml-stylesheet href="http://www.tagesspiegel.de/feed/rss.css" type="text/css"?>
| <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
| 
| <channel>
|   <title>Tagesspiegel Online: Nachrichten</title>
|   <link>http://www.tagesspiegel.de</link>
|   <language>de</language>
|   <description>Nachrichten für Berlin und Deutschland</description>
|   <copyright>Urban Media GmbH</copyright>
|   <image>
|   	<title>Tagesspiegel Online: Nachrichten</title>
| 	<url>http://www.tagesspiegel.de/feed/images/tslogo.png</url>
| 	<link>http://www.tagesspiegel.de</link>
|   </image>
| </channel>
`----

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)




* Re: duplicate items in nnrss groups
  2006-09-10 20:25 duplicate items in nnrss groups Jochen Küpper
@ 2006-09-10 21:48 ` David Hansen
       [not found]   ` <87slizbdgb.fsf-/PVU1wbgskXSxOb6xA1WPVLmfr9GSJok@public.gmane.org>
  2006-09-14 16:51   ` Mark Plaksin
  0 siblings, 2 replies; 10+ messages in thread
From: David Hansen @ 2006-09-10 21:48 UTC (permalink / raw)


On Sun, 10 Sep 2006 22:25:04 +0200 Jochen Küpper wrote:

> I use Gnus to read some rss feeds using the configuration shown below.
>
> I believe there is nothing special about the config, the only
> parameter I see related to my problem is nnrss-use-local (= t).
>
> With this, I repeatedly get the same entries listed as new. This seems
> to happen every time I press "g" *and* the date/time of the xml-file
> is newer than at the previous update, or something similar.
>
> Has anybody else seen that? Is this a known problem? Is there any cure
> for it?

It's a known problem.  Do a gmane search.  So far every
discussion "just ended".  IMHO before fixing this there must
be some kind of agreement which RSS tags should be used to
build the message id.

If some RSS "expert" is reading this I would be glad to hear
which tags really indicate some new content and which are just
more or less useless additions like the <slash:*> tags.

Then it should be straightforward to code a fix.

David





* Re: duplicate items in nnrss groups
       [not found]   ` <87slizbdgb.fsf-/PVU1wbgskXSxOb6xA1WPVLmfr9GSJok@public.gmane.org>
@ 2006-09-11  8:07     ` Jochen Küpper
  0 siblings, 0 replies; 10+ messages in thread
From: Jochen Küpper @ 2006-09-11  8:07 UTC (permalink / raw)


On 10. Sep. 2006, David Hansen wrote:

> On Sun, 10 Sep 2006 22:25:04 +0200 Jochen Küpper wrote:
>
>> the only parameter I see related to my problem is nnrss-use-local
>> (= t).
>>
>> With this, I repeatedly get the same entries listed as new. This seems
>> to happen every time I press "g" *and* the date/time of the xml-file
>> is newer than at the previous update, or something similar.

> IMHO before fixing this there must be some kind of agreement which
> RSS tags should be used to build the message id.
>
> If some RSS "expert" is reading this I would be glad to hear which
> tags really indicate some new content and which are just more or
> less useless additions like the <slash:*> tags.

I guess I was too tired when I wrote my previous email, sorry.

Anyway, I read the different discussions when they appeared on the
list. Last time it was suggested to use the time instead of the full
message for hashing to decide on new content. This is supposed to work
around problems like added comments and so forth...

Now I look at my problem and the complete messages in the xml file are
unchanged! No new tags added, no content changed.
But Gnus nevertheless creates a different hash, apparently
because it incorporates the date of the file on disk into the hash!

Thus, even a "touch feed.xml" on my local disk will "create" new
entries in the RSS group. This is something, I believe, that we should
get right (i.e. not use the date of the file).

Disclaimer: I might have no idea what I am talking about with respect
to RSS and nnrss.

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)




* Re: duplicate items in nnrss groups
  2006-09-10 21:48 ` David Hansen
       [not found]   ` <87slizbdgb.fsf-/PVU1wbgskXSxOb6xA1WPVLmfr9GSJok@public.gmane.org>
@ 2006-09-14 16:51   ` Mark Plaksin
       [not found]     ` <wszmd2xuh3.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Mark Plaksin @ 2006-09-14 16:51 UTC (permalink / raw)


David Hansen <david.hansen@gmx.net> writes:

> On Sun, 10 Sep 2006 22:25:04 +0200 Jochen Küpper wrote:
>
>> I use Gnus to read some rss feeds using the configuration shown below.
>>
>> I believe there is nothing special about the config, the only
>> parameter I see related to my problem is nnrss-use-local (= t).
>>
>> With this, I repeatedly get the same entries listed as new. This seems
>> to happen every time I press "g" *and* the date/time of the xml-file
>> is newer than at the previous update, or something similar.
>>
>> Has anybody else seen that? Is this a known problem? Is there any cure
>> for it?
>
> It's a known problem.  Do a gmane search.  So far every
> discussion "just ended".  IMHO before fixing this there must
> be some kind of agreement which RSS tags should be used to
> build the message id.
>
> If some RSS "expert" is reading this I would be glad to hear
> which tags really indicate some new content and which are just
> more or less useless additions like the <slash:*> tags.
>
> Then it should be straightforward to code a fix.

Perhaps no expert is chiming in because everybody thinks the current plan
is great:  Use the title, description, and pubdate.  As far as I can tell
that addresses the concerns raised in the first place:
http://thread.gmane.org/gmane.emacs.gnus.general/62613/focus=62613
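
As a rough illustration of that plan (not the actual nnrss.el code), an
item id could be derived from just those three fields. The function name
and the alist layout of ITEM below are assumptions made for this sketch:

```elisp
;; Hypothetical helper, NOT part of nnrss.el.  ITEM is assumed to be an
;; alist with the keys `title', `description' and `date'; any other
;; keys (comment counts and similar additions) are ignored.
(defun my/rss-item-id (item)
  "Return an MD5 hex digest of ITEM's title, description and date."
  (md5 (mapconcat (lambda (key)
                    ;; A missing field contributes the empty string.
                    (or (cdr (assq key item)) ""))
                  '(title description date)
                  "\n")
       nil nil 'utf-8))
```

With this, two copies of an item that differ only in fields outside the
three named ones get the same id, so extra tags alone would not make the
item look new.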





* Re: duplicate items in nnrss groups
       [not found]     ` <wszmd2xuh3.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
@ 2006-09-14 19:40       ` Jochen Küpper
  2006-09-14 20:50         ` Mark Plaksin
  0 siblings, 1 reply; 10+ messages in thread
From: Jochen Küpper @ 2006-09-14 19:40 UTC (permalink / raw)


On 14. Sep. 2006, Mark Plaksin wrote:

> David Hansen <david.hansen-hi6Y0CQ0nG0@public.gmane.org> writes:
>
>> On Sun, 10 Sep 2006 22:25:04 +0200 Jochen Küpper wrote:

>>> With this, I repeatedly get the same entries listed as new. This
>>> seems to happen every time I press "g" *and* the date/time of the
>>> xml-file is newer than at the previous update, or something
>>> similar.

>> If some RSS "expert" is reading this I would be glad to hear which
>> tags really indicate some new content and which are just more or
>> less useless additions like the <slash:*> tags.

> Perhaps no expert is chiming in because everybody thinks the current
> plan is great: Use the title, description, and pubdate.

Why do you add pubdate here?
Isn't the fact that it is used exactly the problem I describe above?

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)




* Re: duplicate items in nnrss groups
  2006-09-14 19:40       ` Jochen Küpper
@ 2006-09-14 20:50         ` Mark Plaksin
       [not found]           ` <wsejuexjds.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Plaksin @ 2006-09-14 20:50 UTC (permalink / raw)


Jochen Küpper <jochen@fhi-berlin.mpg.de>
writes:

> On 14. Sep. 2006, Mark Plaksin wrote:
>
>> David Hansen <david.hansen@gmx.net> writes:
>>
>>> On Sun, 10 Sep 2006 22:25:04 +0200 Jochen Küpper wrote:
>
>>>> With this, I repeatedly get the same entries listed as new. This
>>>> seems to happen every time I press "g" *and* the date/time of the
>>>> xml-file is newer than at the previous update, or something
>>>> similar.
>
>>> If some RSS "expert" is reading this I would be glad to hear which
>>> tags really indicate some new content and which are just more or
>>> less useless additions like the <slash:*> tags.
>
>> Perhaps no expert is chiming in because everybody thinks the current
>> plan is great: Use the title, description, and pubdate.
>
> Why do you add pubdate here?
> Isn't the fact that it is used exactly the problem I describe above?

Maybe I misunderstood.  I thought the problem you described was
caused by nnrss using the timestamp of the feed's cache ("feed name.el") on
your local disk.  By "pubdate" I meant the date included in each item in
the RSS feed itself.

Depending on the RSS version (I'm no expert) it might be "pubdate" or
"dc:date".  For example, here's Gmane's RSS item for your article:

  <item rdf:about="http://permalink.gmane.org/gmane.emacs.gnus.general/63744">
    <title>Re: duplicate items in nnrss groups</title>
    <link>http://permalink.gmane.org/gmane.emacs.gnus.general/63744</link>
    <description>

Why do you add pubdate here? 
The fact that it is used is the problem I describe above, not?

Greetings,
Jochen
</description>
    <dc:creator>Jochen Küpper</dc:creator>
    <dc:date>2006-09-14T19:40:37</dc:date>
  </item>
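
A minimal sketch of looking up whichever of the two date elements an
item carries. The function name and the alist representation of an item
are assumptions for illustration, not how nnrss.el stores items:

```elisp
;; Hypothetical helper, NOT part of nnrss.el.  ITEM is assumed to be an
;; alist keyed by element name, e.g. ((title . "...") (dc:date . "...")).
(defun my/rss-item-date (item)
  "Return the date string given by the feed for ITEM, or nil if none."
  (or (cdr (assq 'pubDate item))
      (cdr (assq 'dc:date item))))
```

Returning nil for an item without either element keeps "the feed gave no
date" distinguishable from a real date, which matters if the date is
later used to identify the item.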





* Re: duplicate items in nnrss groups
       [not found]           ` <wsejuexjds.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
@ 2006-09-14 21:34             ` Jochen Küpper
  2006-09-15 16:12               ` Mark Plaksin
  0 siblings, 1 reply; 10+ messages in thread
From: Jochen Küpper @ 2006-09-14 21:34 UTC (permalink / raw)


On 14. Sep. 2006, Mark Plaksin wrote:

> Depending on the RSS version (I'm no expert) it might be "pubdate" or
> "dc:date".

Maybe that's the problem. 

The feeds I just checked don't seem to specify a date that way at all,
for example:
,----
|  <item rdf:about="http://link.aip.org/link/?JCPSA6/125/109902/1&amp;agg=rss">
|     <title>Publisher's Note: ``Capillary waves at the liquid-vapor interface and the surface tension of water'' [J. Chem. Phys. [bold 125], 014702 (2006)]</title>
|     <link>http://link.aip.org/link/?JCPSA6/125/109902/1&amp;agg=rss</link>
|     <description>Ahmed E. Ismail, Gary S. Grest, and Mark J. Stevens&lt;br/&gt;   ... [J. Chem. Phys. 125, 109902 (2006)] published Thu Sep 14, 2006.</description>
| </item>
`----
Maybe the date of the xml-file is used as the item date only in that case?

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)




* Re: duplicate items in nnrss groups
  2006-09-14 21:34             ` Jochen Küpper
@ 2006-09-15 16:12               ` Mark Plaksin
       [not found]                 ` <wsmz91un18.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Plaksin @ 2006-09-15 16:12 UTC (permalink / raw)


Jochen Küpper <jochen@fhi-berlin.mpg.de>
writes:

> On 14. Sep. 2006, Mark Plaksin wrote:
>
>> Depending on the RSS version (I'm no expert) it might be "pubdate" or
>> "dc:date".
>
> Maybe that's the problem. 
>
> The feeds I just checked don't seem to specify a date that way at all,
> for example:
> ,----
> |  <item rdf:about="http://link.aip.org/link/?JCPSA6/125/109902/1&amp;agg=rss">
> |     <title>Publisher's Note: ``Capillary waves at the liquid-vapor interface and the surface tension of water'' [J. Chem. Phys. [bold 125], 014702 (2006)]</title>
> |     <link>http://link.aip.org/link/?JCPSA6/125/109902/1&amp;agg=rss</link>
> |     <description>Ahmed E. Ismail, Gary S. Grest, and Mark J. Stevens&lt;br/&gt;   ... [J. Chem. Phys. 125, 109902 (2006)] published Thu Sep 14, 2006.</description>
> | </item>
> `----
> Maybe the date of the xml-file is used as the item date only in that case?

I don't see where nnrss.el is using the timestamp of an XML file.  Maybe I
missed it.  Did you find it already or are you just saying it *seems* like
it's using the timestamp?

It looks like it's simply using the md5 of the whole article so it's
surprising that it would consider two instances of the item above to be
different.  But again, maybe I've missed something in the code. 

It sounds like you're fetching the feeds outside of nnrss and then nnrss is
reading them.  Is that right?  In my case nnrss is fetching the feeds.





* Re: duplicate items in nnrss groups
       [not found]                 ` <wsmz91un18.fsf-QSdS4MO2Nel/RRfBVNJ3YIdd74u8MsAO@public.gmane.org>
@ 2006-09-16 13:02                   ` Jochen Küpper
       [not found]                     ` <9e64fo9d7v.fsf-X+QEHg5KIgm/8B4OpmtwqPxnRIzENc/G@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jochen Küpper @ 2006-09-16 13:02 UTC (permalink / raw)


Mark Plaksin wrote on 15. Sep. 2006:

> I don't see where nnrss.el is using the timestamp of an XML file.
> Maybe I missed it. Did you find it already or are you just saying it
> *seems* like it's using the timestamp?

It seems like it: The articles do have a Date: entry in the *Article*
buffer, which is the timestamp of the file.

> It looks like it's simply using the md5 of the whole article so it's
> surprising that it would consider two instances of the item above to
> be different. But again, maybe I've missed something in the code.

Somewhere it must be generating the Date: header based on the
timestamp of the file.

> It sounds like you're fetching the feeds outside of nnrss and then
> nnrss is reading them. Is that right?

Yes, I have
,----[ C-h v nnrss-use-local RET ]
| nnrss-use-local is a variable defined in `nnrss.el'.
| Its value is t
| 
| Documentation:
| Not documented as a variable.
| 
| [back]
`----

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)




* generated dates should not be hashed (was: duplicate items in nnrss groups)
       [not found]                     ` <9e64fo9d7v.fsf-X+QEHg5KIgm/8B4OpmtwqPxnRIzENc/G@public.gmane.org>
@ 2006-09-16 13:54                       ` Jochen Küpper
  0 siblings, 0 replies; 10+ messages in thread
From: Jochen Küpper @ 2006-09-16 13:54 UTC (permalink / raw)


Ok, I think I have spotted the problem with the duplicate entries I
encounter in nnrss feeds:

For feeds that do not specify a date, the
,----
|     (cond ((null date))
`----
case in nnrss-normalize-date is true. As a result, at the end of that
function
,----
| message-make-date
`----
is called to create a valid date, so every run of nnrss-check-group
uses the current date. When the item is hashed again, the hash differs
from the one computed at the last update because of the new date.

Therefore I would suggest that the date not be used in creating
the item hash, or that this at least be configurable.

Or is there a better way of dealing with the problem?
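
One possible shape of such a change, as a sketch only: keep the generated
date for display, but hash only a date that came from the feed itself.
Both function names below are hypothetical and this is not the code in
nnrss.el; only `md5' and `message-make-date' are existing functions.

```elisp
(require 'message)                      ; provides `message-make-date'

;; Hypothetical helpers, NOT part of nnrss.el.

(defun my/rss-stable-hash (title description feed-date)
  "Hash TITLE, DESCRIPTION and FEED-DATE.
FEED-DATE must be the date found in the feed, or nil if the feed
gives none.  A generated date is never passed in, so the hash of an
unchanged item stays the same across updates."
  (md5 (concat title "\n" description "\n" (or feed-date ""))
       nil nil 'utf-8))

(defun my/rss-header-date (feed-date)
  "Return FEED-DATE, or the current date for display when it is nil."
  (or feed-date (message-make-date)))
```

The design choice here is simply to separate the two uses of the date:
the Date: header may fall back to "now", while the hash input does not.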

Greetings,
Jochen
-- 
Einigkeit und Recht und Freiheit                http://www.Jochen-Kuepper.de
    Liberté, Égalité, Fraternité                GnuPG key: CC1B0B4D
        (Part 3 you find in my messages before fall 2003.)



