RSS Feed Bugs

Posted: Tue Jun 17, 2008 4:36 pm
by jmaxwilson
I've been writing a program that consumes RSS feeds. While testing it against the RSS feeds provided by the church, I found some bugs in the feeds that can cause some really annoying problems.

I know that the church's feeds are processed through feedburner.com, so the XML is well formed and conforms to the specification. But some values are merely propagated from the original feed through feedburner, and it is these values that are problematic.

Here are the problems:

1. Changing GUIDs

Each item in the feed has a guid that is supposed to uniquely identify the item. Using the guid to identify items allows me to simply update the record that I have, even if the title, link, etc. change. This is important when consuming feeds, to avoid storing and displaying duplicate entries. If an item is posted and then the author changes the title or the url of the item, I don't want a duplicate entry in my database or my display. As long as the guid remains the same after the initial post, I can simply update the record based on that key.
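
To make the idea concrete, here is a minimal sketch of updating by guid. The in-memory dict, the upsert helper, and the example.com values are all made up for illustration; a real program would use a database table keyed on the guid.

```python
# Hypothetical in-memory store keyed by guid (stand-in for a database table).
store = {}

def upsert(item):
    """Insert a new item, or overwrite the stored one that has the same guid.

    Returns True if the guid had not been seen before.
    """
    is_new = item["guid"] not in store
    store[item["guid"]] = item
    return is_new

# The same item before and after the author edits its title and url.
first = {"guid": "item-1", "title": "Original title", "link": "http://example.com/a"}
edited = {"guid": "item-1", "title": "Edited title", "link": "http://example.com/b"}

added_first = upsert(first)    # guid not seen yet, so a record is created
added_edited = upsert(edited)  # same guid, so the record is updated in place
print(added_first, added_edited, len(store))
```

Because both versions share one guid, the store ends up with a single record holding the edited title and link.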

However, the guid values in the feeds generated by church websites do change, resulting in duplicate entries. I noticed this in Google Reader, but it wasn't until I was working with my own program that I noticed why.

For example, today the LDS Newsroom feed posted an article on members aiding victims of flooding in Indiana, Iowa, and Wisconsin.

My feed reader quickly picked up the item. When it was first published this was the guid entry in the feed for the item:

Code:

<guid>http://www.lds.org/ldsnewsroom/v/index.jsp?vgnextoid=9ae411154963d010VgnVCM1000004e94610aRCRD</guid>
However, not long afterward, my reader parsed the feed again and it had changed. It now reads:

Code:

<guid>http://www.lds.org/ldsnewsroom/eng/news-releases-stories/mormons-aid-flood-victims-in-indiana-iowa-and-wisconsin</guid>
http://www.lds.org/ldsnewsroom/eng/news ... -wisconsin

So my program, and I would assume other feed-consuming programs as well, creates duplicate entries.

The guid should be some string that can be used to uniquely identify the item, and that will not change. It appears that the current feed generation through Vignette uses the permalink as both the link and the guid for the item, so when the url changes from the ugly vignette oid to the search engine friendly url the guid is also being updated. Fixing this may be as simple as a small change to an XSLT document.
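
I don't know how the feed is actually generated, so the following is only a sketch of the idea, written in Python rather than XSLT. It assumes the generator has access to the Vignette content ID and that this ID stays the same when the url is rewritten; both are my assumptions. The build_item helper and the title text are made up. The isPermaLink="false" attribute is the standard RSS 2.0 way of saying that the guid is an opaque identifier rather than a url.

```python
import xml.etree.ElementTree as ET

def build_item(content_id, title, link):
    """Build an RSS <item> whose guid is a stable ID instead of the permalink."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # isPermaLink="false" tells readers not to treat the guid as a url.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = content_id
    return item

# Assumption: the OID from the first url is a stable key for the article.
oid = "9ae411154963d010VgnVCM1000004e94610aRCRD"

before = build_item(
    oid, "Example title",
    "http://www.lds.org/ldsnewsroom/v/index.jsp?vgnextoid=" + oid)
after = build_item(
    oid, "Example title",
    "http://www.lds.org/ldsnewsroom/eng/news-releases-stories/"
    "mormons-aid-flood-victims-in-indiana-iowa-and-wisconsin")

print(ET.tostring(after, encoding="unicode"))
```

With something like this the link is free to change from the oid url to the friendly url, while the guid that readers key on stays put.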

2. No time in pubDate

The second problem is that the time in the lastBuildDate of the feed and in the pubDate of the items in the church RSS feeds is always set to 00:00:00.

Code:

<lastBuildDate>Tue, 17 Jun 2008 00:00:00 -0700</lastBuildDate>

Code:

<pubDate>Tue, 17 Jun 2008 00:00:00 -0700</pubDate>
This causes two problems.

The first is that the lastBuildDate only changes once a day. If new items are posted partway through the day, some feed parsers may not be triggered to look for new data: the lastBuildDate is the same as the last time they checked, so they assume there is no need to parse the rest of the feed XML. As a result, new feed items show up in those readers only once a day, when the lastBuildDate changes. By adding the time to the lastBuildDate, items will be seen by readers more quickly and the information will be more timely.
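
Here is a rough sketch of the kind of check I mean. The should_reparse function is made up for illustration, and the 13:23 timestamp is a hypothetical value showing what the feed could report if it included the time; not every reader works this way.

```python
from email.utils import parsedate_to_datetime

def should_reparse(last_seen, last_build_date):
    """Return True only when the feed's lastBuildDate is newer than the one seen before."""
    current = parsedate_to_datetime(last_build_date)
    return last_seen is None or current > last_seen

# What the reader saw on its first check of the day.
seen = parsedate_to_datetime("Tue, 17 Jun 2008 00:00:00 -0700")

# Later the same day the feed has new items, but lastBuildDate is unchanged.
unchanged = should_reparse(seen, "Tue, 17 Jun 2008 00:00:00 -0700")

# Hypothetical: the same feed if lastBuildDate carried the real build time.
with_time = should_reparse(seen, "Tue, 17 Jun 2008 13:23:00 -0700")

print(unchanged, with_time)
```

With the midnight value the check says nothing has changed and the new items are skipped until the next day; with a real time the feed would be parsed again right away.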

The second problem is that because none of the items has a time in the pubDate, they don't always appear at the top of lists sorted by date (as lists often are) when they are mixed in with items from other feeds. For instance, let's say that I aggregate multiple feeds. Between 7:00am and 12:00 noon I pick up 10 items from various news and blog feeds. Then at 1:23pm the church newsroom posts a new article. The feed reader picks it up, but because the pubDate in the feed lists the time as 00:00:00 -0700, the headline is placed in the date-sorted list as if it had been posted before all 10 of the items already aggregated from other sources. This causes needless confusion and makes it less likely for church content to be seen, because it has already been pushed below the fold or even off the page, even though it is new and should be at the top.
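
A small sketch of the sorting effect, using made-up items from other feeds alongside a newsroom item that carries the midnight pubDate:

```python
from email.utils import parsedate_to_datetime

# (title, pubDate) pairs; the first two are invented items from other feeds.
items = [
    ("Blog post", "Tue, 17 Jun 2008 07:00:00 -0700"),
    ("News story", "Tue, 17 Jun 2008 12:00:00 -0700"),
    # Really posted in the afternoon, but the feed reports midnight.
    ("Newsroom article", "Tue, 17 Jun 2008 00:00:00 -0700"),
]

# Sort newest first, the way most aggregated lists are displayed.
newest_first = sorted(
    items, key=lambda item: parsedate_to_datetime(item[1]), reverse=True)
titles = [title for title, _ in newest_first]
print(titles)
```

The newsroom article lands at the bottom of the list even though it is the newest item.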

I hope that the website developers for the feeds at lds.org can fix the guid, lastBuildDate, and pubDate for the feeds, not only to help those of us who are trying to consume the feeds, but also to make sure that the content reaches readers in a timely manner and does not drive them away by creating duplicates.

Let me know if there are other considerations that have not occurred to me.

Thanks,

J. Max Wilson
http://www.sixteensmallstones.org