SELECT * FROM RSS_FEED – querying feeds in SQL using Table Functions and XML parsing

0

In a recent article – Building an RSS Feed Reader in PL/SQL – using dbms_xmlparser, dbms_xmldom and dbms_xslprocessor for parsing and transforming – I discussed how we can create an RSS Feed Reader in PL/SQL. It turns out to be simple to parse an XML document found on any URL on the internet using the dbms_xmlparser package. In this article, I demonstrate how we can also apply XSLT stylesheet to transform the RSS data into more presentable formats, such as HTML. 

In this article, we take the RSS data for another ride. This time publishing the RSS items in SQL as an almost-Table using the concept of Table Functions. This allows us to query RSS Feeds out there on the internet as if they were part of our own database! In the final part of the article, we create an RSS News Feed Archive using dbms_job, the MERGE statement and our table function.

....

In order to be able to publish a Table Function, we need to define a Collection Type, a TABLE OF <SOME OBJECT TYPE>, that our function will return. Since the Table Function will represent a collection of RSS Items, the base type will be:

create type rss_item_type is object<br /> ( title    varchar2(2000)<br /> , link     varchar2(200)<br /> , pubDate  varchar2(200)<br /> , author   varchar2(200)<br /> , category varchar2(500)<br /> , description varchar2(4000)<br /> )<br /> /<br />&nbsp;

The Collection Type is nothing but a simple Nested Table of this type:

create type rss_item_type_tbl is <br />table of rss_item_type<br />/<br />&nbsp;

Now our function – that we may call a Table Function as it returns a Collection Type – looks like this:

create or replace <br />function rss_feed<br />( p_url in varchar2<br />) return rss_item_type_tbl<br />is<br />  l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();<br /><br /> rssFeedDoc    dbms_xmldom.domdocument;<br /> itemNode      dbms_xmldom.domnode;<br /> childnode     dbms_xmldom.domnode;<br /> rssItemsList  dbms_xmldom.domnodeList  ;<br /> rssItemFields dbms_xmldom.domnodeList  ;<br /> childNodeText varchar2(32000);<br /> <br /> l_rss_item        rss_item_type;<br /><br />begin<br />   rssFeedDoc   := dbms_xmlparser.parse(p_url);<br />   rssItemsList:= dbms_xmldom.GETELEMENTSBYTAGNAME(rssFeedDoc, 'item');<br />   for i in 1..dbms_xmldom.getlength(rssItemsList) loop<br />     itemNode:= dbms_xmldom.item(rssItemsList,i-1);<br />     rssItemFields:= dbms_xmldom.getChildNodes(itemNode);<br />     l_rss_item:= rss_item_type('','','','','','' );<br />     for j in 1..dbms_xmldom.getlength(rssItemFields) loop<br />       childnode:= dbms_xmldom.item(rssItemFields, j-1);<br />       childNodeText:= dbms_xmldom.getNodeValue(dbms_xmldom.getFirstChild(childnode));<br />       case dbms_xmldom.getNodeName(childnode)<br />       when 'title'<br />       then l_rss_item.title := childNodeText;<br />       when 'link'<br />       then l_rss_item.link := childNodeText;<br />       when 'pubDate'<br />       then l_rss_item.pubDate := childNodeText;<br />       when 'dc:creator'<br />       then l_rss_item.author := childNodeText;<br />       when 'description'<br />       then l_rss_item.description := childNodeText;<br />       when 'category'<br />       then l_rss_item.category := childNodeText;<br />       else null;<br />       end case;             <br />     end loop;<br />     l_rss_items_tbl.extend();<br />     l_rss_items_tbl(l_rss_items_tbl.last):=l_rss_item; <br />   end loop;<br />   return l_rss_items_tbl;<br />end rss_feed;<br />/<br /><br />

 

This function roughly does the following: 

 

  • Parse the RSS Feed from the url specified in the p_url parameter into the rssFeedDoc variable
  • From this XML Document, we get a NodeList with all ITEM elements
  • We iterate over these ITEM nodes. For each ITEM we reinstantiate the l_rss_item variable
  • Using the (text)values from the child nodes of the current ITEM nodem we flesh out the l_rss_item
  • Finally we extend the l_rss_items_tbl collection with the newly created l_rss_item

 

So now we have a function that we can call with the URL of some RSS Feed that will return a rss_item_type_tbl. This allows us to do something rather trivial like selecting the recent stories on our own AMIS Technology Weblog:

<p>select title<br />,      pubDate<br />,      author<br />from   table( rss_feed('http://technology.amis.nl/blog/?feed=rss2'))<br />/<br /><img src="http://technology.amis.nl/wp-content/uploads/images/rssSel1.jpg" /> <br /></p>

 

We can make life a little simpler for our users by wrapping this table function in view definitions like this one:

create or replace <br />function amis_blog_feed<br />return rss_item_type_tbl<br />is<br />begin<br />  return rss_feed(p_url =&gt; 'http://technology.amis.nl/blog/?feed=rss2');<br />end amis_blog_feed;<br />/<br />create view amis_blog_rss<br />as<br />select *<br />from   table(amis_blog_feed)<br />/<br /><br />select title<br />from   amis_blog_rss<br />/<br />&nbsp;

Using this approach, we can create a series of Functions and Views for all RSS Feeds of interest.

create or replace <br />function oracle_blogs_feed<br />return rss_item_type_tbl<br />is<br />begin<br />  return rss_feed(p_url =&gt; 'http://blogs.oracle.com/readingLists/oracleblogs.xml');<br />end oracle_blogs_feed;<br />/<br /><br />create or replace view oracle_blogs_rss<br />as<br />select * <br />from   table(oracle_blogs_feed)<br />/<br />&nbsp;

Merging RSS Feeds – Querying a bunch of RSS Feeds

Oracle 10g had this new feature that allows us to UNION together collections in PL/SQL. This brings a rather neat way of creating views that span multiple RSS Feeds and allowing users to query all RSS Items related to Sport, News or Technology. Let’s build a little foundation first: a table RSS_FEEDS that we use to record all RSS Feeds of interest:

create table rss_feeds<br />( title varchar2(200)<br />, url   varchar2(500)<br />, category varchar2(50) <br />)<br />/<br />&nbsp;

We insert a number of interesting RSS Feeds from three categories: News, Sport and Technology (Oracle and Java):

insert into rss_feeds<br />( title, url, category)<br />values<br />('BBC News Headlines', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml', 'news')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('CNN News', 'http://rss.cnn.com/rss/cnn_topstories.rss', 'news')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('Dutch News: nu.nl', 'http://www.nu.nl/deeplink_rss2/index.jsp?r=Algemeen', 'news')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('Dutch Sports News: nu.nl', 'http://www.nu.nl/deeplink_rss2/index.jsp?r=Sport', 'sport')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('BBC Sports News', 'http://newsrss.bbc.co.uk/rss/sportonline_world_edition/front_page/rss.xml', 'sport')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('ESPN Sport Headlines', 'http://sports.espn.go.com/espn/rss/news', 'sport')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('AMIS Technology Blog', 'http://technology.amis.nl/blog/?feed=rss2', 'Technology (Oracle and Java)')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('The Server Side - TSS Homepage News', 'http://feeds.feedburner.com/techtarget/tsscom/home', 'Technology (Oracle and Java)')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('The Server Side - J2EE Discussions', 'http://feeds.feedburner.com/techtarget/tsscom/j2eediscussions', 'Technology (Oracle and Java)')<br />/<br />insert into rss_feeds<br />( title, url, category)<br /
/>values<br />('Oracle Technology Network - News', 'http://www.oracle.com/technology/syndication/rss_otn_news.xml', 'Technology (Oracle and Java)')<br />/<br />insert into rss_feeds<br />( title, url, category)<br />values<br />('DevX Latest Published Articles', 'http://services.devx.com/outgoing/devxfeed.xml', 'Technology (Oracle and Java)')<br />/<br />&nbsp;

Now we can create the following function that returns a rss_item_type_tbl that we can use in our SQL queries, this time based on the joint RSS Feeds for the indicated category:

create or replace <br />function rss_feeds_reader<br />( p_category in varchar2)<br />return rss_item_type_tbl<br />is<br />  l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();<br />begin<br />  for r in (select url from rss_feeds where category = p_category) loop<br />    l_rss_items_tbl:= l_rss_items_tbl <br />                      MULTISET UNION<br />                      rss_feed(r.url);<br />  end loop;<br />  return l_rss_items_tbl;<br />end rss_feeds_reader;<br />/<br />&nbsp;

It turns out that to be able to UNION together the rss_item_type_tbl collections, we have to define a so called MAP function on the RSS_ITEM_TYPE type. See for another example http://www.adp-gmbh.ch/blog/2006/01/08.html . If we do not have this MAP function – that is seemingly pointless – we get this error:

Errors for FUNCTION NEWS_FEEDS:<br /><br />LINE/COL ERROR<br />-------- -----------------------------------------------------------------<br />7/5      PLS-00801: internal error [*** ASSERT at file pdw4.c, line 2045;<br />         Type 0x0A3B7470 has no MAP method.; NEWS_FEEDS__UP__F__103002[7,<br />         5]]<br />&nbsp;

Defining the MAP function is as easy as:

alter type rss_item_type <br />add map member function m  <br />return number invalidate<br />/<br />&nbsp;

Now we are set to combining RSS Feeds into grand RSS collections. One way of doing this would be by using wrappers per category, like this function for all news related RSS Feeds:

create or replace <br />function news_feeds<br />return rss_item_type_tbl<br />is<br />begin<br />  return rss_feeds_reader( p_category =&gt; 'news');<br />end news_feeds;<br />/<br />&nbsp;

We can query this function like this: 

select title<br />from   table(news_feeds)<br />order<br />by     title<br />/&nbsp;<br />

The results start like:

TITLE<br />-------------------------------------------------------<br />'RV' drives to box office success<br />'Troepen Uruzgan terug als werken onmogelijk is'<br />2 suspected terrorists killed in Iraq<br />Afghans free jailed US journalist<br />Argentina braced for mill protest<br />Author Pramoedya Ananta Toer dies<br />Blast at China coal mine kills 27<br />Car blast near Nigeria oil port<br />China payout in crackdown death<br />Darfur deal hangs in the balance<br />Dead soldier found in hotel air conditioner<br />Deel plan Bos valt slecht bij achterban<br />Drie aanhoudingen na hondengevecht<br />Dutch halt Kenya aid over graft<br />EU deadline on Mladic approaches<br />Economist, chronicler of affluent society dies at 97<br />Egypt extends its emergency laws<br />Energy chief: High gas prices could last 3 years<br />England lands Cricket World Cup<br />Europe heart care 'fails adults'<br />Football: Cup gamble for Rooney <br />

We can create a similar summary function for Oracle and Java related RSS items:

create or replace <br />function technology_feeds<br />return rss_item_type_tbl<br />is<br />begin<br />  return rss_feeds_reader( p_category =&gt; 'Technology (Oracle and Java)');<br />end technology_feeds;<br />/<br />&nbsp;

Instead of using these functions to wrap different categories, there is also the following approach.

create type string_tbl<br />is table of varchar2(250)<br />/<br />create or replace <br />function rss_feeds_tbl_reader<br />( p_rss_tbl in string_tbl)<br />return rss_item_type_tbl<br />is<br />  l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();<br />  l_idx integer:= p_rss_tbl.first;<br />begin<br />  if l_idx is not null<br />  then <br />    loop  <br />      l_rss_items_tbl:= l_rss_items_tbl <br />                        MULTISET UNION<br />                        rss_feed(p_rss_tbl(l_idx));<br />      l_idx:= p_rss_tbl.next(l_idx);<br />      exit when l_idx is null;<br />    end loop;<br />  end if;<br />  return l_rss_items_tbl;<br />end rss_feeds_tbl_reader;<br />/<br />&nbsp;

Here the function rss_feeds_tbl_reader does not accept a single category, it expects a string_tbl collection of all the URLs it should collect RSS Feeds from. We can use this generic function in the following way in a simple SQL query:

select title<br />from   table<br />       ( rss_feeds_tbl_reader<br />         ( cast<br />           ( multiset ( select url<br />                        from   rss_feeds<br />                        where  category = 'news'<br />                      ) as string_tbl<br />           )<br />         )<br />       )<br />/<br />&nbsp;

The benefit of using this approach is that we can more easily specify – without the need for additional PL/SQL wrappers – which URLs to include when retrieving the RSS Feeds.

Build an RSS Archive

Now that we have such nice results, why not build an archive for all the RSS items we find interesting! One thing about RSS feeds is their volatility. Or at least the items in these feeds. Now you them, now you don’t. After a short while, these items are flushed out of the top 10 or 20 of whatever of the RSS Feed and they disappear without trace. What I will do in this section is build a very simple archive with the history of the RSS Feeds I find interesting. This archive is held in a single simple table:

create table rss_history<br /> ( title    varchar2(2000)<br /> , link     varchar2(200)<br /> , pubDate  varchar2(200)<br /> , author   varchar2(200)<br /> , category varchar2(500)<br /> , description varchar2(4000)<br /> )<br /> /<br />&nbsp;

With a rather straightforward MERGE statement we can refresh the archive with the current RSS Feed offering:

merge <br />into rss_history rh <br />using ( select title<br />        ,      link<br />        ,      pubDate<br />        ,      author<br />        ,      description<br />        ,      category<br />        from   table<br />               ( rss_feeds_tbl_reader<br />                 ( cast<br />                   ( multiset ( select url<br />                                from   rss_feeds<br />                                where  category = 'news'<br />                              ) as string_tbl<br />                   )<br />                 )<br />               )<br />       ) rl<br />on (rh.link = rl.link)<br />when not matched<br />then insert (title, link, pubDate, author, category, description)<br />     values (rl.title, rl.link, rl.pubDate, rl.author, rl.category, rl.description)      <br />/<br />&nbsp;

 

Of course I can try to remember to execute this MERGE statement every now and again, but of course that won’t do the trick: I will miss out on a great number of RSS items as some feeds have rather short recycle times. An easy way out here is the use of a Job that will run say every fifteen minutes to automatically refresh the archive. I first wrap the MERGE statement in a PL/SQL procedure:

create or replace <br />procedure update_rss_history<br />is<br />begin<br />  merge <br />  into rss_history rh <br />  using ( select title<br />          ,      link<br />          ,      pubDate<br />          ,      author<br />          ,      description<br />          ,      category<br />          from   table<br />                 ( rss_feeds_tbl_reader<br />                   ( cast<br />                     ( multiset ( select url<br />                                  from   rss_feeds<br />                                  where  category = 'news'<br />                                ) as string_tbl<br /

/>                     )<br />                   )<br />                 )<br />         ) rl<br />  on (rh.link = rl.link)<br />  when not matched<br />  then insert (title, link, pubDate, author, category, description)<br />       values (rl.title, rl.link, rl.pubDate, rl.author, rl.category, rl.description)      <br />       ;<br />end;     <br />/<br />&nbsp;

Then I create the Job that will perform the archive update operation every quarter of an hour:

declare<br />  l_job binary_integer;<br />begin<br />  dbms_job.submit<br />  ( job =&gt; l_job<br />  , what =&gt; 'begin <br />               update_rss_history; <br />               commit; <br />             end;'<br />  , next_date =&gt; sysdate<br />  , interval =&gt; 'sysdate + 15/60/24'  -- refresh every quarter<br />  );<br />  commit;<br />end;  <br />/<br />&nbsp;

After scheduling this job, I can sit back an relax. I could take the occasional look at the job, to see if it is still running:

select job<br />,      last_date<br />,      last_sec<br />,      this_date<br />,      this_sec<br />,      total_time<br />,      what<br />from user_jobs  <br />/<br />       JOB LAST_DATE LAST_SEC TOTAL_TIME<br />---------- --------- -------- ----------<br />WHAT<br />------------------------------<br />       246 30-APR-06 21:45:19          9<br />begin<br />  update_rss_history;<br />  commit;<br />end; <br />

and of course I can check out the archive itself – after its first hour of operation, I have collected 90 newsitems. It stands to reason that I will quickly build up a substantial archive that allows me to browse the short term history as represented by News Headlines from prominent RSS Feeds. We will see more on this archive in article to come. 

select count(*) <br />from rss_history<br />/ <br /> COUNT(*)<br />---------<br />       90<br /><p><br /></p>
<p>&nbsp;</p>
Share.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor on diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book: Oracle SOA Suite 11g Handbook. Frequent presenter on conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

Comments are closed.