SELECT * FROM RSS_FEED – querying feeds in SQL using Table Functions and XML parsing

In a recent article – Building an RSS Feed Reader in PL/SQL – using dbms_xmlparser, dbms_xmldom and dbms_xslprocessor for parsing and transforming – I discussed how we can create an RSS Feed Reader in PL/SQL. It turns out to be simple to parse an XML document found on any URL on the internet using the dbms_xmlparser package. In this article, I demonstrate how we can also apply XSLT stylesheet to transform the RSS data into more presentable formats, such as HTML. 

In this article, we take the RSS data for another ride. This time publishing the RSS items in SQL as an almost-Table using the concept of Table Functions. This allows us to query RSS Feeds out there on the internet as if they were part of our own database! In the final part of the article, we create an RSS News Feed Archive using dbms_job, the MERGE statement and our table function.

....

In order to be able to publish a Table Function, we need to define a Collection Type, a TABLE OF <SOME OBJECT TYPE>, that our function will return. Since the Table Function will represent a collection of RSS Items, the base type will be:

create type rss_item_type is object
( title varchar2(2000)
, link varchar2(200)
, pubDate varchar2(200)
, author varchar2(200)
, category varchar2(500)
, description varchar2(4000)
)
/
 

The Collection Type is nothing but a simple Nested Table of this type:

create type rss_item_type_tbl is 
table of rss_item_type
/
 

Now our function – that we may call a Table Function as it returns a Collection Type – looks like this:

create or replace 
function rss_feed
( p_url in varchar2
) return rss_item_type_tbl
is
l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();

rssFeedDoc dbms_xmldom.domdocument;
itemNode dbms_xmldom.domnode;
childnode dbms_xmldom.domnode;
rssItemsList dbms_xmldom.domnodeList ;
rssItemFields dbms_xmldom.domnodeList ;
childNodeText varchar2(32000);

l_rss_item rss_item_type;

begin
rssFeedDoc := dbms_xmlparser.parse(p_url);
rssItemsList:= dbms_xmldom.GETELEMENTSBYTAGNAME(rssFeedDoc, 'item');
for i in 1..dbms_xmldom.getlength(rssItemsList) loop
itemNode:= dbms_xmldom.item(rssItemsList,i-1);
rssItemFields:= dbms_xmldom.getChildNodes(itemNode);
l_rss_item:= rss_item_type('','','','','','' );
for j in 1..dbms_xmldom.getlength(rssItemFields) loop
childnode:= dbms_xmldom.item(rssItemFields, j-1);
childNodeText:= dbms_xmldom.getNodeValue(dbms_xmldom.getFirstChild(childnode));
case dbms_xmldom.getNodeName(childnode)
when 'title'
then l_rss_item.title := childNodeText;
when 'link'
then l_rss_item.link := childNodeText;
when 'pubDate'
then l_rss_item.pubDate := childNodeText;
when 'dc:creator'
then l_rss_item.author := childNodeText;
when 'description'
then l_rss_item.description := childNodeText;
when 'category'
then l_rss_item.category := childNodeText;
else null;
end case;
end loop;
l_rss_items_tbl.extend();
l_rss_items_tbl(l_rss_items_tbl.last):=l_rss_item;
end loop;
return l_rss_items_tbl;
end rss_feed;
/

 

This function roughly does the following: 

 

  • Parse the RSS Feed from the url specified in the p_url parameter into the rssFeedDoc variable
  • From this XML Document, we get a NodeList with all ITEM elements
  • We iterate over these ITEM nodes. For each ITEM we reinstantiate the l_rss_item variable
  • Using the (text)values from the child nodes of the current ITEM nodem we flesh out the l_rss_item
  • Finally we extend the l_rss_items_tbl collection with the newly created l_rss_item

 

So now we have a function that we can call with the URL of some RSS Feed that will return a rss_item_type_tbl. This allows us to do something rather trivial like selecting the recent stories on our own AMIS Technology Weblog:

select title
, pubDate
, author
from table( rss_feed('https://technology.amis.nl/blog/?feed=rss2'))
/
SELECT * FROM RSS_FEED - querying feeds in SQL using Table Functions and XML parsing rssSel1

 

We can make life a little simpler for our users by wrapping this table function in view definitions like this one:

create or replace 
function amis_blog_feed
return rss_item_type_tbl
is
begin
return rss_feed(p_url => 'https://technology.amis.nl/blog/?feed=rss2');
end amis_blog_feed;
/
create view amis_blog_rss
as
select *
from table(amis_blog_feed)
/

select title
from amis_blog_rss
/
 

Using this approach, we can create a series of Functions and Views for all RSS Feeds of interest.

create or replace 
function oracle_blogs_feed
return rss_item_type_tbl
is
begin
return rss_feed(p_url => 'http://blogs.oracle.com/readingLists/oracleblogs.xml');
end oracle_blogs_feed;
/

create or replace view oracle_blogs_rss
as
select *
from table(oracle_blogs_feed)
/
 

Merging RSS Feeds – Querying a bunch of RSS Feeds

Oracle 10g had this new feature that allows us to UNION together collections in PL/SQL. This brings a rather neat way of creating views that span multiple RSS Feeds and allowing users to query all RSS Items related to Sport, News or Technology. Let’s build a little foundation first: a table RSS_FEEDS that we use to record all RSS Feeds of interest:

create table rss_feeds
( title varchar2(200)
, url varchar2(500)
, category varchar2(50)
)
/
 

We insert a number of interesting RSS Feeds from three categories: News, Sport and Technology (Oracle and Java):

insert into rss_feeds
( title, url, category)
values
('BBC News Headlines', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml', 'news')
/
insert into rss_feeds
( title, url, category)
values
('CNN News', 'http://rss.cnn.com/rss/cnn_topstories.rss', 'news')
/
insert into rss_feeds
( title, url, category)
values
('Dutch News: nu.nl', 'http://www.nu.nl/deeplink_rss2/index.jsp?r=Algemeen', 'news')
/
insert into rss_feeds
( title, url, category)
values
('Dutch Sports News: nu.nl', 'http://www.nu.nl/deeplink_rss2/index.jsp?r=Sport', 'sport')
/
insert into rss_feeds
( title, url, category)
values
('BBC Sports News', 'http://newsrss.bbc.co.uk/rss/sportonline_world_edition/front_page/rss.xml', 'sport')
/
insert into rss_feeds
( title, url, category)
values
('ESPN Sport Headlines', 'http://sports.espn.go.com/espn/rss/news', 'sport')
/
insert into rss_feeds
( title, url, category)
values
('AMIS Technology Blog', 'https://technology.amis.nl/blog/?feed=rss2', 'Technology (Oracle and Java)')
/
insert into rss_feeds
( title, url, category)
values
('The Server Side - TSS Homepage News', 'http://feeds.feedburner.com/techtarget/tsscom/home', 'Technology (Oracle and Java)')
/
insert into rss_feeds
( title, url, category)
values
('The Server Side - J2EE Discussions', 'http://feeds.feedburner.com/techtarget/tsscom/j2eediscussions', 'Technology (Oracle and Java)')
/
insert into rss_feeds
( title, url, category)
values
('Oracle Technology Network - News', 'http://www.oracle.com/technology/syndication/rss_otn_news.xml', 'Technology (Oracle and Java)')
/
insert into rss_feeds
( title, url, category)
values
('DevX Latest Published Articles', 'http://services.devx.com/outgoing/devxfeed.xml', 'Technology (Oracle and Java)')
/
 

Now we can create the following function that returns a rss_item_type_tbl that we can use in our SQL queries, this time based on the joint RSS Feeds for the indicated category:

create or replace 
function rss_feeds_reader
( p_category in varchar2)
return rss_item_type_tbl
is
l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();
begin
for r in (select url from rss_feeds where category = p_category) loop
l_rss_items_tbl:= l_rss_items_tbl
MULTISET UNION
rss_feed(r.url);
end loop;
return l_rss_items_tbl;
end rss_feeds_reader;
/
 

It turns out that to be able to UNION together the rss_item_type_tbl collections, we have to define a so called MAP function on the RSS_ITEM_TYPE type. See for another example http://www.adp-gmbh.ch/blog/2006/01/08.html . If we do not have this MAP function – that is seemingly pointless – we get this error:

Errors for FUNCTION NEWS_FEEDS:

LINE/COL ERROR
-------- -----------------------------------------------------------------
7/5 PLS-00801: internal error [*** ASSERT at file pdw4.c, line 2045;
Type 0x0A3B7470 has no MAP method.; NEWS_FEEDS__UP__F__103002[7,
5]]
 

Defining the MAP function is as easy as:

alter type rss_item_type 
add map member function m
return number invalidate
/
 

Now we are set to combining RSS Feeds into grand RSS collections. One way of doing this would be by using wrappers per category, like this function for all news related RSS Feeds:

create or replace 
function news_feeds
return rss_item_type_tbl
is
begin
return rss_feeds_reader( p_category => 'news');
end news_feeds;
/
 

We can query this function like this: 

select title
from table(news_feeds)
order
by title

The results start like:

TITLE
-------------------------------------------------------
'RV' drives to box office success
'Troepen Uruzgan terug als werken onmogelijk is'
2 suspected terrorists killed in Iraq
Afghans free jailed US journalist
Argentina braced for mill protest
Author Pramoedya Ananta Toer dies
Blast at China coal mine kills 27
Car blast near Nigeria oil port
China payout in crackdown death
Darfur deal hangs in the balance
Dead soldier found in hotel air conditioner
Deel plan Bos valt slecht bij achterban
Drie aanhoudingen na hondengevecht
Dutch halt Kenya aid over graft
EU deadline on Mladic approaches
Economist, chronicler of affluent society dies at 97
Egypt extends its emergency laws
Energy chief: High gas prices could last 3 years
England lands Cricket World Cup
Europe heart care 'fails adults'
Football: Cup gamble for Rooney

We can create a similar summary function for Oracle and Java related RSS items:

create or replace 
function technology_feeds
return rss_item_type_tbl
is
begin
return rss_feeds_reader( p_category => 'Technology (Oracle and Java)');
end technology_feeds;
/
 

Instead of using these functions to wrap different categories, there is also the following approach.

create type string_tbl
is table of varchar2(250)
/
create or replace
function rss_feeds_tbl_reader
( p_rss_tbl in string_tbl)
return rss_item_type_tbl
is
l_rss_items_tbl rss_item_type_tbl:= rss_item_type_tbl();
l_idx integer:= p_rss_tbl.first;
begin
if l_idx is not null
then
loop
l_rss_items_tbl:= l_rss_items_tbl
MULTISET UNION
rss_feed(p_rss_tbl(l_idx));
l_idx:= p_rss_tbl.next(l_idx);
exit when l_idx is null;
end loop;
end if;
return l_rss_items_tbl;
end rss_feeds_tbl_reader;
/
 

Here the function rss_feeds_tbl_reader does not accept a single category, it expects a string_tbl collection of all the URLs it should collect RSS Feeds from. We can use this generic function in the following way in a simple SQL query:

select title
from table
( rss_feeds_tbl_reader
( cast
( multiset ( select url
from rss_feeds
where category = 'news'
) as string_tbl
)
)
)
/
 

The benefit of using this approach is that we can more easily specify – without the need for additional PL/SQL wrappers – which URLs to include when retrieving the RSS Feeds.

Build an RSS Archive

Now that we have such nice results, why not build an archive for all the RSS items we find interesting! One thing about RSS feeds is their volatility. Or at least the items in these feeds. Now you them, now you don’t. After a short while, these items are flushed out of the top 10 or 20 of whatever of the RSS Feed and they disappear without trace. What I will do in this section is build a very simple archive with the history of the RSS Feeds I find interesting. This archive is held in a single simple table:

create table rss_history
( title varchar2(2000)
, link varchar2(200)
, pubDate varchar2(200)
, author varchar2(200)
, category varchar2(500)
, description varchar2(4000)
)
/
 

With a rather straightforward MERGE statement we can refresh the archive with the current RSS Feed offering:

merge 
into rss_history rh
using ( select title
, link
, pubDate
, author
, description
, category
from table
( rss_feeds_tbl_reader
( cast
( multiset ( select url
from rss_feeds
where category = 'news'
) as string_tbl
)
)
)
) rl
on (rh.link = rl.link)
when not matched
then insert (title, link, pubDate, author, category, description)
values (rl.title, rl.link, rl.pubDate, rl.author, rl.category, rl.description)
/
 

 

Of course I can try to remember to execute this MERGE statement every now and again, but of course that won’t do the trick: I will miss out on a great number of RSS items as some feeds have rather short recycle times. An easy way out here is the use of a Job that will run say every fifteen minutes to automatically refresh the archive. I first wrap the MERGE statement in a PL/SQL procedure:

create or replace 
procedure update_rss_history
is
begin
merge
into rss_history rh
using ( select title
, link
, pubDate
, author
, description
, category
from table
( rss_feeds_tbl_reader
( cast
( multiset ( select url
from rss_feeds
where category = 'news'
) as string_tbl
)
)
)
) rl
on (rh.link = rl.link)
when not matched
then insert (title, link, pubDate, author, category, description)
values (rl.title, rl.link, rl.pubDate, rl.author, rl.category, rl.description)
;
end;
/
 

Then I create the Job that will perform the archive update operation every quarter of an hour:

declare
l_job binary_integer;
begin
dbms_job.submit
( job => l_job
, what => 'begin
update_rss_history;
commit;
end;'
, next_date => sysdate
, interval => 'sysdate + 15/60/24' -- refresh every quarter
);
commit;
end;
/
 

After scheduling this job, I can sit back an relax. I could take the occasional look at the job, to see if it is still running:

select job
, last_date
, last_sec
, this_date
, this_sec
, total_time
, what
from user_jobs
/
JOB LAST_DATE LAST_SEC TOTAL_TIME
---------- --------- -------- ----------
WHAT
------------------------------
246 30-APR-06 21:45:19 9
begin
update_rss_history;
commit;
end;

and of course I can check out the archive itself – after its first hour of operation, I have collected 90 newsitems. It stands to reason that I will quickly build up a substantial archive that allows me to browse the short term history as represented by News Headlines from prominent RSS Feeds. We will see more on this archive in article to come. 

select count(*) 
from rss_history
/
COUNT(*)
---------
90