At home I have a small database which contains a collection of my music tracks. Among other things I use this database to create random play lists. The tables in the database contain information about artists, albums, tracks and the location of the actual file on the file system. This “metadata� is generated by another application in the form of XML files. Ever since I started learning about XML I was intrigued by how to pick up these file’s, extract the data out of them and store that in the database. While roaming the internet I came across a lot of examples, mostly about bookstores, but they were never quit what I was searching for. But now I found one that helped my to finally grasp a approach that works.
It is an Oracle demo: Oracle XML DB Downloadable Demo
This demo is about the classical order/order line situation. But now they are available as XML files and the demo shows how to process these files/data. The demo uses XML schema, annotations and the Oracle XML DB repository. In a nutshell this is how it works: Create a XML schema that describes the XML, uses annotations to assign objects to store the data, register this schema in Oracle XML DB repository, insert the XML file into the objects using the createschemabasexml method, write a view on top of the object table and process the data. This is a slightly different approach then demonstrated in the demo because I don’t fully understand it yet.
Lets look at the steps in more detail. The first step is to take a look at the data source: the XML files that contain data about artist and tracks. luckily these are very simple and look like this:
They consists of series of media file “sections� which contains the data I want to store in the database. I used Jdeveloper 10g to create a XML schema which describes this format.
The design view looks like the picture below.
Roughly speaking this picture tells us that an XML file contains a BODY which is build up with a HEAD part and 1 to100 MEDIA FILE part(s). Each MEDIA FILE consists of the shown elements. This XML schema describes how the XML should be build up. It’s like a constraint. If an XML file does not match this schema it is rejected during insert. Another important tool you can use in XML schema are annotations. These annotations give you control over how the XML files will be stored. I do not quite fully understand how it works and what best practices are. So maybe it is silly what I did, lets say it a first step. In every XML file there is a hierarchy. The root element (BODY) contains all the others and the MEDIA FILE part is a collection of similar items. With the annotations you can tell Oracle to create a table for the root element:
xsd:element name=”body” xdb:defaultTable=”MZK_TRACKS_IMP”
for the MEDIA FILE items you can define:
xsd:element name=”media_file” minOccurs=”1″ maxOccurs=”100″
xdb:SQLName=”MEDIAFILE_REC”
xdb:SQLCollType=”MEDIAFILE_COLLECTION”
xsd:complexType
xdb:SQLType=”MEDIAFILE_TYPE”
These examples are not all the annotations I used. The next step is to register this schema in the Oracle XML DB repository. In the demo they use Ftp to pick up the file. I first placed the file in a Clob and worked from there. A schema is registered with a call to:
dbms_xmlschema.registerschema(‘http://www.tuwtuw.nl/xsd/mzk_imp_schema.xsd’,v_clob)
After this procedure has finished Oracle created objects like:
A object table to store the XML files:
TABLE MZK_TRACKS_IMP OF SYS.XMLTYPE
XMLSCHEMA “http://www.tuwtuw.nl/xsd/mzk_imp_schema.xsd”
ELEMENT “body”
In this table you can find the hierarchical structure of the XML file. Because I used annotations one of them is:
TYPE “MEDIAFILE_COLLECTION” AS VARRAY(100) OF “MEDIAFILE_TYPE”
TYPE “MEDIAFILE_TYPE” AS OBJECT (“SYS_XDBPD$” “XDB”.”XDB$RAW_LIST_T”
,”PATH” VARCHAR2 (4000 CHAR)
,”NAME” VARCHAR2 (4000 CHAR)
,”SIZE_IN_BYTES” NUMBER (38)
,”ARTIST” VARCHAR2 (4000 CHAR)
,”ALBUM” VARCHAR2 (4000 CHAR)
,”TITLE” VARCHAR2 (4000 CHAR)
,”YEAR” NUMBER (38)
,”TRACK” NUMBER (38))FINAL INSTANTIABLE
Now I can insert XML files into the table and doing so the XML is shredded into the structure of the object table. In the demo again they use Ftp to insert the data. I used Clob storage for the XML file and worked from there.
insert into mzk_tracks_imp values
(xmltype(v_clob).createschemabasedxml(‘http://www.tuwtuw.nl/xsd/mzk_imp_schema.xsd’));
During the insert there is a call to the method createschemabasedxml with the schema as parameter. As I understand it this call will shred the XML document into pieces which are stored in the underlying object table structure. The data can now be queried using the extractvalue function like:
select extractvalue(value(tip),’/media_file/title’) title
from mzk_tracks_imp tip
with the result 1 row selected. The MEDIA FILE part of the XML file is a collection and has to be transformed using the table function
select extractvalue(value(tmf),’/media_file/title’) title
from mzk_tracks_imp tip
, table(xmlsequence(extract(value(tip),’/body/media_file’))) tmf
with result:
Born At The Right Time
Me & Julio Down By The School
Graceland
…
…
The final step is to create a view on top of the object table and use it to query the data I want to process
create or replace view mzk_tracks_imp_vw
(PATH, NAME, SIZE_IN_BYTES,ARTIST,ALBUM,TITLE,TRACK,YEAR)
as
select extractvalue(value(tmf),’/media_file/path’) path
, extractvalue(value(tmf),’/media_file/name’) name
, extractvalue(value(tmf),’/media_file/size_in_bytes’) size_in_bytes
, extractvalue(value(tmf),’/media_file/artist’) artist
, extractvalue(value(tmf),’/media_file/album’) album
, extractvalue(value(tmf),’/media_file/title’) title
, extractvalue(value(tmf),’/media_file/track’) track
, extractvalue(value(tmf),’/media_file/year’) year
from mzk_tracks_imp tip
, table(xmlsequence(extract(value(tip),’/body/media_file’))) tmf
Altough a lot of issues are not quit clear to me this demo helped me to understand this type of XML processing. I now can create a process which picks up the files, places them in Clobs and insert them as object-relational data in my mzk_tracks_imp table. The view gives me easy and “normal” query possibilities to process the data. I would like to get feedback about this approach to further improve this process or stop it all together because it make no sense what so ever.