Mastering XML DB - Unexpected side effect of updateXML startup 594127 1280

Mastering XML DB – Unexpected side effect of updateXML

When trying to create anonymous data, I encountered an unexpected side effect of the updateXML statement while updating XML documents in an Oracle 10gR2 (10.2.0.1.0) EE database.

I got a small test environment in which a table (HGO010_DETAM) resides that contains approximately 500Mb of XML documents. The real production environment is much bigger, think in 100th of Gigabytes per table and an average of 512 Kb (up to 6 Mb) per document. The web application that makes use of this data is document driven so these documents were stored in an XMLType column based on CLOB storage. The table also contains an extra ID column that was added for reference purposes.

The table was created as follows:

SQL> create table HGO010_DETAM
  2  (HGO_ID NUMBER(12), GEGEVENS XMLTYPE)
  3  ;

The full syntax would have been as shown via the dbms_metadata method:

SQL> select dbms_metadata.get_ddl('TABLE','HGO010_DETAM',USER) from dual;
  DBMS_METADATA.GET_DDL('TABLE','HGO010_DETAM','HGO')
--------------------------------------------------------------------------------  CREATE TABLE "HGO"."HGO010_DETAM"
   (    "HGO_ID" NUMBER(12,0) NOT NULL ENABLE,
        "GEGEVENS" "SYS"."XMLTYPE"  NOT NULL ENABLE,
         CONSTRAINT "FK_HGO010_HGO000" FOREIGN KEY ("HGO_ID")
          REFERENCES "HGO"."HGO000_SOFI" ("HGO_ID") ON DELETE CASCADE ENABLE
   ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
  STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
  TABLESPACE "DAT_TS01"
 XMLTYPE COLUMN "GEGEVENS" STORE AS CLOB (
  TABLESPACE "DAT_TS01" ENABLE STORAGE IN ROW CHUNK 8192 PCTVERSION 10
  NOCACHE LOGGING
  STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT))

A typical document in this table looks like this:

SQL> desc hgo.hgo010_detam
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 HGO_ID                                    NOT NULL NUMBER(12)
 GEGEVENS                                  NOT NULL SYS.XMLTYPE
SQL> select  *
  2  from    hgo.hgo010_detam t
  3  where   t.hgo_id=18383;
    HGO_ID
----------
T.GEGEVENS.EXTRACT('/*')
--------------------------------------------------------------------------------
     18383
<wwb:WW-HISTORIE xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:wwb
="http://www.uwv.nl/ww/historie/detam/WWBase" xmlns:wwe="http://www.uwv.nl/ww/hi
storie/detam/WWEntiteiten">
  <R801FINR>
    <R801-NUM-FIS>012345678</R801-NUM-FIS>
    <R801-NUM-DET>426270670105</R801-NUM-DET>
    <R801-IND-MUT-NUM-DET/>
    <R802GEVALSNR>
      <R802-NUM-GVL-WW>94628838</R802-NUM-GVL-WW>
      <R806GEVALSNR>
        <R806-NUM-GVL-WW>94628838</R806-NUM-GVL-WW>
        <R805LIDNR>
          <R805-NUM-LID>09045451</R805-NUM-LID>
        </R805LIDNR>
      </R806GEVALSNR>
      <OUDWWREC>
        <R815GEVFIL1>0</R815GEVFIL1>
        <R815GEVALSNR>94628838</R815GEVALSNR>
        <R815-NUM-FIS>220111765</R815-NUM-FIS>
        <R815LIDNR>09045451</R815LIDNR>
        <R815INDWW>4</R815INDWW>
        <R815-CAT-WW>04</R815-CAT-WW>
        <R815-DAT-BGN-WW>19950703</R815-DAT-BGN-WW>
        <R815-DAT-END-WW>19950726</R815-DAT-END-WW>

So as you can see, the document contains white spaces and has a “pretty print” layout. This is how the data was loaded in the XMLType column. The XML documents held there original layout because the XMLType column has a CLOB based storage (and therefore the data isn’t shredded as in XMLType columns based on Object Relational storage). XMLType CLOB storage stores unstructured data “as is”.

I needed to create anonymous data, for instance regarding the <R801-NUM-FIS> element. So tried an update of the data via the statement:

SQL> update hgo.hgo010_detam t
     set    t.gegevens=updateXML
     (t.gegevens,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
      ,'Marco','xmlns:wwb="http://www.uwv.nl/ww/historie/detam/WWBase"')
     where  hgo_id=18383;
 
1 row updated.

The update statement worked fine but I was surprised when I re-examined the data and saw the following:

SQL> select  *
  2  from    hgo.hgo010_detam t
  3  where   t.hgo_id=18383;
 
    HGO_ID
----------
GEGEVENS
--------------------------------------------------------------------------------
     18383
<?xml version="1.0" encoding="ISO-8859-1"?><wwb:WW-HISTORIE xmlns:xsi="http://ww
w.w3.org/2001/XMLSchema-instance" xmlns:wwb="http://www.uwv.nl/ww/historie/detam
/WWBase" xmlns:wwe="http://www.uwv.nl/ww/historie/detam/WWEntiteiten"><R801FINR>
<R801-NUM-FIS>Marco</R801-NUM-FIS><R801-NUM-DET>426270670105</R801-NUM-DET><R801
-IND-MUT-NUM-DET/><R802GEVALSNR><R802-NUM-GVL-WW>94628838</R802-NUM-GVL-WW><R806
GEVALSNR><R806-NUM-GVL-WW>94628838</R806-NUM-GVL-WW><R805LIDNR><R805-NUM-LID>090
45451</R805-NUM-LID></R805LIDNR></R806GEVALSNR><OUDWWREC><R815GEVFIL1>0</R815GEV
FIL1><R815GEVALSNR>94628838</R815GEVALSNR><R815-NUM-FIS>220111765</R815-NUM-FIS>
...etc.etc.etc

Pretty print layout and all the white spaces where gone…

I mentioned the effect on the Oracle TechNet XMLDB forum (http://forums.oracle.com/forums/forum.jspa?forumID=34) and Mark Drake guessed that the following was probably happening under the covers (http://forums.oracle.com/forums/thread.jspa?threadID=399742&tstart=0):

“We had to parse the XML into a DOM to perform the update, perform the update using the DOM API methods and then reserialize the DOM into text after the update was complete. We only preserve whitespace during parsing when the XML is schema based and the element is defined as mixed=”true” in the XML Schema, or in the case xml:space=”preserve” (Both of these cases require patches to work as expected). So the whitespace (pretty print) was lost when the XML was parsed prior to performing the update.”

“Since the XML is stored as CLOB we do not reparse it to pretty print when the row is selected..”

“You can force a pretty print by selecting”

“select HGO_ID, t.GEGEVENS.extract(‘/*’) from hgo010_detam t”

This sounds very plausible, testing the extract statement the following is shown in SQL*Plus:

SQL> select HGO_ID, t.GEGEVENS.extract('/*') from hgo010_detam t
  2  where t.hgo_id=18383;
 
    HGO_ID
----------
T.GEGEVENS.EXTRACT('/*')
--------------------------------------------------------------------------------
     18383
<wwb:WW-HISTORIE xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:wwb
="http://www.uwv.nl/ww/historie/detam/WWBase" xmlns:wwe="http://www.uwv.nl/ww/hi
storie/detam/WWEntiteiten">
  <R801FINR>
    <R801-NUM-FIS>Marco</R801-NUM-FIS>
    <R801-NUM-DET>426270670105</R801-NUM-DET>
    <R801-IND-MUT-NUM-DET/>
    <R802GEVALSNR>
      <R802-NUM-GVL-WW>94628838</R802-NUM-GVL-WW>
      <R806GEVALSNR>
        <R806-NUM-GVL-WW>94628838</R806-NUM-GVL-WW>
        <R805LIDNR>
          <R805-NUM-LID>09045451</R805-NUM-LID>
        </R805LIDNR>
      </R806GEVALSNR>
      <OUDWWREC>
...etc..etc..etc

In my small environment the referenced table contains 7500 documents. The original total size of these documents on disks is 437.796 Kb (when shown in Windows explorer). When checking on the sizes of these (re-rendered) text documents in the mentioned table, the sum total size is 450136538 bytes. The smallest XML document in this table is 1584 bytes and the biggest XML document in this table is 1138638 bytes.

SQL> select min(dbms_lob.getlength(t.gegevens.getclobval())) "MIN"
  2  from hgo.hgo010_detam t
  3  ;
MIN
---------
1584
1 row selected.
SQL> select max(dbms_lob.getlength(t.gegevens.getclobval())) "MAX"
  2  from hgo.hgo010_detam t
  3  ;
MAX
---------
1138638
1 row selected.
SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
  2  from hgo.hgo010_detam t
  3  ;
SUM
---------
450136538
1 row selected.

After updating all XML documents via an updateXML statement and as a side effect of this, removing all white spaces:

SQL> update hgo.hgo010_detam t
     set t.gegevens=updateXML
     (t.gegevens,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
     ,'012345678','xmlns:wwb="http://www.uwv.nl/ww/historie/detam/WWBase"');
7500 rows updated.

After this update the same SUM statement as shown above gave the following result:

SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
  2  from hgo.hgo010_detam t
  3  ;
SUM
---------
392869215
1 row selected.

This is 57267323 bytes less (~ 55 Mb).

SQL> select 450136538-392869215 "LESS" from dual;
LESS
---------
57267323
1 row selected.

In conclusion – the updateXML statements has  (at least in our case) nice side effects; it removes the white spaces which causes less space consumption and will allow XML parsers to be more effective while parsing the XML documents. Keep in mind that, if you address DOM methods, this will be very memory intensive (your document will be rebuild in memory).