Parsing a Microsoft Word docx, and unzip zipfiles, with PL/SQL

13

Some days ago a collegue of mine asked if I could made something for him to unzip a Microsoft Word 2007 docx file. And of course in the database and without using Java.
As it turns out, a docx file is just a ordinary zipfile, with some xml-files stored in it. And because I already had build a little procedure to make zipfiles some weeks ago it didn’t took me not more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text form your Word document.

Say you have a Word document like this

Then you can query the text from it like this

As you can see the text is shown twice, I didn’t put time in trying to understand the Word format. I leave that to somebody else.

Anton

And here’s a link with the used code: as_zip
package with zip and unzip

** Changelog:
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license
** Date: 31-01-2014
** file limit increased to 4GB

Share.

About Author

13 Comments

  1. Thanks for this awesome library. I have a zipped XML file stored in the database and I am using your package to extract the zip inline and then perform an xquery on it.

    Example:
    TO_DATE(EXTRACT(XMLTYPE(SCHEMA.BLOB_TO_CLOB(SCHEMA.as_zip.GET_FILE(F.PAYLOAD, REPLACE(F.FILENAME,’.zip’,’.xml’), null))),’//Transaction[@transactionID = “‘|| CRN.TRAN_ID ||'”]/CATSNotification/ChangeRequest/ChangeData/ActualChangeDate/text()’).getStringVal(),’YYYY-MM-DD’) AS ACTUALCHANGEDATE

    Thanks again, Jason

  2. Klaus Schuermann on

    Hi Anton,
    I’m using your zip package in Oracle 11g XE. It’s great and very useful.
    But I had some problems with german umlauts in the filename.
    In the zipfile the filename was cut. For each umlaut one or two characters are missing at the end.
    I changed just 1 byte in your code adding a B for length in bytes and now it’s working:

    procedure add1file

    -> dbms_lob.append

    /* -> little_endian( length( p_name ), 2 ) — File name length */
    -> little_endian( lengthb( p_name ), 2 ) — File name length

    Regards
    Klaus
    Klaus

    • Anton Scheffer on

      I’ve changed the procedure add1file a little bit more to give it more support for non-ascii filenames.
      procedure add1file
      ( p_zipped_blob in out blob
      , p_name varchar2
      , p_content blob
      )
      is
      t_now date;
      t_blob blob;
      t_len integer;
      t_clen integer;
      t_crc32 raw(4) := hextoraw( '00000000' );
      t_compressed boolean := false;
      t_name raw(32767);
      begin
      t_now := sysdate;
      t_len := nvl( dbms_lob.getlength( p_content ), 0 );
      if t_len > 0
      then
      t_blob := utl_compress.lz_compress( p_content );
      t_clen := dbms_lob.getlength( t_blob ) - 18;
      t_compressed := t_clen < t_len;
      t_crc32 := dbms_lob.substr( t_blob, 4, t_clen + 11 );
      end if;
      if not t_compressed
      then
      t_clen := t_len;
      t_blob := p_content;
      end if;
      if p_zipped_blob is null
      then
      dbms_lob.createtemporary( p_zipped_blob, true );
      end if;
      t_name := utl_i18n.string_to_raw( p_name, 'AL32UTF8' );
      dbms_lob.append( p_zipped_blob
      , utl_raw.concat( c_LOCAL_FILE_HEADER -- Local file header signature
      , hextoraw( '1400' ) -- version 2.0
      , case when t_name = utl_i18n.string_to_raw( p_name, 'US8PC437' )
      then hextoraw( '0000' ) -- no General purpose bits
      else hextoraw( '0008' ) -- set Language encoding flag (EFS)
      end
      , case when t_compressed
      then hextoraw( '0800' ) -- deflate
      else hextoraw( '0000' ) -- stored
      end
      , little_endian( to_number( to_char( t_now, 'ss' ) ) / 2
      + to_number( to_char( t_now, 'mi' ) ) * 32
      + to_number( to_char( t_now, 'hh24' ) ) * 2048
      , 2
      ) -- File last modification time
      , little_endian( to_number( to_char( t_now, 'dd' ) )
      + to_number( to_char( t_now, 'mm' ) ) * 32
      + ( to_number( to_char( t_now, 'yyyy' ) ) - 1980 ) * 512
      , 2
      ) -- File last modification date
      , t_crc32 -- CRC-32
      , little_endian( t_clen ) -- compressed size
      , little_endian( t_len ) -- uncompressed size
      , little_endian( utl_raw.length( t_name ), 2 ) -- File name length
      , hextoraw( '0000' ) -- Extra field length
      , t_name -- File name
      )
      );
      if t_compressed
      then
      dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 11 ); -- compressed content
      elsif t_clen > 0
      then
      dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 1 ); -- content
      end if;
      if dbms_lob.istemporary( t_blob ) = 1
      then
      dbms_lob.freetemporary( t_blob );
      end if;
      end
      ;

  3. problems using add1file when modifying a docx containing a tiff image. procedure assumes compressed and word stores the file uncompressed.
    Have tried modifying it but local header gets into trouble further down.? help!

  4. It’s amazing how complex docx and doc files can be.  I’ve tried to parse them with Python and they are quite difficult.  Our program does a unix conversion of docx to doc files in batch format.
    Thanks for the post.

  5. Anton Scheffer on

    The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.