Parsing a Microsoft Word docx, and unzip zipfiles, with PL/SQL

Some days ago a collegue of mine asked if I could made something for him to unzip a Microsoft Word 2007 docx file. And of course in the database and without using Java.
As it turns out, a docx file is just a ordinary zipfile, with some xml-files stored in it. And because I already had build a little procedure to make zipfiles some weeks ago it didn’t took me not more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text form your Word document.

Say you have a Word document like this

Then you can query the text from it like this

As you can see the text is shown twice, I didn’t put time in trying to understand the Word format. I leave that to somebody else.


And here’s a link with the used code: as_zip

(old) package with zip and unzip
And a new version on git
** Changelog:
** Date:30-09-2021
** moved code to git
** deflate64, zip64, Winzip encryption
** Date: 04-08-2016
** fixed endless loop for empty/null zip file
** Date: 28-07-2016
** added support for defate64 (this only works for zip-files created with 7Zip)
** Date: 31-01-2014
** file limit increased to 4GB
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license

13 thoughts on “Parsing a Microsoft Word docx, and unzip zipfiles, with PL/SQL

  1. Thanks for this awesome library. I have a zipped XML file stored in the database and I am using your package to extract the zip inline and then perform an xquery on it.

    TO_DATE(EXTRACT(XMLTYPE(SCHEMA.BLOB_TO_CLOB(SCHEMA.as_zip.GET_FILE(F.PAYLOAD, REPLACE(F.FILENAME,’.zip’,’.xml’), null))),’//Transaction[@transactionID = “‘|| CRN.TRAN_ID ||'”]/CATSNotification/ChangeRequest/ChangeData/ActualChangeDate/text()’).getStringVal(),’YYYY-MM-DD’) AS ACTUALCHANGEDATE

    Thanks again, Jason

  2. Hi Anton,
    I’m using your zip package in Oracle 11g XE. It’s great and very useful.
    But I had some problems with german umlauts in the filename.
    In the zipfile the filename was cut. For each umlaut one or two characters are missing at the end.
    I changed just 1 byte in your code adding a B for length in bytes and now it’s working:

    procedure add1file

    -> dbms_lob.append

    /* -> little_endian( length( p_name ), 2 ) — File name length */
    -> little_endian( lengthb( p_name ), 2 ) — File name length


    1. I’ve changed the procedure add1file a little bit more to give it more support for non-ascii filenames.
      procedure add1file
      ( p_zipped_blob in out blob
      , p_name varchar2
      , p_content blob
      t_now date;
      t_blob blob;
      t_len integer;
      t_clen integer;
      t_crc32 raw(4) := hextoraw( '00000000' );
      t_compressed boolean := false;
      t_name raw(32767);
      t_now := sysdate;
      t_len := nvl( dbms_lob.getlength( p_content ), 0 );
      if t_len > 0
      t_blob := utl_compress.lz_compress( p_content );
      t_clen := dbms_lob.getlength( t_blob ) - 18;
      t_compressed := t_clen < t_len;
      t_crc32 := dbms_lob.substr( t_blob, 4, t_clen + 11 );
      end if;
      if not t_compressed
      t_clen := t_len;
      t_blob := p_content;
      end if;
      if p_zipped_blob is null
      dbms_lob.createtemporary( p_zipped_blob, true );
      end if;
      t_name := utl_i18n.string_to_raw( p_name, 'AL32UTF8' );
      dbms_lob.append( p_zipped_blob
      , utl_raw.concat( c_LOCAL_FILE_HEADER -- Local file header signature
      , hextoraw( '1400' ) -- version 2.0
      , case when t_name = utl_i18n.string_to_raw( p_name, 'US8PC437' )
      then hextoraw( '0000' ) -- no General purpose bits
      else hextoraw( '0008' ) -- set Language encoding flag (EFS)
      , case when t_compressed
      then hextoraw( '0800' ) -- deflate
      else hextoraw( '0000' ) -- stored
      , little_endian( to_number( to_char( t_now, 'ss' ) ) / 2
      + to_number( to_char( t_now, 'mi' ) ) * 32
      + to_number( to_char( t_now, 'hh24' ) ) * 2048
      , 2
      ) -- File last modification time
      , little_endian( to_number( to_char( t_now, 'dd' ) )
      + to_number( to_char( t_now, 'mm' ) ) * 32
      + ( to_number( to_char( t_now, 'yyyy' ) ) - 1980 ) * 512
      , 2
      ) -- File last modification date
      , t_crc32 -- CRC-32
      , little_endian( t_clen ) -- compressed size
      , little_endian( t_len ) -- uncompressed size
      , little_endian( utl_raw.length( t_name ), 2 ) -- File name length
      , hextoraw( '0000' ) -- Extra field length
      , t_name -- File name
      if t_compressed
      dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 11 ); -- compressed content
      elsif t_clen > 0
      dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 1 ); -- content
      end if;
      if dbms_lob.istemporary( t_blob ) = 1
      dbms_lob.freetemporary( t_blob );
      end if;

  3. problems using add1file when modifying a docx containing a tiff image. procedure assumes compressed and word stores the file uncompressed.
    Have tried modifying it but local header gets into trouble further down.? help!

  4. It’s amazing how complex docx and doc files can be.  I’ve tried to parse them with Python and they are quite difficult.  Our program does a unix conversion of docx to doc files in batch format.
    Thanks for the post.

  5. The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.

Comments are closed.

Next Post

Calling an EJB from a SOA Composite Application using the EJB Binding based on Java Interface

I am currently reworking Chapter 12 for the SOA Suite 11g Handbook. This chapter describes various types of interaction SOA Composite Applications can have with Java applications and components. Since the initial creation of this chapter – some 6 months ago – we have had the Patch Set 2 release […]
%d bloggers like this: