Parsing a Microsoft Word docx, and unzip zipfiles, with PL/SQL

Anton Scheffer June 9, 2010 Database, IT, PL/SQL, XML 13 Comments

Some days ago a collegue of mine asked if I could made something for him to unzip a Microsoft Word 2007 docx file. And of course in the database and without using Java.
As it turns out, a docx file is just a ordinary zipfile, with some xml-files stored in it. And because I already had build a little procedure to make zipfiles some weeks ago it didn’t took me not more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text form your Word document.

Say you have a Word document like this

Then you can query the text from it like this

As you can see the text is shown twice, I didn’t put time in trying to understand the Word format. I leave that to somebody else.

Anton

And here’s a link with the used code: as_zip

(old) package with zip and unzip
And a new version on git
** Changelog:
** Date:30-09-2021
** moved code to git
** deflate64, zip64, Winzip encryption
** Date: 04-08-2016
** fixed endless loop for empty/null zip file
** Date: 28-07-2016
** added support for defate64 (this only works for zip-files created with 7Zip)
** Date: 31-01-2014
** file limit increased to 4GB
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license

Tags:as_zip, plsql

About The Author

Anton Scheffer

Oracle Consultant at AMIS

13 Comments

Jason September 4, 2013

Thanks for this awesome library. I have a zipped XML file stored in the database and I am using your package to extract the zip inline and then perform an xquery on it.

Example:
TO_DATE(EXTRACT(XMLTYPE(SCHEMA.BLOB_TO_CLOB(SCHEMA.as_zip.GET_FILE(F.PAYLOAD, REPLACE(F.FILENAME,’.zip’,’.xml’), null))),’//Transaction[@transactionID = “‘|| CRN.TRAN_ID ||'”]/CATSNotification/ChangeRequest/ChangeData/ActualChangeDate/text()’).getStringVal(),’YYYY-MM-DD’) AS ACTUALCHANGEDATE

Thanks again, Jason
Klaus Schuermann February 22, 2012

Thank you for implementing these changes.
Klaus Schuermann February 16, 2012

Hi Anton,
I’m using your zip package in Oracle 11g XE. It’s great and very useful.
But I had some problems with german umlauts in the filename.
In the zipfile the filename was cut. For each umlaut one or two characters are missing at the end.
I changed just 1 byte in your code adding a B for length in bytes and now it’s working:
…
procedure add1file
…
-> dbms_lob.append
…
/* -> little_endian( length( p_name ), 2 ) — File name length */
-> little_endian( lengthb( p_name ), 2 ) — File name length

Regards
Klaus
Klaus
- Anton Scheffer February 17, 2012
  
  I’ve changed the procedure add1file a little bit more to give it more support for non-ascii filenames.
  procedure add1file ( p_zipped_blob in out blob , p_name varchar2 , p_content blob ) is t_now date; t_blob blob; t_len integer; t_clen integer; t_crc32 raw(4) := hextoraw( '00000000' ); t_compressed boolean := false; t_name raw(32767); begin t_now := sysdate; t_len := nvl( dbms_lob.getlength( p_content ), 0 ); if t_len > 0 then t_blob := utl_compress.lz_compress( p_content ); t_clen := dbms_lob.getlength( t_blob ) - 18; t_compressed := t_clen < t_len; t_crc32 := dbms_lob.substr( t_blob, 4, t_clen + 11 ); end if; if not t_compressed then t_clen := t_len; t_blob := p_content; end if; if p_zipped_blob is null then dbms_lob.createtemporary( p_zipped_blob, true ); end if; t_name := utl_i18n.string_to_raw( p_name, 'AL32UTF8' ); dbms_lob.append( p_zipped_blob , utl_raw.concat( c_LOCAL_FILE_HEADER -- Local file header signature , hextoraw( '1400' ) -- version 2.0 , case when t_name = utl_i18n.string_to_raw( p_name, 'US8PC437' ) then hextoraw( '0000' ) -- no General purpose bits else hextoraw( '0008' ) -- set Language encoding flag (EFS) end , case when t_compressed then hextoraw( '0800' ) -- deflate else hextoraw( '0000' ) -- stored end , little_endian( to_number( to_char( t_now, 'ss' ) ) / 2 + to_number( to_char( t_now, 'mi' ) ) * 32 + to_number( to_char( t_now, 'hh24' ) ) * 2048 , 2 ) -- File last modification time , little_endian( to_number( to_char( t_now, 'dd' ) ) + to_number( to_char( t_now, 'mm' ) ) * 32 + ( to_number( to_char( t_now, 'yyyy' ) ) - 1980 ) * 512 , 2 ) -- File last modification date , t_crc32 -- CRC-32 , little_endian( t_clen ) -- compressed size , little_endian( t_len ) -- uncompressed size , little_endian( utl_raw.length( t_name ), 2 ) -- File name length , hextoraw( '0000' ) -- Extra field length , t_name -- File name ) ); if t_compressed then dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 11 ); -- compressed content elsif t_clen > 0 then dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 1 ); -- content end if; if dbms_lob.istemporary( t_blob ) = 1 then dbms_lob.freetemporary( t_blob ); end if; end;
Blank Names September 28, 2011

it is useful, thank you
mangesh September 16, 2011

Thank you very much for sharing
maxie May 6, 2011

problems using add1file when modifying a docx containing a tiffÂ image. procedure assumes compressed and word stores the file uncompressed.
Have tried modifying it but local header gets into trouble further down.? help!
tercÃ¼me February 26, 2011

thank you very much for sharing. so unjust for word ???
docx to doc files November 17, 2010

It’s amazing how complex docx and doc files can be. Â I’ve tried to parse them with Python and they are quite difficult. Â Our program does a unix conversion of docx to doc files in batch format.
Thanks for the post.
Anton Scheffer November 5, 2010

The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.
Microblog June 22, 2010

nice, thanks
moda June 13, 2010

I could notÂ understand why the words appear twice. but I think of that.
prefabrik June 12, 2010

Thanks you very much sir!…

Share this:

Like this:

Related Posts

About The Author

Anton Scheffer

13 Comments