Some days ago a collegue of mine asked if I could made something for him to unzip a Microsoft Word 2007 docx file. And of course in the database and without using Java.
As it turns out, a docx file is just a ordinary zipfile, with some xml-files stored in it. And because I already had build a little procedure to make zipfiles some weeks ago it didn’t took me not more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text form your Word document.
Say you have a Word document like this
Then you can query the text from it like this
As you can see the text is shown twice, I didn’t put time in trying to understand the Word format. I leave that to somebody else.
And here’s a link with the used code: as_zip
(old) package with zip and unzip
And a new version on git
** moved code to git
** deflate64, zip64, Winzip encryption
** Date: 04-08-2016
** fixed endless loop for empty/null zip file
** Date: 28-07-2016
** added support for defate64 (this only works for zip-files created with 7Zip)
** Date: 31-01-2014
** file limit increased to 4GB
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license
13 thoughts on “Parsing a Microsoft Word docx, and unzip zipfiles, with PL/SQL”
Thanks for this awesome library. I have a zipped XML file stored in the database and I am using your package to extract the zip inline and then perform an xquery on it.
TO_DATE(EXTRACT(XMLTYPE(SCHEMA.BLOB_TO_CLOB(SCHEMA.as_zip.GET_FILE(F.PAYLOAD, REPLACE(F.FILENAME,’.zip’,’.xml’), null))),’//Transaction[@transactionID = “‘|| CRN.TRAN_ID ||'”]/CATSNotification/ChangeRequest/ChangeData/ActualChangeDate/text()’).getStringVal(),’YYYY-MM-DD’) AS ACTUALCHANGEDATE
Thanks again, Jason
Thank you for implementing these changes.
I’m using your zip package in Oracle 11g XE. It’s great and very useful.
But I had some problems with german umlauts in the filename.
In the zipfile the filename was cut. For each umlaut one or two characters are missing at the end.
I changed just 1 byte in your code adding a B for length in bytes and now it’s working:
/* -> little_endian( length( p_name ), 2 ) — File name length */
-> little_endian( lengthb( p_name ), 2 ) — File name length
I’ve changed the procedure add1file a little bit more to give it more support for non-ascii filenames.
( p_zipped_blob in out blob
, p_name varchar2
, p_content blob
t_crc32 raw(4) := hextoraw( '00000000' );
t_compressed boolean := false;
t_now := sysdate;
t_len := nvl( dbms_lob.getlength( p_content ), 0 );
if t_len > 0
t_blob := utl_compress.lz_compress( p_content );
t_clen := dbms_lob.getlength( t_blob ) - 18;
t_compressed := t_clen < t_len;
t_crc32 := dbms_lob.substr( t_blob, 4, t_clen + 11 );
if not t_compressed
t_clen := t_len;
t_blob := p_content;
if p_zipped_blob is null
dbms_lob.createtemporary( p_zipped_blob, true );
t_name := utl_i18n.string_to_raw( p_name, 'AL32UTF8' );
, utl_raw.concat( c_LOCAL_FILE_HEADER -- Local file header signature
, hextoraw( '1400' ) -- version 2.0
, case when t_name = utl_i18n.string_to_raw( p_name, 'US8PC437' )
then hextoraw( '0000' ) -- no General purpose bits
else hextoraw( '0008' ) -- set Language encoding flag (EFS)
, case when t_compressed
then hextoraw( '0800' ) -- deflate
else hextoraw( '0000' ) -- stored
, little_endian( to_number( to_char( t_now, 'ss' ) ) / 2
+ to_number( to_char( t_now, 'mi' ) ) * 32
+ to_number( to_char( t_now, 'hh24' ) ) * 2048
) -- File last modification time
, little_endian( to_number( to_char( t_now, 'dd' ) )
+ to_number( to_char( t_now, 'mm' ) ) * 32
+ ( to_number( to_char( t_now, 'yyyy' ) ) - 1980 ) * 512
) -- File last modification date
, t_crc32 -- CRC-32
, little_endian( t_clen ) -- compressed size
, little_endian( t_len ) -- uncompressed size
, little_endian( utl_raw.length( t_name ), 2 ) -- File name length
, hextoraw( '0000' ) -- Extra field length
, t_name -- File name
dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 11 ); -- compressed content
elsif t_clen > 0
dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 1 ); -- content
if dbms_lob.istemporary( t_blob ) = 1
dbms_lob.freetemporary( t_blob );
it is useful, thank you
Thank you very much for sharing
problems using add1file when modifying a docx containing a tiffÂ image. procedure assumes compressed and word stores the file uncompressed.
Have tried modifying it but local header gets into trouble further down.? help!
thank you very much for sharing. so unjust for word ???
It’s amazing how complex docx and doc files can be. Â I’ve tried to parse them with Python and they are quite difficult. Â Our program does a unix conversion of docx to doc files in batch format.
Thanks for the post.
The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.
I could notÂ understand why the words appear twice. but I think of that.
Thanks you very much sir!…
Comments are closed.