Some days ago a collegue of mine asked if I could made something for him to unzip a Microsoft Word 2007 docx file. And of course in the database and without using Java.
As it turns out, a docx file is just a ordinary zipfile, with some xml-files stored in it. And because I already had build a little procedure to make zipfiles some weeks ago it didn’t took me not more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text form your Word document.
Say you have a Word document like this
Then you can query the text from it like this
As you can see the text is shown twice, I didn’t put time in trying to understand the Word format. I leave that to somebody else.
Anton
And here’s a link with the used code: as_zip
(old) package with zip and unzip
And a new version on git
** Changelog:
** Date:30-09-2021
** moved code to git
** deflate64, zip64, Winzip encryption
** Date: 04-08-2016
** fixed endless loop for empty/null zip file
** Date: 28-07-2016
** added support for defate64 (this only works for zip-files created with 7Zip)
** Date: 31-01-2014
** file limit increased to 4GB
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license
Thanks for this awesome library. I have a zipped XML file stored in the database and I am using your package to extract the zip inline and then perform an xquery on it.
Example:
TO_DATE(EXTRACT(XMLTYPE(SCHEMA.BLOB_TO_CLOB(SCHEMA.as_zip.GET_FILE(F.PAYLOAD, REPLACE(F.FILENAME,’.zip’,’.xml’), null))),’//Transaction[@transactionID = “‘|| CRN.TRAN_ID ||'”]/CATSNotification/ChangeRequest/ChangeData/ActualChangeDate/text()’).getStringVal(),’YYYY-MM-DD’) AS ACTUALCHANGEDATE
Thanks again, Jason
Thank you for implementing these changes.
Hi Anton,
I’m using your zip package in Oracle 11g XE. It’s great and very useful.
But I had some problems with german umlauts in the filename.
In the zipfile the filename was cut. For each umlaut one or two characters are missing at the end.
I changed just 1 byte in your code adding a B for length in bytes and now it’s working:
…
procedure add1file
…
-> dbms_lob.append
…
/* -> little_endian( length( p_name ), 2 ) — File name length */
-> little_endian( lengthb( p_name ), 2 ) — File name length
Regards
Klaus
Klaus
I’ve changed the procedure add1file a little bit more to give it more support for non-ascii filenames.
procedure add1file
;( p_zipped_blob in out blob
, p_name varchar2
, p_content blob
)
is
t_now date;
t_blob blob;
t_len integer;
t_clen integer;
t_crc32 raw(4) := hextoraw( '00000000' );
t_compressed boolean := false;
t_name raw(32767);
begin
t_now := sysdate;
t_len := nvl( dbms_lob.getlength( p_content ), 0 );
if t_len > 0
then
t_blob := utl_compress.lz_compress( p_content );
t_clen := dbms_lob.getlength( t_blob ) - 18;
t_compressed := t_clen < t_len;
t_crc32 := dbms_lob.substr( t_blob, 4, t_clen + 11 );
end if;
if not t_compressed
then
t_clen := t_len;
t_blob := p_content;
end if;
if p_zipped_blob is null
then
dbms_lob.createtemporary( p_zipped_blob, true );
end if;
t_name := utl_i18n.string_to_raw( p_name, 'AL32UTF8' );
dbms_lob.append( p_zipped_blob
, utl_raw.concat( c_LOCAL_FILE_HEADER -- Local file header signature
, hextoraw( '1400' ) -- version 2.0
, case when t_name = utl_i18n.string_to_raw( p_name, 'US8PC437' )
then hextoraw( '0000' ) -- no General purpose bits
else hextoraw( '0008' ) -- set Language encoding flag (EFS)
end
, case when t_compressed
then hextoraw( '0800' ) -- deflate
else hextoraw( '0000' ) -- stored
end
, little_endian( to_number( to_char( t_now, 'ss' ) ) / 2
+ to_number( to_char( t_now, 'mi' ) ) * 32
+ to_number( to_char( t_now, 'hh24' ) ) * 2048
, 2
) -- File last modification time
, little_endian( to_number( to_char( t_now, 'dd' ) )
+ to_number( to_char( t_now, 'mm' ) ) * 32
+ ( to_number( to_char( t_now, 'yyyy' ) ) - 1980 ) * 512
, 2
) -- File last modification date
, t_crc32 -- CRC-32
, little_endian( t_clen ) -- compressed size
, little_endian( t_len ) -- uncompressed size
, little_endian( utl_raw.length( t_name ), 2 ) -- File name length
, hextoraw( '0000' ) -- Extra field length
, t_name -- File name
)
);
if t_compressed
then
dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 11 ); -- compressed content
elsif t_clen > 0
then
dbms_lob.copy( p_zipped_blob, t_blob, t_clen, dbms_lob.getlength( p_zipped_blob ) + 1, 1 ); -- content
end if;
if dbms_lob.istemporary( t_blob ) = 1
then
dbms_lob.freetemporary( t_blob );
end if;
end
it is useful, thank you
Thank you very much for sharing
problems using add1file when modifying a docx containing a tiff image. procedure assumes compressed and word stores the file uncompressed.
Have tried modifying it but local header gets into trouble further down.? help!
thank you very much for sharing. so unjust for word ???
It’s amazing how complex docx and doc files can be. Â I’ve tried to parse them with Python and they are quite difficult. Â Our program does a unix conversion of docx to doc files in batch format.
Thanks for the post.
The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.
nice, thanks
I could not understand why the words appear twice. but I think of that.
Thanks you very much sir!…