
Parsing a Microsoft Word docx, and unzipping zipfiles, with PL/SQL
A few days ago a colleague of mine asked if I could make something for him to unzip a Microsoft Word 2007 docx file. In the database, of course, and without using Java.
As it turns out, a docx file is just an ordinary zipfile with some XML files stored in it. And because I had already built a little procedure to make zipfiles some weeks ago, it didn’t take me more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get a list of all the files in a zipfile, and unzip a file if you want. And if you know a little XML you can query the text from your Word document.
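To see the idea outside the database, here is a minimal Python sketch. It is not part of the as_zip package and does not show its API. It only illustrates the two steps described above: listing the files inside a docx, and pulling the text out of the XML. The function names (`list_files`, `extract_text`, `make_sample_docx`) are made up for this illustration, and the sample document is a stripped-down stand-in that contains only `word/document.xml`, which is where Word normally keeps the body text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_files(docx_bytes):
    """Return the names of all entries in the docx (a zip archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


def extract_text(docx_bytes):
    """Return one string per paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_sample_docx(lines):
    """Hypothetical helper: build a tiny docx-like zip, for demonstration only."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "World"])
    print(list_files(sample))
    print(extract_text(sample))
```

A real docx written by Word holds more entries (styles, relationships, content types), but the same two calls should work on it as long as `word/document.xml` is present.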
Say you have a Word document like this

Then you can query the text from it like this

As you can see, the text is shown twice. I didn’t put any time into trying to understand the Word format; I leave that to somebody else.
Anton
And here’s a link to the code used: as_zip, a package with zip and unzip



12/6/2010 - 1:35 pm
Thank you very much, sir!…
13/6/2010 - 8:50 pm
I could not understand why the words appear twice, but I will think about that.
22/6/2010 - 4:36 pm
nice, thanks
5/11/2010 - 9:38 am
The double text shown in my example is caused by a bug with XMLTYPE and blobs on my XE database.
17/11/2010 - 1:51 pm
It’s amazing how complex docx and doc files can be. I’ve tried to parse them with Python and they are quite difficult. Our program does a unix conversion of docx to doc files in batch format.
Thanks for the post.
26/2/2011 - 9:58 am
thank you very much for sharing. so unjust for word ???
6/5/2011 - 2:57 pm
I have problems using add1file when modifying a docx containing a TIFF image. The procedure assumes the file is compressed, but Word stores it uncompressed.
I have tried modifying it, but the local header gets into trouble further down. Help!
16/9/2011 - 9:45 am
Thank you very much for sharing
28/9/2011 - 3:15 pm
it is useful, thank you