Parsing a Microsoft Word docx, and unzipping zipfiles, with PL/SQL

Some days ago a colleague of mine asked if I could make something for him to unzip a Microsoft Word 2007 docx file. In the database, of course, and without using Java.
As it turns out, a docx file is just an ordinary zipfile with some xml-files stored in it. And because I had already built a little procedure to make zipfiles some weeks ago, it took me no more than 3 hours to build a package to unzip a zipfile from PL/SQL.
With this package you can get a list of all the files in a zipfile, and unzip a file if you want. And if you know a little xml you can query the text from your Word document.
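
To illustrate the point that a docx is an ordinary zipfile, here is a minimal sketch in Python, outside the database. It is not the PL/SQL package itself. The helper names `make_demo_docx`, `list_zip_entries` and `read_zip_entry` are made up for this illustration, and the demo archive holds only two parts, so it is docx-like rather than a file Word would open.

```python
import io
import zipfile


def make_demo_docx() -> bytes:
    """Build a tiny docx-like archive in memory (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "[Content_Types].xml",
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"/>',
        )
        zf.writestr(
            "word/document.xml",
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
            "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>",
        )
    return buffer.getvalue()


def list_zip_entries(zipped: bytes) -> list[str]:
    """Return the names of all files stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(zipped)) as zf:
        return zf.namelist()


def read_zip_entry(zipped: bytes, name: str) -> bytes:
    """Return the uncompressed content of one file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zipped)) as zf:
        return zf.read(name)


docx = make_demo_docx()
print(list_zip_entries(docx))
print(read_zip_entry(docx, "word/document.xml")[:38])
```

The two helpers mirror the two things described above: listing the files in a zipfile and unzipping one of them. In a real docx the main text lives in `word/document.xml`.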

Say you have a Word document like this:
[Image: the demo Word document]

Then you can query the text from it like this:
[Image: the SQL query on the Word document]

As you can see, the text is shown twice. I didn’t spend time trying to understand the Word format; I leave that to somebody else.
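
For readers who want to see what "query the text" amounts to, here is a small Python sketch under some assumptions: the text of a docx sits in `w:t` elements inside `w:p` paragraph elements of `word/document.xml`, in the WordprocessingML namespace. The function `docx_paragraphs` and the in-memory demo file are hypothetical. This is not the SQL query from the post, and it makes no attempt to explain the duplicated text mentioned above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:r, w:t).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A hand-written, minimal document part used only for this demo.
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>from Word</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_demo_docx() -> bytes:
    """Zip the demo document part into a docx-like archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def docx_paragraphs(docx: bytes) -> list[str]:
    """Return the text of every w:p element, joining its w:t runs."""
    with zipfile.ZipFile(io.BytesIO(docx)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        runs = (t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(runs))
    return paragraphs


print(docx_paragraphs(make_demo_docx()))
# ['Hello from Word', 'Second paragraph']
```

Word often splits one visible sentence over several runs, which is why the sketch joins all `w:t` elements per paragraph instead of returning them one by one.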

Anton

And here’s a link to the code used: as_zip

(old) package with zip and unzip
And a new version on git
** Changelog:
** Date: 30-09-2021
** moved code to git
** deflate64, zip64, Winzip encryption
** Date: 04-08-2016
** fixed endless loop for empty/null zip file
** Date: 28-07-2016
** added support for deflate64 (this only works for zip-files created with 7Zip)
** Date: 31-01-2014
** file limit increased to 4GB
** Date: 29-04-2012
** fixed bug for large uncompressed files, thanks Morten Braten
** Date: 21-03-2012
** Take CRC32, compressed length and uncompressed length from
** Central file header instead of Local file header
** Date: 17-02-2012
** Added more support for non-ascii filenames
** Date: 25-01-2012
** Added MIT-license
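
The 21-03-2012 entry mentions taking the CRC32 and both lengths from the central file header. As background, and not as a statement of why that fix was made: the zip format allows a writer that streams its output to leave those three fields zero in the local file header (general purpose flag bit 3) and to record the real values in the central directory. The Python sketch below reads them from the central directory by hand. The function name `central_directory_entries` is hypothetical, and the sketch assumes a small archive without zip64, without an archive comment, and with UTF-8 or ASCII file names.

```python
import io
import struct
import zipfile
import zlib


def make_demo_zip(name: str, content: bytes) -> bytes:
    """Create a one-file deflated zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, content)
    return buffer.getvalue()


def central_directory_entries(data: bytes) -> list[tuple[str, int, int, int]]:
    """Return (name, crc32, compressed size, uncompressed size) per entry."""
    # The End of Central Directory record starts with PK\x05\x06.
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("no end of central directory record found")
    # Offset 10: total entries (2 bytes), directory size (4), directory offset (4).
    total, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)

    entries = []
    pos = cd_offset
    for _ in range(total):
        if data[pos:pos + 4] != b"PK\x01\x02":
            raise ValueError("bad central file header signature")
        # Offset 16: crc32, compressed size, uncompressed size (4 bytes each),
        # then name, extra field and comment lengths (2 bytes each).
        crc, csize, usize, name_len, extra_len, comment_len = struct.unpack_from(
            "<IIIHHH", data, pos + 16
        )
        name = data[pos + 46:pos + 46 + name_len].decode("utf-8")
        entries.append((name, crc, csize, usize))
        # The fixed part of a central file header is 46 bytes long.
        pos += 46 + name_len + extra_len + comment_len
    return entries


content = b"<w:t>hello</w:t>" * 50
blob = make_demo_zip("word/document.xml", content)
for entry in central_directory_entries(blob):
    print(entry)
```

The CRC32 in the printed tuple equals `zlib.crc32(content)` and the last value equals `len(content)`, which is a quick way to check that the offsets are right.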
