Converting Word documents to XSL-FO (and onwards to PDF)


In the not too distant past, I had to implement solutions for generating PDF documents based on dynamic data and a document template defined by the end user. The approach we took was to let the end user create the document layout in MS Word, embedding simple tags to indicate the position of dynamic data elements. The Word document had to be saved as HTML, was cleansed into proper XHTML by JTidy, and was subsequently turned into an XSLT stylesheet that consisted largely of XSL-FO statements, with small pieces of XSLT embedded to inject the dynamic data elements. Applying that stylesheet to the data produced a proper XSL-FO document, which was finally transformed into PDF using Apache FOP. The XSLT that turned XHTML into XSLT-with-a-lot-of-XSL-FO was really the essence of the solution: translating HTML into XSL-FO was key.
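To make the template idea concrete, here is a minimal sketch of a stylesheet that is mostly literal XSL-FO with a single XSLT instruction injecting a dynamic value, applied with the standard JAXP API. The class name, the `customer` and `name` elements and the page dimensions are all made up for illustration. This is not the stylesheet we actually used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FoTemplateSketch {

    // Hypothetical template: literal XSL-FO plus one xsl:value-of for the dynamic data.
    static final String TEMPLATE =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/customer'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='A4' page-height='29.7cm' page-width='21cm'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='A4'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<fo:block font-size='12pt'>Dear <xsl:value-of select='name'/>,</fo:block>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the template to an XML data document and returns the resulting XSL-FO as text.
    static String toFo(String dataXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TEMPLATE)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(dataXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Template could not be applied", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toFo("<customer><name>Jane Doe</name></customer>"));
    }
}
```

The output is an ordinary XSL-FO document that can be handed to a formatter such as FOP. Only the data document changes from one run to the next.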

Yesterday I came across a very interesting article on MSDN that discusses a way of turning a Word document directly into an XSL-FO document: Transforming Word Documents into the XSL-FO Format (February 2005). It shows how you can save a Word document as XML and specify a stylesheet to be applied when saving. I had not known that my Word 2003 was also an XSLT processor! The article introduces a stylesheet – Word2FO.xsl – that can be used for the transformation into XSL-FO. I decided to give it a spin.

A short intro on XSL-FO

In 2001, before the advent of WordprocessingML, the W3C endorsed an XML formatting language known as XSL Formatting Objects (XSL-FO). XSL-FO is synonymous with eXtensible Stylesheet Language (XSL), one of three recommendations by the W3C’s XSL working group: XSLT, XPath and XSL-FO. XSL-FO is an intermediate form that results from applying an XSLT style sheet to an XML structured document. The XSL-FO form describes how pages appear when presented to a reader, such as a Web browser. Currently, there are no readers that directly interpret an XSL-FO document. To interpret them, you must run them through a formatter, along with other data, such as graphics and font metrics, to create a final displayable or printable file. Possible formats for the resulting file include Adobe’s Portable Document Format (PDF) and Hypertext Markup Language (HTML).

When compared to Cascading Style Sheets (CSS), XSL-FO provides a more sophisticated visual layout model. You can use CSS to apply specific style elements to an XML or HTML document. By contrast, XSL-FO is a language for describing a complete document. It includes everything needed to paginate and format a document. Some of the formatting supported by XSL-FO, but not by CSS, includes right-to-left and top-to-bottom text, footnotes, margin notes, page numbers in cross-references, and more. Note that while CSS is primarily intended for use on the Web, XSL-FO is designed for broader use. As an example, you could use an XSL-FO document to lay out an XML document as a printed book. You could write a completely separate XSLT style sheet to transform the same XML document into HTML.
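For readers who have never seen XSL-FO, the sketch below holds a minimal but complete document: a page master that defines the page geometry, and a page sequence whose flow contains two blocks. The Java wrapper only parses it to confirm that it is well-formed XML in the XSL-FO namespace. It does not validate it against the XSL-FO specification. The class name, the text and the page dimensions are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MinimalFoSketch {

    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // A complete XSL-FO document: page geometry first, then the content that flows into it.
    static final String MINIMAL_FO =
          "<fo:root xmlns:fo='" + FO_NS + "'>\n"
        + "  <fo:layout-master-set>\n"
        + "    <fo:simple-page-master master-name='page' page-height='29.7cm' page-width='21cm'>\n"
        + "      <fo:region-body margin='2cm'/>\n"
        + "    </fo:simple-page-master>\n"
        + "  </fo:layout-master-set>\n"
        + "  <fo:page-sequence master-reference='page'>\n"
        + "    <fo:flow flow-name='xsl-region-body'>\n"
        + "      <fo:block font-family='serif' font-size='12pt'>Hello, XSL-FO</fo:block>\n"
        + "      <fo:block>Page <fo:page-number/></fo:block>\n"
        + "    </fo:flow>\n"
        + "  </fo:page-sequence>\n"
        + "</fo:root>\n";

    // Parses the text as namespace-aware XML; fails if it is not well-formed.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Not well-formed XML", e);
        }
    }

    // Counts elements with the given local name in the XSL-FO namespace.
    static int countFoElements(String xml, String localName) {
        return parse(xml).getElementsByTagNameNS(FO_NS, localName).getLength();
    }

    public static void main(String[] args) {
        System.out.print(MINIMAL_FO);
        System.out.println("fo:block elements: " + countFoElements(MINIMAL_FO, "block"));
    }
}
```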

Converting Word documents to XSL-FO

As per the instructions in the MSDN article, I download the Word2FO.xsl stylesheet – with several supporting templates – and install them on my local hard drive. I then open Word 2003 and create a simple document with several layout features that are bound to pose a challenge for the conversion to XSL-FO. I then select Save As from the File menu.

As the document type I choose XML. This brings up a checkbox, Apply Transform. When it is checked, the Transform button is enabled. When I press that button, I can browse the file system to locate an XSL(T) stylesheet. It turns out Word is also an XSLT transformation engine! That was news to me. I select the Word2FO.xsl stylesheet. When I finally press Save, the document is saved as XSL-FO.
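In principle the same transformation can be run outside Word: save the document as plain Word XML (without Apply Transform) and apply the stylesheet with any XSLT processor. The sketch below does that with the JAXP API that ships with Java. The class name and command-line arguments are illustrative, and I have not verified that Word2FO.xsl runs unchanged on a processor other than the one built into Word, so treat that as an assumption.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class WordXmlToFoSketch {

    // Applies an XSLT stylesheet (for example Word2FO.xsl) to a document saved as Word XML.
    static void transform(File wordXml, File stylesheet, File foOut) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(stylesheet));
            transformer.transform(new StreamSource(wordXml), new StreamResult(foOut));
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("Usage: WordXmlToFoSketch <word.xml> <stylesheet.xsl> <output.fo>");
            return;
        }
        transform(new File(args[0]), new File(args[1]), new File(args[2]));
        System.out.println("Wrote " + args[2]);
    }
}
```

Because the stylesheet is loaded from a file, any supporting templates it imports with relative paths are resolved against the stylesheet's own location.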

Thus I have found a very rapid way of turning a layout created by an end user in Word into XSL-FO format. It is certainly an easy way to find out the XSL-FO syntax for certain layout features: it beats googling for the exact syntax of each layout property. It would also give me an interesting alternative for the solution I described above: instead of saving Word as HTML, tidying it to XHTML and transforming it into an XSLT stylesheet riddled with XSL-FO (heavily relying on the custom XHTML-to-XSL-FO stylesheet), I could pick up the XSL-FO created by Word and embed it in a prepared XSLT.

Transforming onwards to PDF

I am not able to quickly gather from this XSL-FO content whether the conversion was successful and complete. That is something I would typically leave to XSL-FO rendering engines, like Antenna House or Apache FOP. I usually work with Apache FOP (see http://xmlgraphics.apache.org/fop/), an open source implementation of an FO renderer. FOP can render to PDF as well as SVG, PS and RTF.

The result of rendering the XSL-FO document created by Word as PDF, using the stable 0.20.5 FOP release (July 2003!), looks as follows:

Well, not bad. However, we lost a couple of details, most notably the column layout in the original Word document. We also lost a couple of fonts: no Arial (or any sans-serif font, for that matter) ends up in the PDF. Of course we cannot tell from this example whether the FOP renderer is limited or the XSL-FO was faulty. I will give it a try with the latest (beta) 0.9 release of FOP (December 2005).

I have tried with FOP 0.91 and the result is very similar. Still no column layout, still no sans-serif fonts. I do get warnings about the fonts that should allow me to correct the situation (it seems that exactly matching font definitions need to be configured with FOP):

WARNING: Warning(1/6184): fo:table, table-layout="auto" is currently not supported by FOP
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,normal,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,normal,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'CourierNew,normal,400' not found. Substituting with default font.
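Each of these font warnings names a combination of font family, style and weight that FOP could not resolve. For a large document it can help to boil the log down to the distinct combinations that still need a font definition. The sketch below does that with a regular expression. The class name is made up, and reading the three comma-separated fields as family, style and weight is my interpretation of the log text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingFontsSketch {

    // Matches log fragments such as: Font 'Arial,normal,700' not found
    private static final Pattern MISSING_FONT =
            Pattern.compile("Font '([^,']+),([^,']+),(\\d+)' not found");

    // Returns each distinct missing family/style/weight combination, in order of first appearance.
    static List<String> missingTriplets(String log) {
        List<String> triplets = new ArrayList<>();
        Matcher matcher = MISSING_FONT.matcher(log);
        while (matcher.find()) {
            String triplet = "family=" + matcher.group(1)
                    + " style=" + matcher.group(2)
                    + " weight=" + matcher.group(3);
            if (!triplets.contains(triplet)) {
                triplets.add(triplet);
            }
        }
        return triplets;
    }

    public static void main(String[] args) {
        String log =
              "WARNING: Font 'Arial,normal,700' not found. Substituting with default font.\n"
            + "WARNING: Font 'Arial,italic,700' not found. Substituting with default font.\n"
            + "WARNING: Font 'Arial,normal,700' not found. Substituting with default font.\n";
        for (String triplet : missingTriplets(log)) {
            System.out.println(triplet);
        }
    }
}
```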

Note that it is very simple to write a Java class that can convert XSL-FO to PDF using Apache FOP. The code looks like this (derived from the example ExampleFO2PDF.java that is shipped with FOP):

package nl.amis.fop092;
/*
 * Copyright 1999-2005 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* $Id: ExampleFO2PDF.java 332791 2005-11-12 15:58:07Z jeremias $ */


// Java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

//JAXP
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Result;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.sax.SAXResult;


// FOP
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FOPException;
import org.apache.fop.apps.FormattingResults;
import org.apache.fop.apps.MimeConstants;
import org.apache.fop.apps.PageSequenceResults;

/**
 * This class demonstrates the conversion of an FO file to PDF using FOP.
 */
public class ExampleFO2PDF {

    /**
     * Converts an FO file to a PDF file using FOP
     * @param fo the FO file
     * @param pdf the target PDF file
     * @throws IOException In case of an I/O problem
     * @throws FOPException In case of a FOP problem
     */
    public void convertFO2PDF(File fo, File pdf) throws IOException, FOPException {

        OutputStream out = null;

        try {
            // Construct fop with desired output format
            Fop fop = new Fop(MimeConstants.MIME_PDF);

            // Setup output stream.  Note: Using BufferedOutputStream
            // for performance reasons (helpful with FileOutputStreams).
            out = new FileOutputStream(pdf);
            out = new BufferedOutputStream(out);
            fop.setOutputStream(out);

            // Setup JAXP using identity transformer
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(); // identity transformer

            // Setup input stream
            Source src = new StreamSource(fo);

            // Resulting SAX events (the generated FO) must be piped through to FOP
            Result res = new SAXResult(fop.getDefaultHandler());

            // Start XSLT transformation and FOP processing
            transformer.transform(src, res);

            // Result processing
            FormattingResults foResults = fop.getResults();
            java.util.List pageSequences = foResults.getPageSequences();
            for (java.util.Iterator it = pageSequences.iterator(); it.hasNext();) {
                PageSequenceResults pageSequenceResults = (PageSequenceResults)it.next();
                System.out.println("PageSequence "
                        + (String.valueOf(pageSequenceResults.getID()).length() > 0
                                ? pageSequenceResults.getID() : "  id")
                        + " generated " + pageSequenceResults.getPageCount() + " pages.");
            }
            System.out.println("Generated " + foResults.getPageCount() + " pages in total.");

        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        } finally {
            // out is still null if opening the output file failed
            if (out != null) {
                out.close();
            }
        }
    }


    /**
     * Main method.
     * @param args command-line arguments
     */
    public static void main(String[] args) {
        try {
            System.out.println("FOP ExampleFO2PDF\n");
            System.out.println("Preparing...");

            //Setup directories
            File baseDir = new File(".");
            File outDir = new File(baseDir, "out");
            outDir.mkdirs();

            //Setup input and output files
            File fofile = new File(baseDir, "exampleWordbasedXSLFO.fo");
            File pdffile = new File(outDir, "ResultFO2PDF.pdf");

            System.out.println("Input: XSL-FO (" + fofile + ")");
            System.out.println("Output: PDF (" + pdffile + ")");
            System.out.println();
            System.out.println("Transforming...");

            ExampleFO2PDF app = new ExampleFO2PDF();
            app.convertFO2PDF(fofile, pdffile);

            System.out.println("Success!");
        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        }
    }
}

It is also good to see that, after two years of apparently comatose existence, FOP regained consciousness with this 0.91 release in late 2005!


About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director for Fusion Middleware. Consultant, trainer and instructor in diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, BPM, ADF, Java in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 11g Handbook. Frequent presenter at conferences such as JavaOne, Oracle OpenWorld, ODTUG Kaleidoscope, Devoxx and OBUG. Presenter for Oracle University Celebrity specials.

20 Comments

  1. Hey nice post,
    until now I hadn’t noticed that Word documents can be saved as xsl:fo. Thanks for that!
    Greetings
    jpee

  2. Hey, really great blog. Seriously, I searched for a week. Finally I found a way through XSL-FO and the Apache FOP APIs. All in all a fine learning journey that once again ends at the AMIS Technology blog. Thanks to you all.

    Thanks,
    Rajesh

  3. Hey all, interesting forum. Try googling ooo2xslfo, it’s the Open Office implementation of saving a doc as xsl:fo. I tested similar components in Word, AbiWord and Open Office and found that with Apache FOP the PDF turned out best with the Open Office solution, but it really depends on what you need to do. I think AbiWord comes with the feature built in.

  4. I’m looking for some sample code to get PDF from Word; I want virtual printers or third-party software. I was thinking about Word–>XSL:FO–>PDF

    Do you know how to get XSL:FO from Word in code, without opening a winword process?

    If not, do you know any other way to get that?

    Thank you.

  5. I’m looking for a solution for following requirement.
    We have RTF-documents loaded into Oracle 9i CLOB on HP-UX. The user wants to view these CLOB via a JEE-application using Acrobat Reader. The printing functionality within the reader will then be used. They also ask for the possibility of loading the PDF (as a result of the RTF conversion) as a BLOB in Oracle 9i.
    The best way to do this would be using PL/SQL to convert the CLOB containing RTF to the BLOB containing the PDF.
    If this isn’t possible the next solution would be doing the conversion somewhere in the Oracle iAS (JEE).
    I didn’t understand Lucas’s article all the way. But maybe it contains an answer to my question. Could anyone comment on that?
    Thanks in advance!

  6. Hello. I am not able to open any Word file. I have Windows XP and Office XP installed on my laptop. I have tried repairing, reinstalling and everything else I could think of. I also have Acrobat Writer 7 installed. Is there any solution to this? Is Acrobat Writer linked to the problem? Kindly suggest some alternative or solution.

  7. How can we convert an XSL-FO document to HTML and, after some format changes, convert the HTML document back to XSL-FO without losing the formatting info?

    Any suggestion/comment on this would be of great help

  8. Joshua Smith:

    Great article. Very informative.

    Your page displays fine in Safari 1.3.2 on the Mac. You really ought to switch your browser compatibility code to test for specific DOM and JavaScript functionality instead of detecting the browser name and version number. Your JavaScript alert that says that my browser (which it identifies as Mozilla) might not be supported is annoying and excludes thousands of people who refuse to use MSIE.

    Again, great article. Just clean up that browser support code! Thanks.

  9. Leo,

    The original problem Lucas described is not how to convert a Word document (or Open Office document) to a PDF; we all agree that’s easy. The problem is how to programmatically create a PDF from a user-defined form. For example, say the legal department wants to come up with a template that the sales people in the field can fill out on the fly, while keeping all the niceties like headers, footers, pagination, inline images, and so forth.

    One might say to simply use Adobe Acrobat, but that solution has its own share of problems: most people are comfortable with a Word or Word-like interface, which Acrobat does not have; Acrobat fields do not “grow and shrink” with the text (you have to pre-define the field size, and if the text is longer than what you anticipated, you get to choose either truncation or shrunken font size); variable length tables are difficult to deal with; Acrobat costs several hundred (US) dollars; etc.

    This is actually a pretty tough problem. Thanks for the article, Lucas.

  10. Why not just open the Word Document in Open Office and click on the “PDF” button ?

  11. Rick Stephens:

    You didn’t delve into this deeply enough.

    Using XSL-FO to get from WordML to PDF works only sort-of. There are many things that fail, because FO doesn’t live in the same world as Word. Try some more complex documents (hint: one with a table of contents, or one that uses tabs, or bullets, or line numbers, or …).

    If you are going to use this technique, you better research it more. There are a number of commercial companies trying to crack this, and with all of their work, still have problems with true fidelity.

  12. We use OpenOffice.org as well. Since OOo 2 came out, the support for a huge
    number of Word’s features has been quite excellent and, for the past year or so,
    quite stable.

    In conjunction with iText for further PDF work it becomes a very powerful way of automating
    a Doc -> PDF workflow for not too much money and without users having to do anything with XML.

  13. As a matter of fact, Tim, we had a Knowledge Center session on XML Publisher just yesterday. It does indeed cover a lot of the functionality we require. My colleague Marcos had prepared demonstrations and clarifications and I was very impressed.

    However, the current pricing & licensing strategy for XML Publisher is very prohibitive. At the currently quoted rates of $40.000 – or $30.000 if you already have an Enterprise Edition of
    the Oracle Application Server – we will find few customers interested.

    We very much hope prices will come down once the product has matured somewhat. It seems that Oracle is not ready to do much work on Oracle Reports, so XML Publisher
    seems to be its logical successor. For that to really be a viable option, however, the price simply must come down!

    regards,

    Lucas

  14. You could use OpenOffice for conversion on IAS. It has a great PDF renderer and can
    also open DOC files. ODF is open to the public (zipped XML). So we made a simple
    Java program which calls the OO service and does the conversion on IAS (a kind of odt2pdf).

  15. I found this document a while ago, while searching for a way to go the other direction – from XSL-FO to Word (really, to .RTF). Ironically, I never did find a handy way to do that – instead, I’m looking at generating PDF from XSL-FO, then using a third-party proprietary tool for PDF->RTF conversion. Clumsy, but it seems to be what I’m stuck with for now. I did try Apache’s FOP with RTF specified as the output format, but got show-stopping cryptic Java errors. That was a few months ago, when the RTF output functionality was new; perhaps it’s been smoothed out since then.

  16. Great post, it looks very good. Only one little thing: the code would look nicer
    if it were formatted, but I suppose that was a mistake.