Converting Word documents to XSL-FO (and onwards to PDF)

In the not too distant past, I had to implement solutions for generating PDF documents based on dynamic data and a document template defined by the end user. The approach we took was to let the end user create the document layout in MS Word, embedding simple tags to indicate the position of dynamic data elements. The Word document had to be saved as HTML, was cleansed into proper XHTML by JTidy, and was subsequently turned into an XSLT stylesheet that consisted largely of XSL-FO statements, with small pieces of XSLT embedded to inject the dynamic data elements. Applying that stylesheet to the data produced a proper XSL-FO document, which was finally transformed into PDF using Apache FOP. The XSLT that turned XHTML into XSLT-with-a-lot-of-XSL-FO was really the essence of the solution: translating HTML into XSL-FO was key.

 

Yesterday I came across a very interesting article on MSDN that discusses a way of turning a Word document directly into an XSL-FO document: Transforming Word Documents into the XSL-FO Format (February 2005). It shows how you can save a Word document as XML and specify a stylesheet to be applied when saving. I had not known that my Word 2003 was also an XSLT processor! The article introduces a stylesheet, Word2FO.xsl, that can be used for the transformation into XSL-FO. I decided to give it a spin.

A short intro on XSL-FO

In 2001, before the advent of WordprocessingML, the W3C endorsed an XML formatting language known as XSL Formatting Objects (XSL-FO). XSL-FO is often treated as synonymous with the eXtensible Stylesheet Language (XSL) and is one of three recommendations produced by the W3C’s XSL working group: XSLT, XPath and XSL-FO. XSL-FO is an intermediate form that results from applying an XSLT style sheet to an XML structured document. The XSL-FO form describes how pages appear when presented to a reader, such as a Web browser. Currently, there are no readers that directly interpret an XSL-FO document. To interpret one, you must run it through a formatter, along with other data such as graphics and font metrics, to create a final displayable or printable file. Possible formats for the resulting file include Adobe’s Portable Document Format (PDF) and Hypertext Markup Language (HTML).
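To make this a bit more concrete, here is a minimal sketch of what a one-page XSL-FO document looks like. It is built as a Java string and parsed with the standard JAXP parser only to show that it is well-formed XML in the XSL-FO namespace. The class name, page size and text are made up for illustration; nothing here comes from the MSDN article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class MinimalFo {

    /** The namespace that identifies XSL-FO elements. */
    public static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Returns a one-page XSL-FO document (illustrative page size and text). */
    public static String minimalFo() {
        return "<fo:root xmlns:fo=\"" + FO_NS + "\">"
             // The layout master set describes the page geometry
             + "<fo:layout-master-set>"
             + "<fo:simple-page-master master-name=\"A4\""
             + " page-height=\"29.7cm\" page-width=\"21cm\">"
             + "<fo:region-body/>"
             + "</fo:simple-page-master>"
             + "</fo:layout-master-set>"
             // A page sequence flows content into pages of that geometry
             + "<fo:page-sequence master-reference=\"A4\">"
             + "<fo:flow flow-name=\"xsl-region-body\">"
             + "<fo:block font-family=\"serif\" font-size=\"12pt\">Hello, XSL-FO</fo:block>"
             + "</fo:flow>"
             + "</fo:page-sequence>"
             + "</fo:root>";
    }

    /** Parses the given XML with a namespace-aware JAXP parser. */
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(minimalFo());
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Parsing only proves the document is well-formed; whether it is valid XSL-FO is for a formatter to decide.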

When compared to Cascading Style Sheets (CSS), XSL-FO provides a more sophisticated visual layout model. You can use CSS to apply specific style elements to an XML or HTML document. By contrast, XSL-FO is a language for describing a complete document. It includes everything needed to paginate and format a document. Formatting supported by XSL-FO, but not by CSS, includes right-to-left and top-to-bottom text, footnotes, margin notes, page numbers in cross-references, and more. Note that while CSS is primarily intended for use on the Web, XSL-FO is designed for broader use. As an example, you could use an XSL-FO document to lay out an XML document as a printed book. You could write a completely separate XSLT stylesheet to transform the same XML document into HTML.

Converting Word documents to XSL-FO

As per the instructions in the MSDN article, I download the Word2FO.xsl stylesheet, together with several supporting templates, and install them on my local hard drive. I then open Word 2003 and create a simple document with several layout features that are bound to pose a challenge to the conversion to XSL-FO. I then select Save As from the File menu.
[Screenshot: the Save As dialog in Word 2003]

I set the document type to XML, which brings up an Apply Transform checkbox. When it is checked, the Transform button is enabled. Pressing that button lets me browse the file system to locate an XSL(T) stylesheet. It turns out Word is also an XSLT transformation engine! That was news to me. I select the Word2FO.xsl stylesheet. When I finally press Save, the document is saved as XSL-FO:
[Screenshot: the XSL-FO document saved by Word]
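What Word does here can probably also be done outside Word with any XSLT processor. The sketch below uses the JAXP API that ships with Java to apply a stylesheet to an XML document. The file names are assumptions (a Word document saved as plain XML without a transform, and the downloaded stylesheet), and it presumes that Word2FO.xsl does not depend on processor-specific extensions, which is not verified here.

```java
import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ApplyStylesheet {

    /** Applies the stylesheet to the input and returns the result as a string. */
    public static String transform(Source stylesheet, Source input) {
        try {
            // Compile the stylesheet into a Transformer
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer(stylesheet);
            // Collect the output in memory
            StringWriter out = new StringWriter();
            transformer.transform(input, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default file names; override them on the command line
        File xsl = new File(args.length > 0 ? args[0] : "Word2FO.xsl");
        File xml = new File(args.length > 1 ? args[1] : "document.xml");
        System.out.println(transform(new StreamSource(xsl), new StreamSource(xml)));
    }
}
```

Passing files to StreamSource (rather than open streams) lets the processor resolve relative imports, which matters if the stylesheet pulls in supporting templates from the same directory.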

Thus I have found a very rapid way of turning a layout created by an end user in Word into XSL-FO format. It is certainly an easy way to find out the XSL-FO syntax for particular layout features: it beats googling for the exact XSL-FO syntax of certain layout properties. It would also give me an interesting alternative for the solution I described above: instead of saving Word as HTML, tidying it to XHTML and XSLT-transforming it into an XSLT stylesheet riddled with XSL-FO (relying heavily on the custom XHTML-to-XSL-FO stylesheet), I could pick up the XSL-FO created by Word and embed it in a prepared XSLT.
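As a rough illustration of that last idea, the sketch below wraps a single fo:block in an XSLT template and uses xsl:value-of to inject a dynamic value. The stylesheet, the /letter/customer data structure and the class name are all hypothetical; a real template would wrap the complete XSL-FO document produced by Word rather than one block.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FoTemplate {

    // XSL-FO layout with one small piece of XSLT embedded to inject the data
    private static final String TEMPLATE =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<fo:block font-family=\"Arial\" font-weight=\"bold\">"
        + "Dear <xsl:value-of select=\"/letter/customer\"/>"
        + "</fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the template to the data XML and returns the XSL-FO fragment. */
    public static String fill(String dataXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TEMPLATE)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(dataXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not fill the template", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fill("<letter><customer>Alice</customer></letter>"));
    }
}
```

The output is the fo:block with the customer name filled in, ready to be handed to an FO renderer once it sits inside a complete fo:root.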

Transforming onwards to PDF

I cannot quickly tell from this XSL-FO content whether the conversion was successful and complete. That is something I would typically leave to an XSL-FO rendering engine such as Antenna House or Apache FOP. I usually use Apache FOP (see http://xmlgraphics.apache.org/fop/), an open source implementation of an FO renderer. FOP can render to PDF as well as SVG, PS and RTF.

The result of rendering the XSL-FO document created by Word as PDF, using the stable 0.20.5 FOP release (July 2003!), looks as follows:
[Screenshot: the PDF rendered by FOP 0.20.5]

Well, not bad. However, we lost a couple of details, most notably the column layout of the original Word document. A couple of fonts are missing too: no Arial (or any sans-serif font, for that matter) ends up in the PDF. Of course, we cannot tell from this example whether the FOP renderer is limited or the XSL-FO was faulty. I will give it a try with the latest (beta) 0.9 release of FOP (December 2005).

I have tried with FOP 0.91 and the result is very similar: still no column layout, still no sans-serif fonts. I do get warnings about the fonts that allow me to correct the situation (it seems that exactly matching font definitions need to be configured with FOP):

WARNING: Warning(1/6184): fo:table, table-layout="auto" is currently not supported by FOP
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,normal,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,normal,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'CourierNew,normal,400' not found. Substituting with default font.
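Each of these warnings names a font triplet (family, style, weight) that FOP could not resolve. For a large document it can help to collect the distinct triplets from the log before deciding which fonts to configure. The helper below is a small sketch of that idea using only a regular expression; it is not part of FOP, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingFonts {

    // Captures the triplet in messages such as:
    // WARNING: Font 'Arial,normal,700' not found. Substituting with default font.
    private static final Pattern NOT_FOUND =
            Pattern.compile("Font '([^']+)' not found");

    /** Returns the distinct font triplets reported as missing, in log order. */
    public static List<String> missingFonts(String log) {
        Set<String> seen = new LinkedHashSet<String>();
        Matcher matcher = NOT_FOUND.matcher(log);
        while (matcher.find()) {
            seen.add(matcher.group(1));
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        String log =
              "WARNING: Font 'TimesNewRoman,normal,400' not found. Substituting with default font.\n"
            + "WARNING: Font 'Arial,normal,700' not found. Substituting with default font.\n"
            + "WARNING: Font 'Arial,normal,700' not found. Substituting with default font.\n";
        for (String triplet : missingFonts(log)) {
            System.out.println(triplet);
        }
    }
}
```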

Note that it is very simple to write a Java class that can convert XSL-FO to PDF using Apache FOP. The code looks like this (derived from the example ExampleFO2PDF.java that is shipped with FOP):

package nl.amis.fop092;
/*
 * Copyright 1999-2005 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* $Id: ExampleFO2PDF.java 332791 2005-11-12 15:58:07Z jeremias $ */


// Java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

//JAXP
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Result;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.sax.SAXResult;


// FOP
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FOPException;
import org.apache.fop.apps.FormattingResults;
import org.apache.fop.apps.MimeConstants;
import org.apache.fop.apps.PageSequenceResults;

/**
 * This class demonstrates the conversion of an FO file to PDF using FOP.
 */
public class ExampleFO2PDF {

    /**
     * Converts an FO file to a PDF file using FOP
     * @param fo the FO file
     * @param pdf the target PDF file
     * @throws IOException In case of an I/O problem
     * @throws FOPException In case of a FOP problem
     */
    public void convertFO2PDF(File fo, File pdf) throws IOException, FOPException {

        OutputStream out = null;

        try {
            // Construct fop with desired output format
            Fop fop = new Fop(MimeConstants.MIME_PDF);

            // Setup output stream.  Note: Using BufferedOutputStream
            // for performance reasons (helpful with FileOutputStreams).
            out = new FileOutputStream(pdf);
            out = new BufferedOutputStream(out);
            fop.setOutputStream(out);

            // Setup JAXP using identity transformer
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(); // identity transformer

            // Setup input stream
            Source src = new StreamSource(fo);

            // Resulting SAX events (the generated FO) must be piped through to FOP
            Result res = new SAXResult(fop.getDefaultHandler());

            // Start XSLT transformation and FOP processing
            transformer.transform(src, res);

            // Result processing
            FormattingResults foResults = fop.getResults();
            java.util.List pageSequences = foResults.getPageSequences();
            for (java.util.Iterator it = pageSequences.iterator(); it.hasNext();) {
                PageSequenceResults pageSequenceResults = (PageSequenceResults)it.next();
                System.out.println("PageSequence "
                        + (String.valueOf(pageSequenceResults.getID()).length() > 0
                                ? pageSequenceResults.getID() : "<no id>")
                        + " generated " + pageSequenceResults.getPageCount() + " pages.");
            }
            System.out.println("Generated " + foResults.getPageCount() + " pages in total.");

        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        } finally {
            // Only close the stream if it was actually opened
            if (out != null) {
                out.close();
            }
        }
    }


    /**
     * Main method.
     * @param args command-line arguments
     */
    public static void main(String[] args) {
        try {
            System.out.println("FOP ExampleFO2PDF\n");
            System.out.println("Preparing...");

            //Setup directories
            File baseDir = new File(".");
            File outDir = new File(baseDir, "out");
            outDir.mkdirs();

            //Setup input and output files
            File fofile = new File(baseDir, "exampleWordbasedXSLFO.fo");
            File pdffile = new File(outDir, "ResultFO2PDF.pdf");

            System.out.println("Input: XSL-FO (" + fofile + ")");
            System.out.println("Output: PDF (" + pdffile + ")");
            System.out.println();
            System.out.println("Transforming...");

            ExampleFO2PDF app = new ExampleFO2PDF();
            app.convertFO2PDF(fofile, pdffile);

            System.out.println("Success!");
        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        }
    }
}

It is also good to see that, after two years of apparently comatose existence, FOP has regained consciousness with this 0.91 release in late 2005!
