Converting Word documents to XSL-FO (and onwards to PDF)

Lucas Jellema March 22, 2006 Java, Oracle, Oracle WebLogic Server, XML 20 Comments

In the not too distant past, I had to implement solutions for generating PDF documents based on dynamic data and a document template defined by the end user. The approach we took was to let the end user create the document layout in MS Word, embedding simple tags to indicate the position of dynamic data elements. The Word document had to be saved as HTML, was cleansed into proper XHTML by JTidy, and was subsequently turned into an XSLT stylesheet that consisted largely of XSL-FO statements with small pieces of XSLT embedded to inject the dynamic data elements. Applying that stylesheet to the data produced a proper XSL-FO document, which was finally transformed into PDF using Apache FOP. The XSLT that turned XHTML into XSLT-with-a-lot-of-XSL-FO was really the essence of the solution. Translating HTML into XSL-FO was key.
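
As a rough illustration of what such a generated stylesheet looks like, here is a minimal sketch in Java using the standard JAXP API. The stylesheet content, the `letter`/`customer` element names and the class name are made up for this example; the stylesheets in the solution described above were generated from the Word template and were far larger.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TemplateToFoSketch {

    // Hypothetical generated stylesheet: mostly literal XSL-FO, with a single
    // xsl:value-of at the spot where the end user placed a tag in the template.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'"
        + " page-height='29.7cm' page-width='21cm'>"
        + "<fo:region-body margin='2cm'/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<fo:block>Dear <xsl:value-of select='/letter/customer'/>,</fo:block>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the dynamic data and returns the XSL-FO text.
    static String toFo(String dataXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(dataXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample data; in a real application this would come from the database.
        System.out.println(toFo("<letter><customer>Jane Doe</customer></letter>"));
    }
}
```

The resulting string is a plain XSL-FO document with the data already merged in, which is the form an FO renderer such as FOP expects.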

Yesterday I came across a very interesting article on MSDN, discussing a way of turning a Word document directly into an XSL-FO document: Transforming Word Documents into the XSL-FO Format (February 2005). It shows how you can save a Word document as XML and specify a stylesheet to be applied when saving. I had not known before that my Word 2003 was also an XSLT processor! The article introduces a stylesheet, Word2FO.xsl, that can be used for the transformation into XSL-FO. I decided to give it a spin.

A short intro on XSL-FO

In 2001, before the advent of WordprocessingML, the W3C endorsed an XML formatting language known as XSL Formatting Objects (XSL-FO). XSL-FO is synonymous with eXtensible Stylesheet Language (XSL), one of three recommendations by the W3C’s XSL working group: XSLT, XPath and XSL-FO. XSL-FO is an intermediate form that results from applying an XSLT style sheet to an XML structured document. The XSL-FO form describes how pages appear when presented to a reader, such as a Web browser. Currently, there are no readers that directly interpret an XSL-FO document. To interpret them, you must run them through a formatter, along with other data, such as graphics and font metrics, to create a final displayable or printable file. Possible formats for the resulting file include Adobe’s Portable Document Format (PDF) and Hypertext Markup Language (HTML).

When compared to Cascading Style Sheets (CSS), XSL-FO provides a more sophisticated visual layout model. You can use CSS to apply specific style elements to an XML or HTML document. By contrast, XSL-FO is a language for describing a complete document. It includes everything needed to paginate and format a document. Some of the formatting supported by XSL-FO, but not by CSS, includes right-to-left and top-to-bottom text, footnotes, margin notes, page numbers in cross-references, and more. Note that while CSS is primarily intended for use on the Web, XSL-FO is designed for broader use. As an example, you could use an XSL-FO document to lay out an XML document as a printed book. You could write a completely separate XSLT style sheet to transform the same XML document into HTML.
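
To make this a little more concrete, here is a minimal sketch of a complete XSL-FO document: a page master that describes the page geometry, and a page sequence whose flow holds the content. It is held in a Java string and parsed with the JDK's DOM parser, purely to check that it is well formed and to count its blocks. The document content and the class name are invented for this illustration and are not taken from the MSDN article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class MinimalFoDocument {

    // The XSL-FO namespace, as used by the fo: prefix.
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // A small, hand-written FO document: one A4 page master, one block of text.
    static final String MINIMAL_FO =
        "<fo:root xmlns:fo='" + FO_NS + "'>\n"
        + "  <fo:layout-master-set>\n"
        + "    <fo:simple-page-master master-name='page'\n"
        + "        page-height='29.7cm' page-width='21cm'>\n"
        + "      <fo:region-body margin='2cm'/>\n"
        + "    </fo:simple-page-master>\n"
        + "  </fo:layout-master-set>\n"
        + "  <fo:page-sequence master-reference='page'>\n"
        + "    <fo:flow flow-name='xsl-region-body'>\n"
        + "      <fo:block font-size='12pt'>Hello, XSL-FO</fo:block>\n"
        + "    </fo:flow>\n"
        + "  </fo:page-sequence>\n"
        + "</fo:root>\n";

    // Parses the text with a namespace-aware DOM parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Counts the fo:block elements, the basic unit of paragraph-level content.
    static int countBlocks(Document doc) {
        return doc.getElementsByTagNameNS(FO_NS, "block").getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(MINIMAL_FO);
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        System.out.println("Number of fo:block elements: " + countBlocks(doc));
    }
}
```

Parsing only proves the XML is well formed; whether the layout is valid and renders as intended is up to the formatter.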

Converting Word documents to XSL-FO

As per the instructions in the MSDN article, I download the Word2FO.xsl stylesheet, along with several supporting templates, and install them on my local hard drive. I then open Word 2003 and create a simple document with several layout features that are bound to pose a challenge for the conversion to XSL-FO. I then select Save As from the File menu.

The document type is XML. This brings up a checkbox, Apply Transform. When it is checked, the Transform button is enabled. When I press it, I can browse the file system to locate an XSL(T) stylesheet. It turns out Word is also an XSLT transformation engine! That was news to me. Well, I select the Word2FO.xsl stylesheet. When I finally press Save, the document is saved as XSL-FO.
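
The same kind of transformation can in principle be run outside Word. The sketch below applies a stylesheet file to an XML file with the standard JAXP API. It assumes the document was first saved from Word as plain XML (without a transform) and that Word2FO.xsl runs unchanged under the JDK's XSLT processor, which I have not verified. The class name and the file names in the usage line are placeholders.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ApplyStylesheetSketch {

    // Applies the given XSLT stylesheet to the XML input and writes the result to out.
    static void transform(File xml, File xslt, File out) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(xslt));
        transformer.transform(new StreamSource(xml), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            // Placeholder file names, for illustration only.
            System.out.println(
                "Usage: java ApplyStylesheetSketch document.xml Word2FO.xsl document.fo");
            return;
        }
        transform(new File(args[0]), new File(args[1]), new File(args[2]));
        System.out.println("Wrote " + args[2]);
    }
}
```

Running the transformation this way would avoid the manual Save As step, at the price of depending on how well the stylesheet behaves outside Word.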

Thus I have found a very rapid way of turning a layout created by an end user in Word into XSL-FO format. It is certainly an easy way to find out the XSL-FO syntax for certain layout features: it beats googling for the exact XSL-FO syntax for certain layout properties. It would also give me an interesting alternative for the solution I described above: instead of saving Word as HTML, tidying it to XHTML and XSLT-transforming it into an XSLT stylesheet riddled with XSL-FO (relying heavily on the custom XHTML-to-XSL-FO stylesheet), I could pick up the XSL-FO created by Word and embed it in a prepared XSLT.

Transforming onwards to PDF

I cannot quickly tell from this XSL-FO content whether the conversion was successful and complete. That is something I would typically leave to XSL-FO rendering engines, such as Antenna House or Apache FOP. I typically use Apache FOP (see http://xmlgraphics.apache.org/fop/), an open source FO renderer. FOP can render to PDF as well as SVG, PS and RTF.

I rendered the XSL-FO document created by Word as PDF, using the stable 0.20.5 FOP release (July 2003!).

Well, not bad. However, we lost a couple of details, most notably the column layout in the original Word document. A couple of fonts are missing as well: no Arial (or any sans-serif font, for that matter) ends up in the PDF. Of course we cannot tell from this example whether the FOP renderer is limited or the XSL-FO was faulty. I will give it a try with the latest (beta) 0.9 release of FOP (December 2005).

I have tried with FOP 0.91 and the result is very similar. Still no column layout, still no sans-serif fonts. I do get warnings about the fonts that should allow me to correct the situation (it seems that proper, exactly matching font definitions need to be configured with FOP):

WARNING: Warning(1/6184): fo:table, table-layout="auto" is currently not supported by FOP
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,normal,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,normal,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'Arial,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,400' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'TimesNewRoman,italic,700' not found. Substituting with default font.
Mar 22, 2006 11:32:12 AM org.apache.fop.fonts.FontInfo notifyFontReplacement
WARNING: Font 'CourierNew,normal,400' not found. Substituting with default font.
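
Each of these warnings names the missing font as a family, style and weight triplet, and those are the combinations a font definition would have to match. As a small convenience, the sketch below pulls the triplets out of such log text with a regular expression. It relies only on the message format shown above; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingFontTriplets {

    // Matches messages of the form: Font 'Arial,normal,700' not found
    private static final Pattern MISSING =
        Pattern.compile("Font '([^,']+),([^,']+),(\\d+)' not found");

    // Returns one {family, style, weight} array per warning found in the log text.
    static List<String[]> parse(String log) {
        List<String[]> triplets = new ArrayList<>();
        Matcher matcher = MISSING.matcher(log);
        while (matcher.find()) {
            triplets.add(new String[] {
                matcher.group(1), matcher.group(2), matcher.group(3) });
        }
        return triplets;
    }

    public static void main(String[] args) {
        // Two of the warnings from the FOP run above.
        String log =
            "WARNING: Font 'TimesNewRoman,normal,400' not found. Substituting with default font.\n"
            + "WARNING: Font 'Arial,normal,700' not found. Substituting with default font.\n";
        for (String[] triplet : parse(log)) {
            System.out.println("family=" + triplet[0]
                + " style=" + triplet[1] + " weight=" + triplet[2]);
        }
    }
}
```

The resulting list is simply a checklist of the font variants that would need to be registered with FOP; how to register them depends on the FOP version.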

Note that it is very simple to write a Java class that can convert XSL-FO to PDF using Apache FOP. The code looks like this (derived from the example ExampleFO2PDF.java that is shipped with FOP):

package nl.amis.fop092;
/*
 * Copyright 1999-2005 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* $Id: ExampleFO2PDF.java 332791 2005-11-12 15:58:07Z jeremias $ */


// Java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

//JAXP
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Result;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.sax.SAXResult;


// FOP
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FOPException;
import org.apache.fop.apps.FormattingResults;
import org.apache.fop.apps.MimeConstants;
import org.apache.fop.apps.PageSequenceResults;

/**
 * This class demonstrates the conversion of an FO file to PDF using FOP.
 */
public class ExampleFO2PDF {

    /**
     * Converts an FO file to a PDF file using FOP
     * @param fo the FO file
     * @param pdf the target PDF file
     * @throws IOException In case of an I/O problem
     * @throws FOPException In case of a FOP problem
     */
    public void convertFO2PDF(File fo, File pdf) throws IOException, FOPException {

        OutputStream out = null;

        try {
            // Construct fop with desired output format
            Fop fop = new Fop(MimeConstants.MIME_PDF);

            // Setup output stream.  Note: Using BufferedOutputStream
            // for performance reasons (helpful with FileOutputStreams).
            out = new FileOutputStream(pdf);
            out = new BufferedOutputStream(out);
            fop.setOutputStream(out);

            // Setup JAXP using identity transformer
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(); // identity transformer

            // Setup input stream
            Source src = new StreamSource(fo);

            // Resulting SAX events (the generated FO) must be piped through to FOP
            Result res = new SAXResult(fop.getDefaultHandler());

            // Start XSLT transformation and FOP processing
            transformer.transform(src, res);

            // Result processing
            FormattingResults foResults = fop.getResults();
            java.util.List pageSequences = foResults.getPageSequences();
            for (java.util.Iterator it = pageSequences.iterator(); it.hasNext();) {
                PageSequenceResults pageSequenceResults = (PageSequenceResults)it.next();
                System.out.println("PageSequence "
                        + (String.valueOf(pageSequenceResults.getID()).length() > 0
                                ? pageSequenceResults.getID() : "<no id>")
                        + " generated " + pageSequenceResults.getPageCount() + " pages.");
            }
            System.out.println("Generated " + foResults.getPageCount() + " pages in total.");

        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        } finally {
            // out is still null if an error occurred before the stream was opened
            if (out != null) {
                out.close();
            }
        }
    }


    /**
     * Main method.
     * @param args command-line arguments
     */
    public static void main(String[] args) {
        try {
            System.out.println("FOP ExampleFO2PDF\n");
            System.out.println("Preparing...");

            //Setup directories
            File baseDir = new File(".");
            File outDir = new File(baseDir, "out");
            outDir.mkdirs();

            //Setup input and output files
            File fofile = new File(baseDir, "exampleWordbasedXSLFO.fo");
            File pdffile = new File(outDir, "ResultFO2PDF.pdf");

            System.out.println("Input: XSL-FO (" + fofile + ")");
            System.out.println("Output: PDF (" + pdffile + ")");
            System.out.println();
            System.out.println("Transforming...");

            ExampleFO2PDF app = new ExampleFO2PDF();
            app.convertFO2PDF(fofile, pdffile);

            System.out.println("Success!");
        } catch (Exception e) {
            e.printStackTrace(System.err);
            System.exit(-1);
        }
    }
}

It is also good to see that, after two years of apparently comatose existence, FOP has regained consciousness with this 0.91 release in late 2005!

About The Author

Lucas Jellema

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director and Oracle Developer Champion. Solution architect and developer on diverse areas including SQL, JavaScript, Kubernetes & Docker, Machine Learning, Java, SOA and microservices, events in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 12c Handbook. Frequent presenter on user groups and community events and conferences such as JavaOne, Oracle Code, CodeOne, NLJUG JFall and Oracle OpenWorld.

20 Comments

jpee November 30, 2010

Hey nice post,
until now I didn’t notice that Word documents can be saved as xsl:fo. Thanks for that!
Greetings
jpee
Rajesh September 14, 2009

Hey, really great blog. Seriously I made search for a week. Finally I got a way through xsl fo and apache fop apis. Over all is fine learning journey again ends with AMIS Technology blog. Thanks to you all.

Thanks,
Rajesh
Matt May 19, 2008

Hey all, interesting forum. Try googling ooo2xslfo, it’s the Open Office implementation of saving a doc as xsl:fo. I tested similar components in Word, AbiWord and Open Office and found that with Apache FOP the PDF turned out best with the Open Office solution, but it really depends on what you need to do. I think AbiWord comes with the feature built in.
Nicro September 26, 2007

Sorry, mistake. I mean i dont want virtualy printers neither terciary softwares.
Nicro September 26, 2007

I’m looking for some sample code to get PDF from Word, i want virtualy printers or terciary sotfwares. I was thinking about Word–>XSL:FO–>PDF

Do you know how get XSL:FO from Word by code and without open winword process???

No, do you know any other way to get that?

Thanks you.
Fan Timmermans April 28, 2007

I’m looking for a solution for following requirement.
We have RTF-documents loaded into Oracle 9i CLOB on HP-UX. The user wants to view these CLOB via a JEE-application using Acrobat Reader. The printing functionality within the reader will then be used. They also ask for the possibility of loading the PDF (as a result of the RTF conversion) as a BLOB in Oracle 9i.
The best way to do this would be using PL/SQL to convert the CLOB containing RFT to the BLOB containing the PDF.
If this isn’t possible the next solution would be doing the conversion somewhere in the Oracle iAS (JEE).
I didn’t understand Lucas article all the way. But maybe it contains an answer to my question. Could anyone comment on that?
Thanks in advance!
Aasif Kham's November 30, 2006

Hello.. I am not able to open any word file. i have got Windows XP and Office XP installed in my laptop. I have tried repairing, reinstalling and every possible stuff i could. I also have Acrobat Writer 7 installed in it. is their any solution to this? is Acrobat writer linked with the problem?? Kindly suggest some alternative or solution…
Meera May 2, 2006

How can we convert XSL-FO document to HTML and after some format changes can we convert the HTML doc back to XSL-FO without loosing the formatting info ?

Any suggestion/comment on this would be of great help
Joshua Smith April 15, 2006

Great article. Very informative.

Your page displays fine in Safari 1.3.2 on the Mac. You really ought to switch your browser compatability code to test for specific DOM and JavaScript functionality instead of detecting the browser name and version number. Your Javascript alert that says that my browser (which it identifies as Mozilla) might not be supported is annoying and excludes thousands of people that refuse to use MSIE.

Again, great article. Just clean up that browser support code! Thanks.
Jason Brice April 12, 2006

Leo,

The original problem Lucas described is not how to convert a Word Document (or Open Office document) to a PDF, we all agree that’s easy. The problem is how to programmatically create a PDF from a user-defined form. For example, say the legal department wants to come up with a template that the sales people in the field can fill out on the fly, and keep all the nicities like headers, footers, pagination, inline images, and so forth.

One might say to simply use Adobe Acrobat, but solution that has it’s own share of problems: most people are comfortable with a Word or Word-like interface which Acrobat does not have, Acrobat fields do not “grow and shrink” with the text (you have to pre-define the field size, and if the text is longer than what you anticipated, you get to choose either truncation or shrunken font size), variable length tables are difficult to deal with, Acrobat costs several hundred (US) dollars, etc.

This is actually a pretty tough problem. Thanks for the article, Lucas.
Leo April 5, 2006

Why not just open the Word Document in Open Office and click on the “PDF” button ?
Rick Stephens April 4, 2006

You didn’t delve into to this deep enough.

Using XML-FO to get from WordML to PDF works only sort-of. There are many things that fail, because FO doesn’t live in the same world as Word. Try some more complex documents (hint: one with a Table of contents, or one that uses Tabs, or bullets, or line numbers, or …).

If you are going to use this technique, you better research it more. There are a number of commercial companies trying to crack this, and with all of their work, still have problems with true fidelity.
Kishore April 4, 2006

I need to know is this converting word to xls to pdf thingi is free… if so can u give me this software.
kishoreyc@gmail.com.

If its not free.. then make this software freely available.
Nathaniel April 4, 2006

We use OpenOffice.org as well, since OOo 2 came out the support for a huge
amount of Word’s features is quite excellent and for the past year or so
quite stable.

In conjunction with iText for further PDF work it becomes a very powerful way of automating
Doc -> PDF workflow for not too much money and without users having to do anything XML.
Lucas Jellema March 28, 2006

As a matter of fact, Tim, we had a Knowledge Center session on XML Publisher just yesterday. It does indeed cover a lot of the functionality we require. My colleague Marcos had prepared demonstrations and clarifications and I was very impressed.

However, the current pricing & licensing strategy for XML Publisher is very prohibitive. At the currently quoted rates of $40.000 (or $30.000 if you already have an Enterprise Edition of the Oracle Application Server), we will find few customers interested.

We very much hope prices will come down once the product has matured somewhat. It seems that Oracle is not ready to do much work on Oracle Reports, so XML Publisher seems to be its logical successor. For that to really be a viable option, however, the price simply must come down!

regards,

Lucas
Tim Dexter March 28, 2006

Did you folks check out Oracle XML Publisher, you can use an RTF document as
a report template and not only convert the RTF doc contents but also embed data tags
that are merged in from an XML datasource at runtime. You can then generate PDF,
HTML, RTF and Excel as output formats.
http://www.oracle.com/technology/products/applications/publishing/index.html
Currently available in the Oracle E Business Suite and as a standalone library.
petercr4 March 23, 2006

You could use OpenOffice for conversion on IAS. It has a great PDF renderer and can
also open DOC files. ODF is open to public (zipped XML). So, we made a simple
java program which calls OO service and makes conversion on IAS (kind of odt2pdf).
Catherine Devlin March 22, 2006

I found this document a while ago, while searching for a way to go the other direction – from XSL-FO to Word (really, to .RTF). Ironically, I never did find a handy way to do that – instead, I’m looking at generating PDF from XSL-FO, then using a third-party proprietary tool for PDF->RTF conversion. Clumsy, but it seems to be what I’m stuck with for now. I did try Apache’s FOP with RTF specified as the output format, but got show-stopping cryptic Java errors. That was a few months ago, when the RTF output functionality was new; perhaps it’s been smoothed out since then.
Lucas Jellema March 22, 2006

I had mistakenly used code tags instead of pre. Our WYSIWYG editor let us down. Thanks for the hint.
scaamanho March 22, 2006

Great Post, it looks very good, only a little thing… the code will be look more nice
formated, but i supose that was a mistake.
