Writing a blog in Word, automating HTML formatting by using a .docx to HTML converter for Java and publishing the blog via WordPress (part 1)


At AMIS, gaining and sharing information and knowledge is part of our DNA, which is why the AMIS Technology Blog has existed for over 15 years. Together with my colleagues, I regularly write blog articles. We have used WordPress from the beginning, but the other tools we use have changed over the years; in recent years we have used, for example, Open Live Writer. Although some colleagues start in Open Live Writer, I prefer to start in Word (for reasons such as reuse, export facilities and printing). However, getting all the formatting of the text, and especially of the code samples, correct in the published blog is quite some work, because when text is copied from Word to Open Live Writer you lose some of the formatting, such as colors and Text Box borders.

Over the years I have used a helper tool, which I wrote in Java, to help me with this process. For example, it replaces a picture placeholder (text like 1.jpg) with the actual file URL used by WordPress (https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg), including the width and height.
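
The placeholder replacement described above can be sketched as follows. This is a minimal illustration and not the actual helper tool: the class name PlaceholderReplacer, the method replacePlaceholders, its urlPrefix, width and height parameters and the img output format are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderReplacer {

    // Matches a line that consists only of a picture placeholder such as "1.jpg".
    private static final Pattern PLACEHOLDER = Pattern.compile("(?m)^(\\d+)\\.jpg$");

    /**
     * Replaces each placeholder line with an img tag that points to the uploaded file.
     * urlPrefix, width and height are hypothetical parameters for this sketch.
     */
    static String replacePlaceholders(String text, String urlPrefix, int width, int height) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String img = "<img src=\"" + urlPrefix + matcher.group(1) + ".jpg\""
                    + " width=\"" + width + "\" height=\"" + height + "\" />";
            // quoteReplacement prevents '$' or '\' in the URL from being read as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(img));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "Some text\n1.jpg\nMore text";
        String prefix = "https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_";
        // The width and height values are made up for the example.
        System.out.println(replacePlaceholders(text, prefix, 702, 395));
    }
}
```

Only lines that consist solely of a placeholder are replaced, so a file name mentioned in the middle of a sentence is left alone.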

In this article, I will share the steps I took to further automate the manual steps I still had to take every time I wrote a blog article.

After searching the Internet, I came across Mammoth, a Java tool created by Michael Williamson that converts Word documents to simple and clean HTML.
[https://github.com/mwilliamson/java-mammoth]

Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There’s a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.

The following features are currently supported:

  • Headings.
  • Lists.
  • Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
  • Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
  • Footnotes and endnotes.
  • Images.
  • Bold, italics, underlines, strikethrough, superscript and subscript.
  • Links.
  • Line breaks.
  • Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
  • Comments.

Because I mostly use Styles, this tool looked appropriate for my goal.

To find out how it works, I used the Word document of my most recently published article on the AMIS Technology Blog.
[https://technology.amis.nl/]

The readme.md gave me a good starting point. Of course, I first had to figure out how everything works.
From GitHub I downloaded the Java sources zip file: java-mammoth-master.zip
[https://github.com/mwilliamson/java-mammoth.git]

In IntelliJ I created a new Project with dependencies managed by Maven. In the pom.xml I added the dependency mentioned in the readme.md. I also made sure that the correct Java compiler version (see: https://github.com/mwilliamson/java-mammoth/blob/master/pom.xml) is being used:
[in bold, I highlighted the changes]

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>mygroupid1</groupId>
    <artifactId>myproject1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.zwobble.mammoth</groupId>
            <artifactId>mammoth</artifactId>
            <version>1.4.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

I extracted the sources from the zip file (java-mammoth-master.zip\java-mammoth-master\src\main\java) into the src\main\java directory of my IntelliJ project.

Next I created a Java Class with a main method. Then I copied the example from the readme.md into the main method and fixed the imports that were missing and the unhandled exception.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.File;
import java.io.IOException;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("document.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion
    }
}

Then I copied my blog article Word document (renamed to myarticle.docx) to a local directory and changed the Java code accordingly:
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.File;
import java.io.IOException;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion
    }
}

Next, I added code to write the generated HTML to a file (named myarticle.html):
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}

Content previous blog article

Before finding out how this converter actually works, first let’s focus a bit on the content of my previous blog article. Here you can see a part of it in my Word document:

If you copy this part of the document into Open Live Writer, it looks like this:

So, you lose quite some formatting!

In the actual published blog article this looks like:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]

The HTML used in WordPress looks like:

That’s because I manually made it like that, as I explained at the beginning of this article.

With the current Java code, when I run it, the output is:

And after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):

As you may notice, the manually changed HTML from WordPress is very different from the HTML converter output. For example, in the converter output:

  • there are a lot of paragraph tags (<p>)
  • the emphasized text tag <strong> is used for bold font weight
  • the emphasized text tag <em> is used for italic font style
  • text colors like blue or purple are absent

Making changes to the default style mappings

I wanted to tackle some of these differences. Again, the readme.md gave me a good starting point.

By default, Mammoth maps some common .docx styles to HTML elements.

By default, bold text is wrapped in <strong> tags. This behaviour can be changed by adding a style mapping for b.

By default, italic text is wrapped in <em> tags. This behaviour can be changed by adding a style mapping for i.
[https://github.com/mwilliamson/java-mammoth]

In the table below, I summarize some of the behavior of the converter with regard to style.

Style | Default behavior of converter | Behavior wanted by me | Solution
Bold | wrapped in <strong> tags | wrapped in <b> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("b => b");
Italic | wrapped in <em> tags | wrapped in <i> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("i => i");
Underline | is ignored since underlining can be confused with links in HTML documents | wrapped in <u> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("u => u");
Strikethrough | is wrapped in <s> tags | default behavior |
Comments | are ignored | default behavior |

In order to get rid of the paragraph tags, for now I opted for using the replaceAll method on the output of the converter.
I also added code to implement the style mappings mentioned above:
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u");
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
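
To make the effect of the two replaceAll calls concrete, here is a small stand-alone sketch that applies the same replacement to a made-up HTML fragment. The class name ParagraphTagStripper and the sample HTML are assumptions for this example. The sketch uses String.replace, which treats its arguments as literal text; for these two patterns it gives the same result as replaceAll, because they contain no regular expression metacharacters.

```java
class ParagraphTagStripper {

    /** Removes opening paragraph tags and turns closing paragraph tags into line breaks. */
    static String stripParagraphTags(String html) {
        // Only the exact text "<p>" is matched; a tag with attributes, such as <p class="x">, is not.
        return html.replace("<p>", "").replace("</p>", "\n");
    }

    public static void main(String[] args) {
        String html = "<p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p>";
        System.out.print(stripParagraphTags(html));
    }
}
```

Each paragraph ends up on its own line, while inline tags such as <b> are left untouched.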

With the current Java code, when I run it, and after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):

Office Open XML

Before diving into the code, it is useful to have some understanding of Office Open XML.

Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
[https://en.wikipedia.org/wiki/Office_Open_XML]
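
Because the format is zipped, the XML parts of a document can be read with the standard java.util.zip classes. The sketch below is an illustration only: it builds a tiny zip in memory with a single word/document.xml entry (the entry that typically holds the main body of a Word document) and reads it back. The result is not a valid .docx file, and the class and method names are assumptions for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxZipSketch {

    /** Builds a tiny in-memory zip with a single word/document.xml entry (not a valid .docx). */
    static byte[] buildZip(String documentXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the content of the named entry as text, or null if there is no such entry. */
    static String readEntry(InputStream in, String entryName) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream content = new ByteArrayOutputStream();
                    byte[] buffer = new byte[4096];
                    int read;
                    while ((read = zip.read(buffer)) != -1) {
                        content.write(buffer, 0, read);
                    }
                    return new String(content.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] zip = buildZip("<w:document/>");
        System.out.println(readEntry(new ByteArrayInputStream(zip), "word/document.xml"));
    }
}
```

Passing a FileInputStream for a real .docx file to readEntry should work the same way, since the method only relies on the zip structure.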

The basic document structure of a WordProcessingML document consists of the document and body elements, followed by one or more block level elements such as p, which represents a paragraph. A paragraph contains one or more r elements. The r stands for run, which is a region of text with a common set of properties, such as formatting. A run contains one or more t elements. The t element contains a range of text. The following code example shows the WordprocessingML markup for a document that contains the text “Example text.”

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Example text.</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>

[https://docs.microsoft.com/en-us/office/open-xml/how-to-open-and-add-text-to-a-word-processing-document#structure-of-a-wordprocessingml-document]
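
The structure described above can be explored with the XML parser that ships with Java. The following sketch parses the example markup and collects the content of all w:t elements. It is not how Mammoth reads the document; the class name WordprocessingMlText and the method extractText are assumptions for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class WordprocessingMlText {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the content of all w:t elements in the given WordprocessingML markup. */
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document document = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList texts = document.getElementsByTagNameNS(W_NS, "t");
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                result.append(texts.item(i).getTextContent());
            }
            return result.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\">"
                + "<w:body><w:p><w:r><w:t>Example text.</w:t></w:r></w:p></w:body>"
                + "</w:document>";
        System.out.println(extractText(xml)); // prints "Example text."
    }
}
```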

Finding out how the HTML converter actually works

Via debugging I found out how the HTML converter works. In this article I will only give you a global overview and leave discovering all the details up to your own debugging sessions.

Via the code in the main method, we can tell the starting point is the convertToHtml method from class DocumentConverter.

From there you eventually end up in the following convertToHtml method from class InternalDocumentConverter.

The main steps of the HTML converter are reflected in the code above:

  • Read the Word document (including all the children of its w:body element)
  • Convert the document to a list of HTML elements
  • Write the HTML elements to a string via a StringBuilder

And I added to this:

  • Write the string to a file

I will now show some important parts of the code, together with examples of data, received while debugging the code.

Method readElement of class DocumentXmlReader

In the DocumentXmlReader class, the w:body is read and from the children a new Document object is created and returned.

Method readElements of class StatefulBodyXmlReader

From the bodyReader.readElements method you end up in class StatefulBodyXmlReader with a call to the readElements method.

To get a feeling for how the converter works, I placed a breakpoint in the readElements method, started debugging and used the following evaluate expression to determine the content of parameter nodes:

((ArrayList) nodes).stream().forEach(System.out::println)

The nodes size is 791. In the IntelliJ console, I checked the result from the evaluate expression.
In the result I searched for the text “API requests are tied to either a” and found the XmlElement containing it.

XmlElement(name=w:p, attributes={w:rsidRPr=00313F9F, w:rsidR=0039426C, w:rsidRDefault=0039426C, w:rsidP=00313F9F}, children=[XmlElement(name=w:pPr, attributes={}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value=API requests are tied to either a )])]), XmlElement(name=w:r, attributes={w:rsidRPr=007B6C2E}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:b, attributes={}, children=[]), XmlElement(name=w:bCs, attributes={}, children=[]), XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=0070C0}, children=[])]), XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=normal user)])]), XmlElement(name=w:r, attributes={w:rsidRPr=00B626AB}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=0070C0}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value= )])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value=or a )])]), XmlElement(name=w:r, attributes={w:rsidRPr=007B6C2E}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:b, attributes={}, children=[]), XmlElement(name=w:bCs, attributes={}, children=[]), XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=7030A0}, children=[])]), 
XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=service account)])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.)])])])
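
The run properties visible in this output (w:b, w:i and w:color inside w:rPr) can also be read with the standard Java XML parser. The sketch below does this for a shortened version of the paragraph. It is an illustration only and not Mammoth's own reading logic; the class name RunPropertiesSketch and the method describeRuns are assumptions. As a simplification, the mere presence of w:b or w:i is treated as "on", whereas a real document can switch a property off with a w:val attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RunPropertiesSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one description per w:r element: bold, italic, color and text. */
    static List<String> describeRuns(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)));
            NodeList runs = document.getElementsByTagNameNS(W_NS, "r");
            List<String> descriptions = new ArrayList<>();
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                // Simplification: presence of the element counts as "on" (w:val is ignored).
                boolean bold = run.getElementsByTagNameNS(W_NS, "b").getLength() > 0;
                boolean italic = run.getElementsByTagNameNS(W_NS, "i").getLength() > 0;
                NodeList colors = run.getElementsByTagNameNS(W_NS, "color");
                String color = colors.getLength() > 0
                        ? ((Element) colors.item(0)).getAttributeNS(W_NS, "val")
                        : "none";
                NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
                String text = texts.getLength() > 0 ? texts.item(0).getTextContent() : "";
                descriptions.add("bold=" + bold + ", italic=" + italic
                        + ", color=" + color + ", text=" + text);
            }
            return descriptions;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:rPr><w:i/></w:rPr>"
                + "<w:t xml:space=\"preserve\">API requests are tied to either a </w:t></w:r>"
                + "<w:r><w:rPr><w:b/><w:bCs/><w:i/><w:color w:val=\"0070C0\"/></w:rPr>"
                + "<w:t>normal user</w:t></w:r>"
                + "</w:p>";
        for (String description : describeRuns(xml)) {
            System.out.println(description);
        }
    }
}
```

For the second run this reports bold and italic text with color 0070C0, which corresponds to the blue "normal user" text in the Word document.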

Method readElement of class StatefulBodyXmlReader

In this class the mapping of the Office Open XML elements, like paragraph (w:p) and run (w:r) takes place.

As mentioned earlier, the basic document structure of a WordProcessingML document consists of the document and body elements, followed by one or more block level elements such as p, which represents a paragraph. A paragraph contains one or more r elements. The r stands for run, which is a region of text with a common set of properties, such as formatting. A run contains one or more t elements. The t element contains a range of text.

Description | XML element | Method in class StatefulBodyXmlReader
document | w:document |
body | w:body |
paragraph | w:p | ReadResult readParagraph(XmlElement element)
run | w:r | ReadResult readRun(XmlElement element)
text | w:t | ReadResult success(DocumentElement element)

So, a Document object is created, containing Paragraph objects, in turn containing Run objects, etc.

Here is an example of a Paragraph object being created:

Here is an example of a Run object being created:

The Run object is important with regard to properties, such as formatting. For example it knows if the children are in Bold, Italic, etc.
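
To illustrate the idea of a Run that carries formatting properties, here is a deliberately simplified model. It is not Mammoth's actual class design: the classes Run and Paragraph below, their fields and the method tagNamesFor are assumptions for this sketch, and the tag order it produces is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DocumentModelSketch {

    /** A region of text with a common set of formatting properties. */
    static final class Run {
        final boolean bold;
        final boolean italic;
        final String text;

        Run(boolean bold, boolean italic, String text) {
            this.bold = bold;
            this.italic = italic;
            this.text = text;
        }
    }

    /** A paragraph is an ordered list of runs. */
    static final class Paragraph {
        final List<Run> children;

        Paragraph(List<Run> children) {
            this.children = children;
        }
    }

    /** Returns the HTML tag names a run would be wrapped in, given the configured tags. */
    static List<String> tagNamesFor(Run run, String boldTag, String italicTag) {
        List<String> tagNames = new ArrayList<>();
        if (run.bold) {
            tagNames.add(boldTag);
        }
        if (run.italic) {
            tagNames.add(italicTag);
        }
        return tagNames;
    }

    public static void main(String[] args) {
        Paragraph paragraph = new Paragraph(Arrays.asList(
                new Run(false, true, "API requests are tied to either a "),
                new Run(true, true, "normal user")));
        for (Run run : paragraph.children) {
            // "b" and "i" mirror the style mappings "b => b" and "i => i" used earlier.
            System.out.println(tagNamesFor(run, "b", "i") + " " + run.text);
        }
    }
}
```

Passing "strong" and "em" instead of "b" and "i" would mirror the converter's default behavior for bold and italic text.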

Constructor of class DocumentXmlReader

As mentioned before, in the DocumentXmlReader class, the w:body is read and from the children a new Document object is created and returned.

To get a feeling for how the converter works, I placed a breakpoint in the Document constructor, started debugging and navigated through the list of Paragraph objects. The children size is 790.

Below you can see a certain Paragraph, having 2 children of type Run, and you can see the Text and if it’s in bold (isBold = true) for example.

The Paragraph shown above relates to the following part in my Word document:

Method write of class Html

As mentioned before in the convertToHtml method from class InternalDocumentConverter the class Html is used, with a call to the write method.

To get a feeling for how the converter works, I placed a breakpoint in the write method, started debugging and navigated through the list of HtmlElement objects. The nodes size is 592.

Below you can see a certain HtmlElement, having 2 children of types HtmlTextNode and HtmlElement, and you can see the Text and if it’s in bold (tag / tagNames / 0 = “b”) for example.

Method write of class HtmlWriter

As mentioned before in the write method from class Html the class HtmlWriter is used, with a call to the write method.

In this method the node parameter is transformed into an HTML start tag and end tag containing the text.
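
The general idea of writing a tree of HTML nodes to a string via a StringBuilder can be sketched as follows. This is not Mammoth's HtmlWriter: the types Node, TextNode and Element and the helper methods below are assumptions for this example, and attributes and self-closing tags are left out.

```java
import java.util.Arrays;
import java.util.List;

class HtmlWriterSketch {

    /** A node that can append its HTML representation to a builder. */
    interface Node {
        void write(StringBuilder builder);
    }

    static final class TextNode implements Node {
        final String value;

        TextNode(String value) {
            this.value = value;
        }

        @Override
        public void write(StringBuilder builder) {
            builder.append(escape(value));
        }
    }

    static final class Element implements Node {
        final String tagName;
        final List<Node> children;

        Element(String tagName, List<Node> children) {
            this.tagName = tagName;
            this.children = children;
        }

        @Override
        public void write(StringBuilder builder) {
            // Start tag, then all children recursively, then the matching end tag.
            builder.append('<').append(tagName).append('>');
            for (Node child : children) {
                child.write(builder);
            }
            builder.append("</").append(tagName).append('>');
        }
    }

    static Node text(String value) {
        return new TextNode(value);
    }

    static Node element(String tagName, Node... children) {
        return new Element(tagName, Arrays.asList(children));
    }

    /** Escapes the characters that have a special meaning in HTML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String write(List<Node> nodes) {
        StringBuilder builder = new StringBuilder();
        for (Node node : nodes) {
            node.write(builder);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        Node paragraph = element("p",
                text("API requests are tied to either a "),
                element("b", text("normal user")));
        System.out.println(write(Arrays.asList(paragraph)));
    }
}
```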

With the current Java code, when I run it, and after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):

So now it’s time to conclude this article. I shared the steps I took to further automate the manual steps I still had to take every time I wrote a blog article. Via debugging, I gave a global overview of how the HTML converter (Mammoth, created by Michael Williamson) works. In part 2 of this article I will dive deeper into the code and share the changes I made to tackle some of the differences between the manually changed HTML from WordPress and the HTML converter output, using the Word document of my previous blog article as input.

About Author

Marc, active in IT (and with Oracle) since 1995, is a Principal Oracle SOA Consultant with a focus on Oracle Cloud, Oracle Service Bus, Oracle SOA Suite, Oracle Database (SQL & PL/SQL) and Java, Docker, Kubernetes, Minikube and Helm. He is an Oracle SOA Suite 12c Certified Implementation Specialist. Over the past 20 years he has worked for several customers in the Netherlands. Marc likes to share his knowledge through publications, blogs and presentations.
