At AMIS, gaining and sharing information and knowledge is part of our DNA, which is why the AMIS Technology Blog has been around for over 15 years now. Together with my colleagues, I regularly write blog articles. We have used WordPress from the beginning, but the other tools we use have changed over the years; in recent years we have used Open Live Writer, for example. Although some colleagues start in Open Live Writer, I prefer to start in Word (for reasons like reuse, export facilities and printing). However, getting all the formatting of the text, and especially of the code samples, correct in the published blog is quite some work, because when text is copied from Word to Open Live Writer you lose some of the formatting, such as colors and Text Box borders.
Over the years I have used a helper tool that I wrote in Java to help me in this process, for example by replacing a picture placeholder (text like 1.jpg) with the actual File URL used by WordPress (https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg), including the width and height.
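The helper tool itself is not shown in this article. As a rough illustration of the placeholder replacement described above, here is a minimal sketch that uses only the JDK. The class and method names, the rule that a placeholder is a line containing nothing but digits followed by .jpg, and the width and height values are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImagePlaceholderSketch {

    // Assumption: a placeholder is a whole line such as "1.jpg" or "12.jpg".
    private static final Pattern PLACEHOLDER = Pattern.compile("(?m)^(\\d+)\\.jpg$");

    /** Replaces every placeholder line with an img tag that points at urlPrefix + number + ".jpg". */
    static String replacePlaceholders(String text, String urlPrefix, int width, int height) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String img = "<img src=\"" + urlPrefix + matcher.group(1) + ".jpg\""
                    + " width=\"" + width + "\" height=\"" + height + "\" />";
            // quoteReplacement keeps any '$' or '\' in the URL from being read as a group reference.
            matcher.appendReplacement(out, Matcher.quoteReplacement(img));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String prefix = "https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_";
        // The width and height used here are made-up values.
        System.out.println(replacePlaceholders("Some text\n1.jpg\nMore text", prefix, 600, 400));
    }
}
```

A file name that appears in the middle of a sentence is left alone, because the pattern only matches complete lines.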
In this article, I will share the steps I took to further automate the manual steps I still had to take every time I wrote a blog article.
After searching the Internet, I came across a Java tool that converts Word documents to simple and clean HTML: Mammoth, created by Michael Williamson.
[https://github.com/mwilliamson/java-mammoth]
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There’s a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
- Headings.
- Lists.
- Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
- Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
- Footnotes and endnotes.
- Images.
- Bold, italics, underlines, strikethrough, superscript and subscript.
- Links.
- Line breaks.
- Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
- Comments.
Because I mostly use Styles, this tool looked appropriate for my goal.
To find out how it works, I used the Word document of my most recently published article on the AMIS Technology Blog.
[https://technology.amis.nl/]
The readme.md gave me a good starting point. Of course, I first had to figure out how everything works.
From GitHub I downloaded the Java sources zip file: java-mammoth-master.zip
[https://github.com/mwilliamson/java-mammoth.git]
In IntelliJ I created a new Project with dependencies managed by Maven. In the pom.xml I added the dependency mentioned in the readme.md. I also made sure that the correct Java compiler version (see: https://github.com/mwilliamson/java-mammoth/blob/master/pom.xml) is being used:
[in bold, I highlighted the changes]
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>mygroupid1</groupId>
    <artifactId>myproject1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.zwobble.mammoth</groupId>
            <artifactId>mammoth</artifactId>
            <version>1.4.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
I extracted the sources from the zip file (java-mammoth-master.zip\java-mammoth-master\src\main\java) into the src\main\java directory of my IntelliJ project.
Next I created a Java Class with a main method. Then I copied the example from the readme.md into the main method and fixed the imports that were missing and the unhandled exception.
[in bold, I highlighted the changes]
import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.File;
import java.io.IOException;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {
        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("document.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion
    }
}
Then I copied my blog article Word document (renamed to myarticle.docx) to a local directory and changed the Java code accordingly:
[in bold, I highlighted the changes]
import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.File;
import java.io.IOException;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {
        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion
    }
}
Next, I added code to write the generated HTML to a file (named myarticle.html):
[in bold, I highlighted the changes]
import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {
        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        // Write the generated HTML to a file next to the Word document
        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
Content of my previous blog article
Before finding out how this converter actually works, first let’s focus a bit on the content of my previous blog article. Here you can see a part of it in my Word document:
If you copy this part of the document into Open Live Writer, it looks like this:
So, you lose quite some formatting!
In the actual published blog article this looks like:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]
The HTML used in WordPress looks like:
That’s because I manually made it like that, as I explained at the beginning of this article.
With the current Java code, when I run it, the output is:
And after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):
As you may notice the manually changed HTML from WordPress is very different from the HTML converter output. For example:
- there are a lot of paragraph tags (<p>)
- the strong importance tag <strong> is used for bold font weight
- the emphasized text tag <em> is used for italic font style
- text colors like blue or purple are absent
Making changes to the default style mappings
I wanted to tackle some of these differences. Again, the readme.md gave me a good starting point.
By default, Mammoth maps some common .docx styles to HTML elements.
…
By default, bold text is wrapped in <strong> tags. This behaviour can be changed by adding a style mapping for b.
…
By default, italic text is wrapped in <em> tags. This behaviour can be changed by adding a style mapping for i.
[https://github.com/mwilliamson/java-mammoth]
In the table below I summarized some of the behavior of the converter with regard to style.
Style | Default behavior of converter | Behavior wanted by me | Solution |
Bold | wrapped in <strong> tags | wrapped in <b> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("b => b"); |
Italic | wrapped in <em> tags | wrapped in <i> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("i => i"); |
Underline | is ignored since underlining can be confused with links in HTML documents | wrapped in <u> tags | DocumentConverter converter = new DocumentConverter().addStyleMap("u => u"); |
Strikethrough | is wrapped in <s> tags | default behavior | – |
Comments | are ignored | default behavior | – |
In order to get rid of the paragraph tags, for now I opted for using the replaceAll method on the output of the converter.
I also added code to implement the style mappings mentioned above:
[in bold, I highlighted the changes]
import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {
        // Map bold, italic and underline to <b>, <i> and <u> instead of the defaults
        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u");
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML

        // Drop the paragraph tags; each closing tag becomes a line break
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\n");

        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
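The paragraph-tag clean-up can be tried out in isolation, without a Word document. The following is a minimal sketch with made-up class and method names. It uses String.replace, which replaces literal text, instead of replaceAll, which interprets its first argument as a regular expression; for these two fixed strings the result is the same.

```java
final class ParagraphTagSketch {

    /** Removes opening paragraph tags and turns each closing paragraph tag into a line break. */
    static String stripParagraphTags(String html) {
        // Only the bare "<p>" is matched; a tag with attributes, such as <p class="x">, is left as-is.
        return html.replace("<p>", "").replace("</p>", "\n");
    }

    public static void main(String[] args) {
        // Prints "First" and "<b>Second</b>" on separate lines.
        System.out.print(stripParagraphTags("<p>First</p><p><b>Second</b></p>"));
    }
}
```

Plain string replacement is enough here because, with the style mappings used above, the paragraphs in the output are expected to be plain <p> tags. If a style mapping ever adds a class to a paragraph, this approach would need to be revisited.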
With the current Java code, when I run it, and after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):
Office Open XML
Before diving into the code, it is useful to have some understanding of Office Open XML.
Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
[https://en.wikipedia.org/wiki/Office_Open_XML]
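Because a .docx file is a zip archive, its XML can be inspected with nothing more than the JDK. The sketch below reads a single entry from such an archive. The class and method names are made up for this example, and it assumes that the main document part is stored under its conventional name word/document.xml and is encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxEntrySketch {

    /** Returns the content of the named zip entry as UTF-8 text, or null if there is no such entry. */
    static String readEntry(InputStream docx, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int read;
                    // read() returns -1 at the end of the current entry
                    while ((read = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, read);
                    }
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }
}
```

Calling readEntry(Files.newInputStream(Paths.get("myarticle.docx")), "word/document.xml") should then return the WordprocessingML markup of the article.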
The basic document structure of a WordProcessingML document consists of the document and body elements, followed by one or more block level elements such as p, which represents a paragraph. A paragraph contains one or more r elements. The r stands for run, which is a region of text with a common set of properties, such as formatting. A run contains one or more t elements. The t element contains a range of text. The following code example shows the WordprocessingML markup for a document that contains the text “Example text.”
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Example text.</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>
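To make the paragraph, run and text structure more tangible, here is a minimal sketch that walks this markup with the DOM parser from the JDK and collects the text of each paragraph. It is not how Mammoth reads a document; the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WordTextSketch {

    private static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:p element, one line per paragraph. */
    static String extractText(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up elements by namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(documentXml)));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = document.getElementsByTagNameNS(W_NS, "p");
        for (int p = 0; p < paragraphs.getLength(); p++) {
            Element paragraph = (Element) paragraphs.item(p);
            // The w:t elements sit inside the w:r (run) elements of the paragraph.
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int t = 0; t < texts.getLength(); t++) {
                text.append(texts.item(t).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }
}
```

For the markup above, extractText returns the single line "Example text." followed by a line break. Paragraphs nested inside other paragraphs are not treated specially in this sketch.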
Finding out how the HTML converter actually works
Via debugging I found out how the HTML converter works. In this article I will only give you a global overview and leave discovering all the details up to your own debugging sessions.
Via the code in the main method, we can tell the starting point is the convertToHtml method from class DocumentConverter.
From there you eventually end up in the following convertToHtml method from class InternalDocumentConverter.
The main steps of the HTML converter are reflected in the code above:
- Read the Word document (including all the children of its w:body element)
- Convert the document to a list of HTML elements
- Write the HTML element to a string via a StringBuilder
And I added to this:
- Write the string to a file
I will now show some important parts of the code, together with examples of data, received while debugging the code.
Method readElement of class DocumentXmlReader
In the DocumentXmlReader class, the w:body is read and from the children a new Document object is created and returned.
Method readElements of class StatefulBodyXmlReader
From the bodyReader.readElements method you end up in class StatefulBodyXmlReader with a call to the readElements method.
To get a feeling for how the converter works, I placed a breakpoint in the readElements method, started debugging and used the following evaluate expression to inspect the content of the nodes parameter:
((ArrayList) nodes).stream().forEach(System.out::println)
The nodes size is 791. In the IntelliJ console, I checked the result from the evaluate expression.
In the result I searched for the text “API requests are tied to either a” and found the XmlElement containing it:
XmlElement(name=w:p, attributes={w:rsidRPr=00313F9F, w:rsidR=0039426C, w:rsidRDefault=0039426C, w:rsidP=00313F9F}, children=[XmlElement(name=w:pPr, attributes={}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value=API requests are tied to either a )])]), XmlElement(name=w:r, attributes={w:rsidRPr=007B6C2E}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:b, attributes={}, children=[]), XmlElement(name=w:bCs, attributes={}, children=[]), XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=0070C0}, children=[])]), XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=normal user)])]), XmlElement(name=w:r, attributes={w:rsidRPr=00B626AB}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=0070C0}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value= )])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={{http://www.w3.org/XML/1998/namespace}space=preserve}, children=[XmlTextNode(value=or a )])]), XmlElement(name=w:r, attributes={w:rsidRPr=007B6C2E}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:b, attributes={}, children=[]), XmlElement(name=w:bCs, attributes={}, children=[]), XmlElement(name=w:i, attributes={}, children=[]), XmlElement(name=w:color, attributes={w:val=7030A0}, children=[])]), 
XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=service account)])]), XmlElement(name=w:r, attributes={w:rsidRPr=00313F9F}, children=[XmlElement(name=w:rPr, attributes={}, children=[XmlElement(name=w:i, attributes={}, children=[])]), XmlElement(name=w:t, attributes={}, children=[XmlTextNode(value=, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.)])])])
Method readElement of class StatefulBodyXmlReader
In this class the mapping of the Office Open XML elements, like paragraph (w:p) and run (w:r) takes place.
As mentioned earlier, the basic document structure of a WordProcessingML document consists of the document and body elements, followed by one or more block level elements such as p, which represents a paragraph. A paragraph contains one or more r elements. The r stands for run, which is a region of text with a common set of properties, such as formatting. A run contains one or more t elements. The t element contains a range of text.
Description | XML element | Method in class StatefulBodyXmlReader |
document | w:document | |
body | w:body | |
paragraph | w:p | ReadResult readParagraph(XmlElement element) |
run | w:r | ReadResult readRun(XmlElement element) |
text | w:t | ReadResult success(DocumentElement element) |
So, a Document object is created, containing Paragraph objects, in turn containing Run objects, etc.
Here is an example of a Paragraph object being created:
Here is an example of a Run object being created:
The Run object is important with regard to properties such as formatting. For example, it knows whether its children are bold, italic, etc.
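The idea of a run that carries its own formatting flags can be illustrated with a small stand-alone model. The sketch below is not Mammoth's code: the Run class, its fields and the readRuns method are hypothetical stand-ins, and only bold and italic are handled. It reads the w:r elements of one paragraph with the DOM parser from the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RunModelSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Hypothetical stand-in for a run: its text plus two formatting flags. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Reads all w:r elements of a single w:p element. */
    static List<Run> readRuns(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Element paragraph = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)))
                .getDocumentElement();

        List<Run> runs = new ArrayList<>();
        NodeList runNodes = paragraph.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runNodes.getLength(); i++) {
            Element run = (Element) runNodes.item(i);
            runs.add(new Run(textOf(run), isOn(run, "b"), isOn(run, "i")));
        }
        return runs;
    }

    private static String textOf(Element run) {
        StringBuilder text = new StringBuilder();
        NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            text.append(texts.item(i).getTextContent());
        }
        return text.toString();
    }

    // A toggle such as w:b counts as on when it is present, unless w:val is "false" or "0".
    private static boolean isOn(Element run, String toggle) {
        NodeList nodes = run.getElementsByTagNameNS(W_NS, toggle);
        if (nodes.getLength() == 0) {
            return false;
        }
        String val = ((Element) nodes.item(0)).getAttributeNS(W_NS, "val");
        return !("false".equals(val) || "0".equals(val));
    }
}
```

Fed with the first two runs of the paragraph from the debug output shown earlier, this yields an italic run with the text "API requests are tied to either a " and a bold, italic run with the text "normal user". The w:color element is ignored here, just like the colors are absent from the converter output.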
Constructor of class DocumentXmlReader
As mentioned before, in the DocumentXmlReader class, the w:body is read and from the children a new Document object is created and returned.
To get a feeling for how the converter works, I placed a breakpoint in the Document constructor, started debugging and navigated through the list of Paragraph objects. The children size is 790.
Below you can see a certain Paragraph, having 2 children of type Run, and you can see the Text and if it’s in bold (isBold = true) for example.
The Paragraph shown above relates to the following part in my Word document:
Method write of class Html
As mentioned before, the convertToHtml method of class InternalDocumentConverter uses the class Html, with a call to its write method.
To get a feeling for how the converter works, I placed a breakpoint in the write method, started debugging and navigated through the list of HtmlElement objects. The nodes size is 592.
Below you can see a certain HtmlElement, having 2 children of types HtmlTextNode and HtmlElement, and you can see the Text and if it’s in bold (tag / tagNames / 0 = “b”) for example.
Method write of class HtmlWriter
The write method of class Html in turn uses the class HtmlWriter, with a call to its write method.
In this method the node parameter is transformed into an HTML start tag and end tag containing the text.
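The principle of writing a tree of nodes as start tag, children and end tag via a StringBuilder can be shown with a toy model. This is a sketch of the general technique only, not Mammoth's HtmlWriter: the Node interface and the text, element and write methods are hypothetical names, and attributes are left out.

```java
final class HtmlWriterSketch {

    /** Hypothetical node type: anything that can append itself to the output. */
    interface Node {
        void writeTo(StringBuilder out);
    }

    /** A text node; its value is escaped when written. */
    static Node text(final String value) {
        return out -> out.append(escape(value));
    }

    /** An element node: start tag, then all children, then end tag. */
    static Node element(final String tagName, final Node... children) {
        return out -> {
            out.append('<').append(tagName).append('>');
            for (Node child : children) {
                child.writeTo(out);
            }
            out.append("</").append(tagName).append('>');
        };
    }

    // The ampersand is replaced first so that the other replacements are not escaped twice.
    static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes all nodes, in order, into one string. */
    static String write(Node... nodes) {
        StringBuilder out = new StringBuilder();
        for (Node node : nodes) {
            node.writeTo(out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: <p>tied to either a <b>normal user</b></p>
        System.out.println(write(
                element("p", text("tied to either a "), element("b", text("normal user")))));
    }
}
```

The element with a text node and a nested b element mirrors the HtmlElement with two children that showed up in the debugger.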
With the current Java code, when I run it, and after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):
So now it’s time to conclude this article. I shared the steps I took to further automate the manual steps I still had to take every time I wrote a blog article. Via debugging, I gave a global overview of how the HTML converter (Mammoth, created by Michael Williamson) works. In part 2 of this article I will dive deeper into the code and share the changes I made to tackle some of the differences between the manually changed HTML from WordPress and the HTML converter output, using the Word document of my previous blog article as input.