Writing a blog in Word, automating HTML formatting by using a .docx to HTML converter for Java and publishing the blog via WordPress (part 3)

Marc Lameriks

In parts 1 and 2 of this article, I shared the steps I took to further automate the manual work I still had to do every time I wrote a blog article.
[https://technology.amis.nl/2020/03/10/writing-a-blog-in-word-automating-html-formatting-by-using-a-docx-to-html-converter-for-java-and-publishing-the-blog-via-wordpress-part-2/]

In this article I will focus on unordered HTML lists (<ul> and <li>), HTML tables (<table>), HTML images (<img>) and hyperlinks.

As I mentioned in parts 1 and 2 of this article, the manually changed HTML from WordPress is very different from the HTML converter output. For example:

  • There are a lot of paragraph tags (<p>)
  • The emphasized text tag <strong> is used for bold font weight
  • The emphasized text tag <em> is used for italic font style
  • Text colors like blue or purple are absent

I described how I was able to create the following tags, respectively for the styles “Code”, “Heading” and “Output”:

For better readability, I added the CSS overflow property to the style attribute of the HTML pre tag. I used auto as the value, so that the content is clipped and scroll bars are added when necessary.

<pre style="font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px; overflow:auto;">

I created a Java Class with a main method.

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n"); 
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}

As I mentioned in part 1 of this article, the main steps of the HTML converter are:

  • Read the Word document (including all the children of its w:body element)
  • Convert the document to a list of HTML elements
  • Write the HTML element to a string via a StringBuilder

And I added to this:

  • Write the string to a file

As I mentioned in part 2 of this article, when constructing html, I wanted to be able to:

  • Replace a whitespace with &nbsp;
  • Have a begin tag followed by a new line
  • Set the begin tag indent level
  • Have an end tag followed by a new line
  • Have an end tag preceded by a new line

I described how I changed class HtmlTag, in order to handle that formatting:

public HtmlTag(List<String> tagNames, Map<String, String> attributes, boolean isCollapsible, String separator, boolean replaceWhitespaces, boolean beginTagFollowedByNewLine, int beginTagIndentLevel, boolean endTagFollowedByNewLine, boolean endTagPrecededByNewLine)

So, besides the separator parameter, I added the following parameters:

Type    | Parameter                 | Description                             | Effect
boolean | replaceWhitespaces        | Replace a whitespace with &nbsp;        | &nbsp;
boolean | beginTagFollowedByNewLine | Have a begin tag followed by a new line | <begin tag>CRLF
int     | beginTagIndentLevel       | Set the begin tag indent level          |   <begin tag>
boolean | endTagFollowedByNewLine   | Have an end tag followed by a new line  | <end tag>CRLF
boolean | endTagPrecededByNewLine   | Have an end tag preceded by a new line  | CRLF<end tag>
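
The effect of these five parameters can be sketched in isolation. The class below is not the changed Mammoth code; TagFormatDemo and its render method are hypothetical names, and the two spaces per indent level are an assumption made only for this illustration.

```java
class TagFormatDemo {

    // Hypothetical helper (not part of Mammoth): renders one element
    // according to the five formatting parameters described above.
    static String render(String tagName, String content,
                         boolean replaceWhitespaces,
                         boolean beginTagFollowedByNewLine,
                         int beginTagIndentLevel,
                         boolean endTagFollowedByNewLine,
                         boolean endTagPrecededByNewLine) {
        StringBuilder builder = new StringBuilder();
        // Assumption for this sketch: two spaces per indent level
        for (int i = 0; i < beginTagIndentLevel; i++) {
            builder.append("  ");
        }
        builder.append("<").append(tagName).append(">");
        if (beginTagFollowedByNewLine) {
            builder.append("\r\n");
        }
        builder.append(replaceWhitespaces ? content.replace(" ", "&nbsp;") : content);
        if (endTagPrecededByNewLine) {
            builder.append("\r\n");
        }
        builder.append("</").append(tagName).append(">");
        if (endTagFollowedByNewLine) {
            builder.append("\r\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Same flag values as used for the "Code" style: true, true, 0, true, true
        System.out.print(render("pre", "int x = 1;", true, true, 0, true, true));
    }
}
```

With the flag values used for the "Code" style, the content ends up on its own line between the begin and end tag, and every space in the content becomes &nbsp;.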

The unordered HTML list (<ul> and <li>)

In order to use the formatting functionality for an unordered HTML list, I changed the code in method visit(Paragraph paragraph, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Paragraph paragraph, Context context) {
            Supplier<List<HtmlNode>> children = () -> {
                List<HtmlNode> content = convertChildrenToHtml(paragraph, context);
                return preserveEmptyParagraphs ? cons(Html.FORCE_WRITE, content) : content;
            };
            HtmlPath mapping = styleMap.getParagraphHtmlPath(paragraph)
                .orElseGet(() -> {
                    if (paragraph.getStyle().isPresent()) {
                        warnings.add("Unrecognised paragraph style: " + paragraph.getStyle().get().describe());
                    }
                    return HtmlPath.element("p");
                });
            if (paragraph.getStyle().isPresent()) {
                System.out.println("****Paragraph style: " + paragraph.getStyle().get().getName().get());
                if (paragraph.getStyle().get().getName().get().equals("Code")) {
                    mapping = HtmlPath.collapsibleElement("pre", map("style", "font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px;"), "\r\n", true, true, 0, true, true);
                } else if (paragraph.getStyle().get().getName().get().equals("Output")) {
                    mapping = HtmlPath.collapsibleElement("a", map("style", "font-family: Courier New; color: black;"), "\r\n", true, true, 0, true, true);
                } else if (paragraph.getStyle().get().getName().get().equals("List Paragraph")) {
                    if (paragraph.getNumbering().isPresent()) {
                        int indentLevel = Integer.parseInt(paragraph.getNumbering().get().getLevelIndex());
                        if (paragraph.getNumbering().get().getLevelIndex().equalsIgnoreCase("0")) {
                            HtmlTag tagUL = new HtmlTag(list("ul"), map(), true, "", false, true, indentLevel, true, false);
                            HtmlPathElement pathElementUL = new HtmlPathElement(tagUL);
                            HtmlTag tagLI = new HtmlTag(list("li"), map(), false, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLI = new HtmlPathElement(tagLI);
                            mapping = new HtmlPathElements(list(pathElementUL, pathElementLI));
                        } else if (paragraph.getNumbering().get().getLevelIndex().equalsIgnoreCase("1")) {
                            HtmlTag tagULOL = new HtmlTag(list("ul", "ol"), map(), true, "", false, true, indentLevel, true, false);
                            HtmlPathElement pathElementULOL = new HtmlPathElement(tagULOL);
                            HtmlTag tagLICollapsible = new HtmlTag(list("li"), map(), true, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLICollapsible = new HtmlPathElement(tagLICollapsible);
                            HtmlTag tagUL = new HtmlTag(list("ul"), map(), true, "", false, true, indentLevel,false,false);
                            HtmlPathElement pathElementUL = new HtmlPathElement(tagUL);
                            HtmlTag tagLI = new HtmlTag(list("li"), map(), false, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLI = new HtmlPathElement(tagLI);
                            mapping = new HtmlPathElements(list(pathElementULOL, pathElementLICollapsible, pathElementUL, pathElementLI));
                        } else {
                            System.out.println("List Paragraph with indent level " + indentLevel + " not yet supported");
                        }
                    }
                }
            }
            return mapping.wrap(children).get();
        }

In the table below you can see an overview of the configured formatting functionality:

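LevelIndex | Tags                               | Parameter                 | Value | Effect
0          | ul (collapsible), li               | replaceWhitespaces        | false |
0          | ul (collapsible)                   | beginTagFollowedByNewLine | true  | <ul>CRLF
0          | li                                 | beginTagFollowedByNewLine | false | <li>
0          | ul (collapsible), li               | beginTagIndentLevel       | 0     |   <begin tag>
0          | ul (collapsible), li               | endTagFollowedByNewLine   | true  | <end tag>CRLF
0          | ul (collapsible), li               | endTagPrecededByNewLine   | false | <end tag>
1          | ul (collapsible), ol (collapsible) | replaceWhitespaces        | false |
1          | ul (collapsible), ol (collapsible) | beginTagFollowedByNewLine | true  | <begin tag>CRLF
1          | ul (collapsible), ol (collapsible) | beginTagIndentLevel       | 1     |   <begin tag>
1          | ul (collapsible), ol (collapsible) | endTagFollowedByNewLine   | true  | <end tag>CRLF
1          | ul (collapsible), ol (collapsible) | endTagPrecededByNewLine   | false | <end tag>
1          | li (collapsible)                   | replaceWhitespaces        | false |
1          | li (collapsible)                   | beginTagFollowedByNewLine | false | <li>
1          | li (collapsible)                   | beginTagIndentLevel       | 1     |   <li>
1          | li (collapsible)                   | endTagFollowedByNewLine   | true  | </li>CRLF
1          | li (collapsible)                   | endTagPrecededByNewLine   | false | </li>
1          | ul (collapsible)                   | replaceWhitespaces        | false |
1          | ul (collapsible)                   | beginTagFollowedByNewLine | true  | <ul>CRLF
1          | ul (collapsible)                   | beginTagIndentLevel       | 1     |
1          | ul (collapsible)                   | endTagFollowedByNewLine   | false | </ul>
1          | ul (collapsible)                   | endTagPrecededByNewLine   | false | </ul>
1          | li                                 | replaceWhitespaces        | false |
1          | li                                 | beginTagFollowedByNewLine | false | <li>
1          | li                                 | beginTagIndentLevel       | 1     |
1          | li                                 | endTagFollowedByNewLine   | true  | </li>CRLF
1          | li                                 | endTagPrecededByNewLine   | false | </li>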

To figure that out I had a look at class DefaultStyles:

        "p:unordered-list(1) => ul > li:fresh",
        "p:unordered-list(2) => ul|ol > li > ul > li:fresh",
        "p:unordered-list(3) => ul|ol > li > ul|ol > li > ul > li:fresh",
        "p:unordered-list(4) => ul|ol > li > ul|ol > li > ul|ol > li > ul > li:fresh",
        "p:unordered-list(5) => ul|ol > li > ul|ol > li > ul|ol > li > ul|ol > li > ul > li:fresh",

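The mappings in DefaultStyles follow a regular pattern: every extra nesting level adds one "ul|ol > li" pair in front of the final "ul > li:fresh". The sketch below only rebuilds those strings to make the pattern visible. UnorderedListPathDemo and unorderedListMapping are hypothetical names that do not exist in Mammoth.

```java
class UnorderedListPathDemo {

    // Hypothetical helper: rebuilds the style mapping string for a given
    // list level (1-based, as in the DefaultStyles entries shown above).
    static String unorderedListMapping(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be 1 or higher");
        }
        StringBuilder path = new StringBuilder();
        // One "ul|ol > li" pair for every enclosing list level
        for (int i = 1; i < level; i++) {
            path.append("ul|ol > li > ");
        }
        // The innermost list is always an unordered list with a fresh item
        path.append("ul > li:fresh");
        return "p:unordered-list(" + level + ") => " + path;
    }

    public static void main(String[] args) {
        for (int level = 1; level <= 5; level++) {
            System.out.println(unorderedListMapping(level));
        }
    }
}
```

Level index 0 in the Word document corresponds to unordered-list(1) and level index 1 to unordered-list(2), which is why the changed visit method builds two path elements for the first case and four for the second.
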
I also changed the visit method in class HtmlWriter:
[in bold, I highlighted the changes]

            @Override
            public void visit(HtmlElement element) {
                String beginTag = "<";
                String endTag = "</";
                if (element.getTagName().equalsIgnoreCase("ul")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 4;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                    if (element.getBeginTagIndentLevel() > 0) {
                        builder.append("\r\n");
                    }
                } else if (element.getTagName().equalsIgnoreCase("li")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                }
                builder.append(beginTag).append(element.getTagName());

                HtmlWriter.generateAttributes(element.getAttributes(), builder);

                if (element.isVoid()) {
                    builder.append(" />");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                } else {
                    builder.append(">");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }

                    if (element.doReplaceWhitespaces()) {
                        StringBuilder childrenBuilder = new StringBuilder();
                        element.getChildren().forEach(child -> write(child, childrenBuilder));
                        builder.append(childrenBuilder.toString().replace(" ", "&nbsp;"));
                    } else {
                        element.getChildren().forEach(child -> write(child, builder));
                    }

                    if (element.getTagName().equalsIgnoreCase("ul")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 4;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    }
                    if (element.isEndTagPrecededByNewLine()) {
                        endTag = "\r\n" + endTag;
                    }
                    builder
                        .append(endTag)
                        .append(element.getTagName())
                        .append(">");

                    if (element.isEndTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                }
            }

In the table below you can see an overview of the configured left padding:

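LevelIndex | Variable | Value | Tag | numberOfSpaces | format string | Effect    | Effect including tag
0          | beginTag | <     | ul  | 1              | "%1$1s"       | <         | <ul>
0          | beginTag | <     | li  | 3              | "%1$3s"       |   <       |   <li>
1          | beginTag | <     | ul  | 5              | "%1$5s"       |     <     |     <ul>
1          | beginTag | <     | li  | 7              | "%1$7s"       |       <   |       <li>
1          | endTag   | </    | ul  | 6              | "%1$6s"       |     </    |     </ul>
0          | endTag   | </    | ul  | 2              | "%1$2s"       | </        | </ul>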
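
The left padding relies on String.format with a width: "%1$5s" right-aligns its argument in a field of five characters, so "<" gets four leading spaces. The following sketch isolates the three formulas from the visit method. LeftPaddingDemo and its method names are made up for this illustration.

```java
class LeftPaddingDemo {

    // Right-aligns the start of a tag ("<" or "</") in a field of the given width
    static String pad(String tagStart, int numberOfSpaces) {
        return String.format("%1$" + numberOfSpaces + "s", tagStart);
    }

    // Same formula as used for the ul begin tag in the visit method
    static String beginUl(int indentLevel) {
        return pad("<", 1 + indentLevel * 4) + "ul>";
    }

    // Same formula as used for the li begin tag in the visit method
    static String beginLi(int indentLevel) {
        return pad("<", 1 + indentLevel * 2 + (indentLevel + 1) * 2) + "li>";
    }

    // Same formula as used for the ul end tag in the visit method
    static String endUl(int indentLevel) {
        return pad("</", 2 + indentLevel * 4) + "ul>";
    }

    public static void main(String[] args) {
        System.out.println(beginUl(0));
        System.out.println(beginLi(0) + "first level</li>");
        System.out.println(beginUl(1));
        System.out.println(beginLi(1) + "second level</li>");
        System.out.println(endUl(1));
        System.out.println(endUl(0));
    }
}
```

Each nesting level shifts the ul tags by four spaces, and the li tags are always two spaces further to the right than the ul they belong to.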

Below you see an example of an unordered HTML list, used in an actual published blog article:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

HTML table

With regard to an HTML table, I wanted to be able to create a tag like:

<table border="1" cellspacing="0" cellpadding="0" width="100%" style="table-layout: fixed; margin-bottom: 10px">

In order to use the formatting functionality for an HTML table, I changed the code in method visit(Table table, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Table table, Context context) {
            HtmlPath mapping = styleMap.getTableHtmlPath(table)
                .orElse(HtmlPath.element("table"));
            mapping = HtmlPath.collapsibleElement("table", map("border", "1", "cellspacing", "0", "cellpadding", "0", "width" , "100%", "style", "table-layout: fixed; margin-bottom: 10px"), "", false, true, 0 , true, false);
            return mapping.wrap(() -> generateTableChildren(table, context)).get();
        }

In the table below you can see an overview of the configured formatting functionality:

Tag   | Parameter                 | Value | Effect
table | replaceWhitespaces        | false |
table | beginTagFollowedByNewLine | true  | <table>CRLF
table | beginTagIndentLevel       | 0     |   <table>
table | endTagFollowedByNewLine   | true  | </table>CRLF
table | endTagPrecededByNewLine   | false | </table>

Because this time the attribute map contained more key-value pairs than the existing map methods accept (compile error: actual and formal argument lists differ in length), I added a method map(K key1, V value1, K key2, V value2, K key3, V value3, K key4, V value4, K key5, V value5) to class org.zwobble.mammoth.internal.util.Maps:

    public static <K, V> Map<K, V> map(K key1, V value1, K key2, V value2, K key3, V value3, K key4, V value4,  K key5, V value5) {
        Map<K, V> map = new HashMap<>();
        map.put(key1, value1);
        map.put(key2, value2);
        map.put(key3, value3);
        map.put(key4, value4);
        map.put(key5, value5);
        return map;
    }
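
An alternative to adding one overload per number of pairs is a single varargs method. This is only a sketch of that idea, not something Mammoth offers: AttributeMaps and mapOf are hypothetical names, and the method is limited to String keys and values, which is all that HTML attributes need.

```java
import java.util.HashMap;
import java.util.Map;

class AttributeMaps {

    // Hypothetical helper: builds a map from alternating keys and values,
    // so that any number of attributes can be passed in one call.
    static Map<String, String> mapOf(String... keysAndValues) {
        if (keysAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of arguments (key, value, ...)");
        }
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            map.put(keysAndValues[i], keysAndValues[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> attributes = mapOf(
            "border", "1",
            "cellspacing", "0",
            "cellpadding", "0",
            "width", "100%",
            "style", "table-layout: fixed; margin-bottom: 10px");
        System.out.println(attributes.size() + " attributes");
    }
}
```

The trade-off is that a missing value is only detected at run time, whereas the fixed overloads are checked by the compiler.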

I wanted to add the formatting functionality for a row in an HTML table. For this I first looked at the current code in method visit(TableRow tableRow, Context context) of class DocumentToHtml:

        @Override
        public List<HtmlNode> visit(TableRow tableRow, Context context) {
            return list(Html.element("tr", Lists.cons(Html.FORCE_WRITE, convertChildrenToHtml(tableRow, context))));
        }

From the method above, the method element of class Html is called:

    public static HtmlNode element(String tagName, List<HtmlNode> children) {
        return element(tagName, map(), children);
    }

In turn, the method above calls the following method element of class Html:

    public static HtmlNode element(String tagName, Map<String, String> attributes, List<HtmlNode> children) {
        return new HtmlElement(new HtmlTag(list(tagName), attributes, false, ""), children);
    }

So, in order to use the formatting functionality for a row in an HTML table, I changed the code in method visit(TableRow tableRow, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(TableRow tableRow, Context context) {
            return list(new HtmlElement(new HtmlTag(list("tr"), map(), false, "", false, true, 0, true, false), Lists.cons(Html.FORCE_WRITE, convertChildrenToHtml(tableRow, context))));
        }

In the table below you can see an overview of the configured formatting functionality:

Tag | Parameter                 | Value | Effect
tr  | replaceWhitespaces        | false |
tr  | beginTagFollowedByNewLine | true  | <tr>CRLF
tr  | beginTagIndentLevel       | 0     |   <tr>
tr  | endTagFollowedByNewLine   | true  | </tr>CRLF
tr  | endTagPrecededByNewLine   | false | </tr>

I also changed the visit method in class HtmlWriter:
[in bold, I highlighted the changes]

            @Override
            public void visit(HtmlElement element) {
                String beginTag = "<";
                String endTag = "</";
                if (element.getTagName().equalsIgnoreCase("ul")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 4;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                    if (element.getBeginTagIndentLevel() > 0) {
                        builder.append("\r\n");
                    }
                } else if (element.getTagName().equalsIgnoreCase("li")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                } else if (element.getTagName().equalsIgnoreCase("tr")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                } else if (element.getTagName().equalsIgnoreCase("th") || element.getTagName().equalsIgnoreCase("td")) {
                    int numberOfSpaces = 1 + 2 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                }
                builder.append(beginTag).append(element.getTagName());

                HtmlWriter.generateAttributes(element.getAttributes(), builder);

                if (element.isVoid()) {
                    builder.append(" />");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                } else {
                    builder.append(">");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }

                    if (element.doReplaceWhitespaces()) {
                        StringBuilder childrenBuilder = new StringBuilder();
                        element.getChildren().forEach(child -> write(child, childrenBuilder));
                        builder.append(childrenBuilder.toString().replace(" ", "&nbsp;"));
                    } else {
                        element.getChildren().forEach(child -> write(child, builder));
                    }

                    if (element.getTagName().equalsIgnoreCase("ul")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 4;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    } else if (element.getTagName().equalsIgnoreCase("tr")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    }
                    if (element.isEndTagPrecededByNewLine()) {
                        endTag = "\r\n" + endTag;
                    }
                    builder
                        .append(endTag)
                        .append(element.getTagName())
                        .append(">");

                    if (element.isEndTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                }
            }

            @Override
            public void visit(HtmlTextNode node) {
                builder.append(HtmlWriter.escapeText(node.getValue()));
            }

            @Override
            public void visit(HtmlForceWrite forceWrite) {
            }
        });
    }

In the table below you can see an overview of the configured left padding:

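LevelIndex | Variable | Value | Tag    | numberOfSpaces | format string | Effect      | Effect including tag
0          | beginTag | <     | tr     | 3              | "%1$3s"       |   <         |   <tr>
0          | beginTag | <     | th, td | 5              | "%1$5s"       |     <       |     <td>
1          | beginTag | <     | tr     | 7              | "%1$7s"       |       <     |       <tr>
1          | beginTag | <     | th, td | 9              | "%1$9s"       |         <   |         <td>
1          | endTag   | </    | tr     | 8              | "%1$8s"       |       </    |       </tr>
0          | endTag   | </    | tr     | 4              | "%1$4s"       |   </        |   </tr>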

In order to check whether the formatting functionality for an HTML table worked as I wanted, I used the Word document of yet another previous blog article as input.
[https://technology.amis.nl/2019/10/14/changing-the-configuration-of-an-oracle-weblogic-domain-deployed-on-a-kubernetes-cluster-using-oracle-weblogic-server-kubernetes-operator-part-1/]

Below you see an example of an HTML table, used in an actual published blog article:

Below is an example of the manually changed HTML from WordPress:

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

In order to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
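
The extra replacement for </td> is presumably needed because the paragraph inside a table cell first loses its <p> tag and then gets a CRLF in place of </p>, which leaves the end tag of the cell on a line of its own. The sketch below applies the same four replacements to a small string to show that effect. PostProcessDemo and postProcess are hypothetical names.

```java
class PostProcessDemo {

    // Applies the same four replacements as method main of class MyClass1
    static String postProcess(String html) {
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        // Move the CRLF left behind by </p> to the position after </td>
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        return html;
    }

    public static void main(String[] args) {
        System.out.print(postProcess("<td><p>Cell text</p></td>"));
    }
}
```

Without the last replacement the cell would be written as <td>Cell text, followed by </td> on the next line.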

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

As you can see, the attributes of the table tag are placed in alphabetical order.
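
To illustrate what alphabetical order means for this table tag, the sketch below sorts the five attributes by name with a TreeMap before writing them. It does not show how the HTML converter writes attributes internally. AttributeOrderDemo and tableTag are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class AttributeOrderDemo {

    // Writes a table begin tag with its attributes sorted by attribute name
    static String tableTag(Map<String, String> attributes) {
        StringBuilder builder = new StringBuilder("<table");
        // A TreeMap iterates over its String keys in alphabetical order
        for (Map.Entry<String, String> attribute : new TreeMap<>(attributes).entrySet()) {
            builder.append(" ")
                .append(attribute.getKey())
                .append("=\"")
                .append(attribute.getValue())
                .append("\"");
        }
        return builder.append(">").toString();
    }

    public static void main(String[] args) {
        // Insertion order as in the changed visit(Table table, Context context) method
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("border", "1");
        attributes.put("cellspacing", "0");
        attributes.put("cellpadding", "0");
        attributes.put("width", "100%");
        attributes.put("style", "table-layout: fixed; margin-bottom: 10px");
        System.out.println(tableTag(attributes));
    }
}
```

Compared with the tag I wanted to create, cellpadding now comes before cellspacing and style before width. A browser renders the table the same way in either order.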

HTML Image

As mentioned in part 1 of this article, over the years I have used a helper tool that I wrote in Java, for example to replace a picture placeholder (text like 1.jpg) with the actual file URL used by WordPress (https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg), including the width and height.

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

Here you can see that the image is base64-encoded and that this data is embedded directly inline.

With regard to an HTML image, with the helper tool I was already able to automatically replace the picture placeholder (text like 1.jpg) with a tag like:

<img src="https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg" alt="" width="1443" height="352" />

My plan was to reuse the Java code from my helper tool and therefore skip the HTML converter formatting functionality for an HTML image.

So, I changed the code in method visit(Image image, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Image image, Context context) {
            // Skip the image here: the img tag is added afterwards in class MyClass1.
            return list();
            // The original implementation is commented out, because Java does not
            // compile statements that follow an unconditional return.
            // // TODO: custom image handlers
            // // TODO: handle empty content type
            // return image.getContentType()
            //     .map(contentType -> {
            //         try {
            //             Map<String, String> attributes = new HashMap<>(imageConverter.convert(new org.zwobble.mammoth.images.Image() {
            //                 @Override
            //                 public Optional<String> getAltText() {
            //                     return image.getAltText();
            //                 }
            //
            //                 @Override
            //                 public String getContentType() {
            //                     return contentType;
            //                 }
            //
            //                 @Override
            //                 public InputStream getInputStream() throws IOException {
            //                     return image.open();
            //                 }
            //             }));
            //             image.getAltText().ifPresent(altText -> attributes.put("alt", altText));
            //             return list(Html.element("img", attributes));
            //         } catch (IOException exception) {
            //             warnings.add(exception.getMessage());
            //             return Lists.<HtmlNode>list();
            //         }
            //     })
            //     .orElse(list());
        }

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

In order to reuse the Java code from my helper tool, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Set;

public class MyClass1 {
    private static HashMap<String, String> imagesMap = new HashMap<String, String>();

    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            readImages();
            writer.write(formatHtml(html));
        }
    }

    private static void readImages() {
        String year = "2020";
        String month = "01";
        String stringFolder = "C:\\My\\My Documents\\AMIS\\";
        File folder = new File(stringFolder + "images\\");

        FilenameFilter imagesFileFilter = new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                if ( name.startsWith("lameriks") && (name.endsWith(".jpg") || name.endsWith(".png")) )
                {
                    return true;
                }
                else
                {
                    return false;
                }
            }
        };

        File[] files = folder.listFiles(imagesFileFilter);
        for (File file : files)
        {
            try{
                File imageFile = new File(file.getAbsolutePath());
                BufferedImage image = ImageIO.read(imageFile);
                String name = file.getName();
                String key = name.substring(name.lastIndexOf("_")+1,name.lastIndexOf("."));
                String value = "<img src=\"https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/" + name + "\" alt=\"\" width=\"" + image.getWidth() + "\" height=\"" + image.getHeight() + "\" />";
                imagesMap.put(key , value);
            } catch (Exception ex){
                ex.printStackTrace();
            }
        }
    }

    private static String formatHtml(String html) {
        StringBuilder formattedHtml = new StringBuilder();
        // split by new line
        String[] lines = html.split("\\r\\n");
        for (String line : lines) {
            int position = line.indexOf(".jpg");
            if (position == -1) {
                position = line.indexOf(".png");
            }
            String imageNumber;
            String imageTag;
            if (position != -1) {
                imageNumber = line.substring(0, position);
                imageTag = imagesMap.get(imageNumber);
                if (imageTag != null) {
                    // First delete the last CRLF. That was the place where the original HTML image was located (with the media type data encoded as base64 and embedded directly inline)
                    formattedHtml.delete(formattedHtml.length() - 2, formattedHtml.length());
                    formattedHtml.append(imageTag);
                } else {
                    formattedHtml.append(line);
                }
            } else {
                formattedHtml.append(line);
            }
            formattedHtml.append("\r\n");
        }
        return formattedHtml.toString();
    }
}

For simplicity, I hard-coded the year and month variables in the code above. In real life, I use other directories and command-line arguments.
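
The string handling in readImages and formatHtml can be tried out without any image files. The sketch below isolates it. ImagePlaceholderDemo and its method names are hypothetical, and the width and height are passed in as plain numbers instead of being read with ImageIO.

```java
import java.util.Optional;

class ImagePlaceholderDemo {

    // Key under which an image is stored: the part of the file name
    // between the last underscore and the extension
    static String keyFor(String fileName) {
        return fileName.substring(fileName.lastIndexOf("_") + 1, fileName.lastIndexOf("."));
    }

    // Builds the img tag in the same way as readImages does
    static String imageTag(String year, String month, String fileName, int width, int height) {
        return "<img src=\"https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/" + fileName
            + "\" alt=\"\" width=\"" + width + "\" height=\"" + height + "\" />";
    }

    // Extracts the image number from a placeholder line such as "1.jpg",
    // using the same indexOf logic as formatHtml
    static Optional<String> placeholderNumber(String line) {
        int position = line.indexOf(".jpg");
        if (position == -1) {
            position = line.indexOf(".png");
        }
        return position == -1 ? Optional.empty() : Optional.of(line.substring(0, position));
    }

    public static void main(String[] args) {
        String fileName = "lameriks_2020_01_1.jpg";
        System.out.println(keyFor(fileName));
        System.out.println(imageTag("2020", "01", fileName, 1443, 352));
        System.out.println(placeholderNumber("1.jpg").orElse("no placeholder"));
    }
}
```

Because the key of lameriks_2020_01_1.jpg and the number taken from the placeholder line 1.jpg are both "1", the lookup in imagesMap finds the matching img tag.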

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

HTML Links – Hyperlinks

As mentioned in part 1 of this article, over the years I have used a helper tool that I wrote in Java, for example to replace a hyperlink placeholder (text like [http://www.a.b.c.html]) with an actual hyperlink. That way, I can skip the extra action needed in Word.

Below you can see an example of a hyperlink placeholder (in subscript) in my Word document:

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

Again, my plan was to reuse the Java code from my helper tool.

With the helper tool, I was already able to automatically replace the hyperlink place holder (text like [http://www.a.b.c.html]) with a tag like:

<a href="http://www.a.b.c.html"><sub>[http://www.a.b.c.html]</sub></a>
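
Seen in isolation, this replacement can be sketched with a regular expression. This is only a minimal sketch and not the code of my helper tool: the class name PlaceholderLinker and the method linkPlaceholders are made up for illustration, and it assumes the place holder arrives in the HTML as <sub>[url]</sub>.

```java
import java.util.regex.Pattern;

class PlaceholderLinker {

    // Matches a subscript place holder such as <sub>[http://www.a.b.c.html]</sub>
    // and captures the text between the square brackets (without surrounding whitespace).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<sub>\\[\\s*([^\\]\\s]+)\\s*\\]</sub>");

    // Replaces every place holder in the line with an anchor tag; all other text is kept as-is.
    static String linkPlaceholders(String line) {
        return PLACEHOLDER.matcher(line).replaceAll("<a href=\"$1\"><sub>[$1]</sub></a>");
    }

    public static void main(String[] args) {
        System.out.println(linkPlaceholders("See <sub>[http://www.a.b.c.html]</sub> for details."));
    }
}
```

A regular expression keeps any text before and after the place holder on the same line, at the cost of being a bit harder to read than plain indexOf and substring calls.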

In order to reuse the Java code from my helper tool, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Set;

public class MyClass1 {
    private static HashMap<String, String> imagesMap = new HashMap<String, String>();

    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            readImages();
            writer.write(formatHtml(html));
        }
    }

    private static void readImages() {
        String year = "2020";
        String month = "01";
        String stringFolder = "C:\\My\\My Documents\\AMIS\\";
        File folder = new File(stringFolder + "images\\");

        FilenameFilter imagesFileFilter = new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                // Only accept my own uploaded images: lameriks*.jpg or lameriks*.png
                return name.startsWith("lameriks") && (name.endsWith(".jpg") || name.endsWith(".png"));
            }
        };

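        File[] files = folder.listFiles(imagesFileFilter);
        if (files == null) {
            // listFiles returns null when the folder does not exist or cannot be read
            return;
        }
        for (File file : files)
        {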
            try{
                File imageFile = new File(file.getAbsolutePath());
                BufferedImage image = ImageIO.read(imageFile);
                String name = file.getName();
                String key = name.substring(name.lastIndexOf("_")+1,name.lastIndexOf("."));
                String value = "<img src=\"https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/" + name + "\" alt=\"\" width=\"" + image.getWidth() + "\" height=\"" + image.getHeight() + "\" />";
                imagesMap.put(key , value);
            } catch (Exception ex){
                ex.printStackTrace();
            }
        }
    }

    private static String formatHtml(String html) {
        StringBuilder formattedHtml = new StringBuilder();
        // split by new line
        String[] lines = html.split("\\r\\n");
        for (String line : lines) {
            int position = line.indexOf(".jpg");
            if (position == -1) {
                position = line.indexOf(".png");
            }
            String imageNumber;
            String imageTag;
            String subText;
            if (position != -1) {
                imageNumber = line.substring(0, position);
                imageTag = imagesMap.get(imageNumber);
                if (imageTag != null) {
                    // First delete the last CRLF. That was the place where the original HTML image was located (with the media type data encoded as base64 and embedded directly inline)
                    formattedHtml.delete(formattedHtml.length() - 2, formattedHtml.length());
                    formattedHtml.append(imageTag);
                } else {
                    formattedHtml.append(line);
                }
            } else {
                int startPosition = line.indexOf("<sub>");
                int endPosition = line.indexOf("</sub>");
                if (startPosition != -1 && endPosition != -1) {
                    // Take the text between <sub> and </sub> and strip the square brackets, leaving only the URL
                    subText = line.substring(startPosition + "<sub>".length(), endPosition);
                    subText = subText.replace("[", "");
                    subText = subText.replace("]", "");
                    // Note: the whole line is replaced, so the place holder has to be on a line of its own
                    formattedHtml.append("<a href=\"" + subText + "\"><sub>[" + subText + "]</sub></a>");
                } else {
                    formattedHtml.append(line);
                }
            }
            formattedHtml.append("\r\n");
        }
        return formattedHtml.toString();
    }
}

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

After that, I compared the manually changed HTML from WordPress with the HTML converter output and found that the differences were reduced even further. Overall, the result is good enough for the time being.

So now it’s time to conclude this final part of the article. I shared with you the steps I took to further automate the manual steps I still had to take every time I wrote a blog article. My goal is of course to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, when using the Word document of a previous blog article as input.
In part 1, I gave, via debugging, a global overview of how the HTML converter (called Mammoth, created by Michael Williamson) works. In part 2, I focused on text color, text font and using custom style maps for my Word styles “Code”, “Heading” and “Output”. In this final part of the article, I focused on unordered HTML lists (<ul> and <li>), HTML tables (<table>), HTML images (<img>) and hyperlinks.

I have already used my version of the HTML converter to publish articles and am pleased with the (time-saving) results so far. Of course, depending on the content of my future articles, I will need to make further changes to the code.
