Writing a blog in Word, automating HTML formatting by using a .docx to HTML converter for Java and publishing the blog via WordPress (part 3)

In parts 1 and 2 of this article, I shared the steps I took to further automate the manual work I still had to do every time I wrote a blog article.
[https://technology.amis.nl/2020/03/10/writing-a-blog-in-word-automating-html-formatting-by-using-a-docx-to-html-converter-for-java-and-publishing-the-blog-via-wordpress-part-2/]

In this article I will focus on unordered HTML lists (<ul> and <li>), HTML tables (<table>), HTML images (<img>) and hyperlinks.

As I mentioned in parts 1 and 2 of this article, the manually changed HTML from WordPress is very different from the HTML converter output. For example:

  • There are a lot of paragraph tags (<p>)
  • The strong importance tag <strong> is used for bold font weight
  • The emphasized text tag <em> is used for italic font style
  • Text colors like blue or purple are absent

I described how I was able to create the following tags, respectively for the styles “Code”, “Heading” and “Output”:

For better readability, I added the CSS overflow property to the style attribute of the HTML pre tag. I used auto as the value, so that the content is clipped and scroll bars are added when necessary.

<pre style="font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px; overflow:auto;">

I created a Java Class with a main method.

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n"); 
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}

As I mentioned in part 1 of this article, the main steps of the HTML converter are:

  • Read the Word document (including all the children of its w:body element)
  • Convert the document to a list of HTML elements
  • Write the HTML elements to a string via a StringBuilder

And I added to this:

  • Write the string to a file

As I mentioned in part 2 of this article, when constructing HTML, I wanted to be able to:

  • Replace a whitespace with &nbsp;
  • Have a begin tag followed by a new line
  • Set the begin tag indent level
  • Have an end tag followed by a new line
  • Have an end tag preceded by a new line

I described how I changed class HtmlTag, in order to handle that formatting:

public HtmlTag(List<String> tagNames, Map<String, String> attributes, boolean isCollapsible, String separator, boolean replaceWhitespaces, boolean beginTagFollowedByNewLine, int beginTagIndentLevel, boolean endTagFollowedByNewLine, boolean endTagPrecededByNewLine)

So, besides the separator parameter, I added the following parameters:

Type | Parameter | Description | Effect
boolean | replaceWhitespaces | Replace a whitespace with &nbsp; | &nbsp;
boolean | beginTagFollowedByNewLine | Have a begin tag followed by a new line | <begin tag>CRLF
int | beginTagIndentLevel | Set the begin tag indent level |   <begin tag>
boolean | endTagFollowedByNewLine | Have an end tag followed by a new line | <end tag>CRLF
boolean | endTagPrecededByNewLine | Have an end tag preceded by a new line | CRLF<end tag>
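
As an illustration of the five parameters above, the following minimal sketch renders a single tag. It is not the converter's own code: the class name TagFormatSketch, the render method and the fixed four spaces per indent level are assumptions made only for this example.

```java
// Hypothetical sketch: not part of the converter. Names and the 4-space indent are assumptions.
final class TagFormatSketch {
    private static final String CRLF = "\r\n";

    static String render(String tagName, String content,
                         boolean replaceWhitespaces,
                         boolean beginTagFollowedByNewLine,
                         int beginTagIndentLevel,
                         boolean endTagFollowedByNewLine,
                         boolean endTagPrecededByNewLine) {
        StringBuilder builder = new StringBuilder();
        // Indent the begin tag (assumption: 4 spaces per indent level).
        for (int i = 0; i < beginTagIndentLevel * 4; i++) {
            builder.append(' ');
        }
        builder.append('<').append(tagName).append('>');
        if (beginTagFollowedByNewLine) {
            builder.append(CRLF);
        }
        // Optionally replace every space in the content with a non-breaking space entity.
        builder.append(replaceWhitespaces ? content.replace(" ", "&nbsp;") : content);
        if (endTagPrecededByNewLine) {
            builder.append(CRLF);
        }
        builder.append("</").append(tagName).append('>');
        if (endTagFollowedByNewLine) {
            builder.append(CRLF);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Same flag values as used for the "Code" style: true, true, 0, true, true.
        System.out.print(render("pre", "ls -l", true, true, 0, true, true));
    }
}
```

With all flags set, the content ends up on its own line between the begin tag and the end tag, which is the layout wanted for the "Code" style.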

The unordered HTML list (<ul> and <li>)

To use the formatting functionality for an unordered HTML list, I changed the code in method visit(Paragraph paragraph, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Paragraph paragraph, Context context) {
            Supplier<List<HtmlNode>> children = () -> {
                List<HtmlNode> content = convertChildrenToHtml(paragraph, context);
                return preserveEmptyParagraphs ? cons(Html.FORCE_WRITE, content) : content;
            };
            HtmlPath mapping = styleMap.getParagraphHtmlPath(paragraph)
                .orElseGet(() -> {
                    if (paragraph.getStyle().isPresent()) {
                        warnings.add("Unrecognised paragraph style: " + paragraph.getStyle().get().describe());
                    }
                    return HtmlPath.element("p");
                });
            if (paragraph.getStyle().isPresent()) {
                System.out.println("****Paragraph style: " + paragraph.getStyle().get().getName().get());
                if (paragraph.getStyle().get().getName().get().equals("Code")) {
                    mapping = HtmlPath.collapsibleElement("pre", map("style", "font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px;"), "\r\n", true, true, 0, true, true);
                } else if (paragraph.getStyle().get().getName().get().equals("Output")) {
                    mapping = HtmlPath.collapsibleElement("a", map("style", "font-family: Courier New; color: black;"), "\r\n", true, true, 0, true, true);
                } else if (paragraph.getStyle().get().getName().get().equals("List Paragraph")) {
                    if (paragraph.getNumbering().isPresent()) {
                        int indentLevel = Integer.parseInt(paragraph.getNumbering().get().getLevelIndex());
                        if (paragraph.getNumbering().get().getLevelIndex().equalsIgnoreCase("0")) {
                            HtmlTag tagUL = new HtmlTag(list("ul"), map(), true, "", false, true, indentLevel, true, false);
                            HtmlPathElement pathElementUL = new HtmlPathElement(tagUL);
                            HtmlTag tagLI = new HtmlTag(list("li"), map(), false, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLI = new HtmlPathElement(tagLI);
                            mapping = new HtmlPathElements(list(pathElementUL, pathElementLI));
                        } else if (paragraph.getNumbering().get().getLevelIndex().equalsIgnoreCase("1")) {
                            HtmlTag tagULOL = new HtmlTag(list("ul", "ol"), map(), true, "", false, true, indentLevel, true, false);
                            HtmlPathElement pathElementULOL = new HtmlPathElement(tagULOL);
                            HtmlTag tagLICollapsible = new HtmlTag(list("li"), map(), true, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLICollapsible = new HtmlPathElement(tagLICollapsible);
                            HtmlTag tagUL = new HtmlTag(list("ul"), map(), true, "", false, true, indentLevel,false,false);
                            HtmlPathElement pathElementUL = new HtmlPathElement(tagUL);
                            HtmlTag tagLI = new HtmlTag(list("li"), map(), false, "", false, false, indentLevel, true, false);
                            HtmlPathElement pathElementLI = new HtmlPathElement(tagLI);
                            mapping = new HtmlPathElements(list(pathElementULOL, pathElementLICollapsible, pathElementUL, pathElementLI));
                        } else {
                            System.out.println("List Paragraph with indent level " + indentLevel + " not yet supported");
                        }
                    }
                }
            }
            return mapping.wrap(children).get();
        }

In the table below you can see an overview of the configured formatting functionality:

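LevelIndex | Tags | Parameter | Value | Effect
0 | ul (collapsible), li | replaceWhitespaces | false |
0 | ul (collapsible) | beginTagFollowedByNewLine | true | <ul>CRLF
0 | li | beginTagFollowedByNewLine | false | <li>
0 | ul (collapsible), li | beginTagIndentLevel | 0 | <begin tag>
0 | ul (collapsible), li | endTagFollowedByNewLine | true | <end tag>CRLF
0 | ul (collapsible), li | endTagPrecededByNewLine | false | <end tag>
1 | ul (collapsible), ol (collapsible) | replaceWhitespaces | false |
1 | ul (collapsible), ol (collapsible) | beginTagFollowedByNewLine | true | <begin tag>CRLF
1 | ul (collapsible), ol (collapsible) | beginTagIndentLevel | 1 |   <begin tag>
1 | ul (collapsible), ol (collapsible) | endTagFollowedByNewLine | true | <end tag>CRLF
1 | ul (collapsible), ol (collapsible) | endTagPrecededByNewLine | false | <end tag>
1 | li (collapsible) | replaceWhitespaces | false |
1 | li (collapsible) | beginTagFollowedByNewLine | false | <li>
1 | li (collapsible) | beginTagIndentLevel | 1 |   <li>
1 | li (collapsible) | endTagFollowedByNewLine | true | </li>CRLF
1 | li (collapsible) | endTagPrecededByNewLine | false | </li>
1 | ul (collapsible) | replaceWhitespaces | false |
1 | ul (collapsible) | beginTagFollowedByNewLine | true | <ul>CRLF
1 | ul (collapsible) | beginTagIndentLevel | 1 |
1 | ul (collapsible) | endTagFollowedByNewLine | false | </ul>
1 | ul (collapsible) | endTagPrecededByNewLine | false | </ul>
1 | li | replaceWhitespaces | false |
1 | li | beginTagFollowedByNewLine | false | <li>
1 | li | beginTagIndentLevel | 1 |
1 | li | endTagFollowedByNewLine | true | </li>CRLF
1 | li | endTagPrecededByNewLine | false | </li>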

To figure that out I had a look at class DefaultStyles:

        "p:unordered-list(1) => ul > li:fresh",
        "p:unordered-list(2) => ul|ol > li > ul > li:fresh",
        "p:unordered-list(3) => ul|ol > li > ul|ol > li > ul > li:fresh",
        "p:unordered-list(4) => ul|ol > li > ul|ol > li > ul|ol > li > ul > li:fresh",
        "p:unordered-list(5) => ul|ol > li > ul|ol > li > ul|ol > li > ul|ol > li > ul > li:fresh",

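The entries above follow a regular pattern: every extra nesting level adds one "ul|ol > li > " prefix. The sketch below reproduces that pattern for an arbitrary level. The class and method names are hypothetical and the code is only an illustration of the pattern, not something taken from class DefaultStyles. Note that level index 0 in the visit method appears to correspond to unordered-list(1) here.

```java
// Hypothetical sketch: rebuilds the style mapping strings listed above for a 1-based level.
final class UnorderedListMappingSketch {
    static String mappingFor(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        StringBuilder path = new StringBuilder();
        // Each enclosing level may be an unordered or an ordered list.
        for (int i = 1; i < level; i++) {
            path.append("ul|ol > li > ");
        }
        // The innermost level is always an unordered list with a fresh list item.
        path.append("ul > li:fresh");
        return "p:unordered-list(" + level + ") => " + path;
    }

    public static void main(String[] args) {
        for (int level = 1; level <= 3; level++) {
            System.out.println(mappingFor(level));
        }
    }
}
```
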
I also changed the visit method in class HtmlWriter:
[in bold, I highlighted the changes]

            @Override
            public void visit(HtmlElement element) {
                String beginTag = "<";
                String endTag = "</";
                if (element.getTagName().equalsIgnoreCase("ul")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 4;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                    if (element.getBeginTagIndentLevel() > 0) {
                        builder.append("\r\n");
                    }
                } else if (element.getTagName().equalsIgnoreCase("li")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                }
                builder.append(beginTag).append(element.getTagName());

                HtmlWriter.generateAttributes(element.getAttributes(), builder);

                if (element.isVoid()) {
                    builder.append(" />");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                } else {
                    builder.append(">");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }

                    if (element.doReplaceWhitespaces()) {
                        StringBuilder childrenBuilder = new StringBuilder();
                        element.getChildren().forEach(child -> write(child, childrenBuilder));
                        builder.append(childrenBuilder.toString().replace(" ", "&nbsp;"));
                    } else {
                        element.getChildren().forEach(child -> write(child, builder));
                    }

                    if (element.getTagName().equalsIgnoreCase("ul")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 4;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    }
                    if (element.isEndTagPrecededByNewLine()) {
                        endTag = "\r\n" + endTag;
                    }
                    builder
                        .append(endTag)
                        .append(element.getTagName())
                        .append(">");

                    if (element.isEndTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                }
            }

In the table below you can see an overview of the configured left padding:

LevelIndex | Variable | Value | Tag | numberOfSpaces | Format string | Effect | Effect including tag
0 | beginTag | < | ul | 1 | "%1$1s" | "<" | "<ul>"
0 | beginTag | < | li | 3 | "%1$3s" | "  <" | "  <li>"
1 | beginTag | < | ul | 5 | "%1$5s" | "    <" | "    <ul>"
1 | beginTag | < | li | 7 | "%1$7s" | "      <" | "      <li>"
1 | endTag | </ | ul | 6 | "%1$6s" | "    </" | "    </ul>"
0 | endTag | </ | ul | 2 | "%1$2s" | "</" | "</ul>"
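
The left padding relies on the fact that a format string such as "%1$5s" right-justifies its argument in a field of 5 characters. The stand-alone sketch below applies the same width formulas as the visit method above to level index 0 and 1. The class and method names are hypothetical.

```java
// Hypothetical sketch: the width formulas are copied from the visit method above.
final class ListPaddingSketch {
    // Right-justifies text in a field of the given width, padding with spaces on the left.
    static String pad(String text, int width) {
        return String.format("%1$" + width + "s", text);
    }

    static String ulBegin(int level) {
        return pad("<", 1 + level * 4);
    }

    static String liBegin(int level) {
        return pad("<", 1 + level * 2 + (level + 1) * 2);
    }

    static String ulEnd(int level) {
        return pad("</", 2 + level * 4);
    }

    public static void main(String[] args) {
        for (int level = 0; level <= 1; level++) {
            // Quotes are printed so that the leading spaces are visible.
            System.out.println("\"" + ulBegin(level) + "ul>\"");
            System.out.println("\"" + liBegin(level) + "li>\"");
            System.out.println("\"" + ulEnd(level) + "ul>\"");
        }
    }
}
```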

Below you see an example of an unordered HTML list, used in an actual published blog article:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]

[Image: lameriks 2020 04 1]

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

[Image: lameriks 2020 04 2]

HTML table

With regard to an HTML table, I wanted to be able to create a tag like:

<table border="1" cellspacing="0" cellpadding="0" width="100%" style="table-layout: fixed; margin-bottom: 10px">

To use the formatting functionality for an HTML table, I changed the code in method visit(Table table, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Table table, Context context) {
            HtmlPath mapping = styleMap.getTableHtmlPath(table)
                .orElse(HtmlPath.element("table"));
            mapping = HtmlPath.collapsibleElement("table", map("border", "1", "cellspacing", "0", "cellpadding", "0", "width" , "100%", "style", "table-layout: fixed; margin-bottom: 10px"), "", false, true, 0 , true, false);
            return mapping.wrap(() -> generateTableChildren(table, context)).get();
        }

In the table below you can see an overview of the configured formatting functionality:

Tag | Parameter | Value | Effect
table | replaceWhitespaces | false |
table | beginTagFollowedByNewLine | true | <table>CRLF
table | beginTagIndentLevel | 0 | <table>
table | endTagFollowedByNewLine | true | </table>CRLF
table | endTagPrecededByNewLine | false | </table>

This time the map of key-value pairs for the attributes contained too many elements for the existing map methods (the compiler reported that the actual and formal argument lists differ in length). Therefore I added a method map(K key1, V value1, K key2, V value2, K key3, V value3, K key4, V value4, K key5, V value5) to class org.zwobble.mammoth.internal.util.Maps:

    public static <K, V> Map<K, V> map(K key1, V value1, K key2, V value2, K key3, V value3, K key4, V value4,  K key5, V value5) {
        Map<K, V> map = new HashMap<>();
        map.put(key1, value1);
        map.put(key2, value2);
        map.put(key3, value3);
        map.put(key4, value4);
        map.put(key5, value5);
        return map;
    }
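
As an aside, an extra overload is needed for every additional number of key-value pairs. A possible alternative, which is not what I used above, is a varargs helper. The sketch below is a stand-alone illustration with hypothetical names; it is restricted to String keys and values and keeps the insertion order by using a LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical alternative: one varargs method instead of one overload per number of pairs.
final class AttributeMapSketch {
    static Map<String, String> attributes(String... keysAndValues) {
        if (keysAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of arguments (key, value, ...)");
        }
        Map<String, String> map = new LinkedHashMap<>();
        // Arguments alternate between key and value.
        for (int i = 0; i < keysAndValues.length; i += 2) {
            map.put(keysAndValues[i], keysAndValues[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(attributes(
            "border", "1",
            "cellspacing", "0",
            "cellpadding", "0",
            "width", "100%",
            "style", "table-layout: fixed; margin-bottom: 10px"));
    }
}
```

The trade-off is that a missing value is only detected at run time, whereas the fixed-arity overloads are checked by the compiler.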

I also wanted to add the formatting functionality for a row in an HTML table. For this I first looked at the current code in method visit(TableRow tableRow, Context context) of class DocumentToHtml:

        @Override
        public List<HtmlNode> visit(TableRow tableRow, Context context) {
            return list(Html.element("tr", Lists.cons(Html.FORCE_WRITE, convertChildrenToHtml(tableRow, context))));
        }

From the method above, the method element of class Html is called:

    public static HtmlNode element(String tagName, List<HtmlNode> children) {
        return element(tagName, map(), children);
    }

In turn, the method above calls the following method element of class Html:

    public static HtmlNode element(String tagName, Map<String, String> attributes, List<HtmlNode> children) {
        return new HtmlElement(new HtmlTag(list(tagName), attributes, false, ""), children);
    }

So, to use the formatting functionality for a row in an HTML table, I changed the code in method visit(TableRow tableRow, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(TableRow tableRow, Context context) {
            return list(new HtmlElement(new HtmlTag(list("tr"), map(), false, "", false, true, 0, true, false), Lists.cons(Html.FORCE_WRITE, convertChildrenToHtml(tableRow, context))));
        }

In the table below you can see an overview of the configured formatting functionality:

Tag | Parameter | Value | Effect
tr | replaceWhitespaces | false |
tr | beginTagFollowedByNewLine | true | <tr>CRLF
tr | beginTagIndentLevel | 0 |   <tr>
tr | endTagFollowedByNewLine | true | </tr>CRLF
tr | endTagPrecededByNewLine | false | </tr>

I also changed the visit method in class HtmlWriter:
[in bold, I highlighted the changes]

            @Override
            public void visit(HtmlElement element) {
                String beginTag = "<";
                String endTag = "</";
                if (element.getTagName().equalsIgnoreCase("ul")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 4;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                    if (element.getBeginTagIndentLevel() > 0) {
                        builder.append("\r\n");
                    }
                } else if (element.getTagName().equalsIgnoreCase("li")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                } else if (element.getTagName().equalsIgnoreCase("tr")) {
                    int numberOfSpaces = 1 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                } else if (element.getTagName().equalsIgnoreCase("th") || element.getTagName().equalsIgnoreCase("td")) {
                    int numberOfSpaces = 1 + 2 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                    beginTag = String.format("%1$" + numberOfSpaces + "s", beginTag);
                }
                builder.append(beginTag).append(element.getTagName());

                HtmlWriter.generateAttributes(element.getAttributes(), builder);

                if (element.isVoid()) {
                    builder.append(" />");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                } else {
                    builder.append(">");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }

                    if (element.doReplaceWhitespaces()) {
                        StringBuilder childrenBuilder = new StringBuilder();
                        element.getChildren().forEach(child -> write(child, childrenBuilder));
                        builder.append(childrenBuilder.toString().replace(" ", "&nbsp;"));
                    } else {
                        element.getChildren().forEach(child -> write(child, builder));
                    }

                    if (element.getTagName().equalsIgnoreCase("ul")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 4;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    } else if (element.getTagName().equalsIgnoreCase("tr")) {
                        int numberOfSpaces = 2 + element.getBeginTagIndentLevel() * 2 + (element.getBeginTagIndentLevel() + 1) * 2;
                        endTag = String.format("%1$" + numberOfSpaces + "s", endTag);
                    }
                    if (element.isEndTagPrecededByNewLine()) {
                        endTag = "\r\n" + endTag;
                    }
                    builder
                        .append(endTag)
                        .append(element.getTagName())
                        .append(">");

                    if (element.isEndTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                }
            }

            @Override
            public void visit(HtmlTextNode node) {
                builder.append(HtmlWriter.escapeText(node.getValue()));
            }

            @Override
            public void visit(HtmlForceWrite forceWrite) {
            }
        });
    }

In the table below you can see an overview of the configured left padding:

LevelIndex | Variable | Value | Tag | numberOfSpaces | Format string | Effect | Effect including tag
0 | beginTag | < | tr | 3 | "%1$3s" | "  <" | "  <tr>"
0 | beginTag | < | th, td | 5 | "%1$5s" | "    <" | "    <td>"
1 | beginTag | < | tr | 7 | "%1$7s" | "      <" | "      <tr>"
1 | beginTag | < | th, td | 9 | "%1$9s" | "        <" | "        <td>"
1 | endTag | </ | tr | 8 | "%1$8s" | "      </" | "      </tr>"
0 | endTag | </ | tr | 4 | "%1$4s" | "  </" | "  </tr>"
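
To show how these widths combine, the sketch below assembles one table row with a single cell at indent level 0, using the width formulas for tr and td from the visit method above. It is a stand-alone illustration with hypothetical names and leaves out everything else the writer does.

```java
// Hypothetical sketch: assembles one indented table row with the width formulas shown above.
final class TableRowPaddingSketch {
    private static final String CRLF = "\r\n";

    // Right-justifies text in a field of the given width.
    static String pad(String text, int width) {
        return String.format("%1$" + width + "s", text);
    }

    static String row(String cellText, int level) {
        String trBegin = pad("<", 1 + level * 2 + (level + 1) * 2) + "tr>";
        String tdBegin = pad("<", 1 + 2 + level * 2 + (level + 1) * 2) + "td>";
        String trEnd = pad("</", 2 + level * 2 + (level + 1) * 2) + "tr>";
        return trBegin + CRLF
            + tdBegin + cellText + "</td>" + CRLF
            + trEnd + CRLF;
    }

    public static void main(String[] args) {
        System.out.print(row("cell", 0));
    }
}
```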

To check whether the formatting functionality for an HTML table worked as I wanted, I used the Word document of yet another previous blog article as input.
[https://technology.amis.nl/2019/10/14/changing-the-configuration-of-an-oracle-weblogic-domain-deployed-on-a-kubernetes-cluster-using-oracle-weblogic-server-kubernetes-operator-part-1/]

Below you see an example of an HTML table, used in an actual published blog article:

[Image: lameriks 2020 04 3]

Below is an example of the manually changed HTML from WordPress:

[Image: lameriks 2020 04 4]

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

[Image: lameriks 2020 04 5]

In order to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
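
The effect of these replacements can be tried out in isolation. The sketch below applies them to a small sample string; the class name, the method name and the sample input are hypothetical. Because none of the search strings contains a regular expression metacharacter, String.replace gives the same result here as replaceAll.

```java
// Hypothetical sketch: the four replacements from method main, applied to a sample string.
final class PostProcessSketch {
    static String postProcess(String html) {
        return html
            .replace("</h2>", "</h2>\r\n")      // new line after every heading
            .replace("<p>", "")                  // drop paragraph begin tags
            .replace("</p>", "\r\n")            // paragraph end tags become new lines
            .replace("\r\n</td>", "</td>\r\n"); // move the new line to after the cell end tag
    }

    public static void main(String[] args) {
        System.out.print(postProcess("<td><p>abc</p></td>"));
    }
}
```

Without the last replacement, the end tag of the table cell would start on a new line instead of directly following the cell content.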

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

[Image: lameriks 2020 04 6]

As you can see, the attributes of the table tag are written in alphabetical order.
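
One way such an ordering can come about is that the attribute names are sorted before they are written. Whether the converter does exactly this is an assumption; it is not shown in the code above. The stand-alone sketch below, with hypothetical names, sorts the attributes of the table tag by name with a TreeMap. Escaping of attribute values is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: writes attributes sorted by name, independent of insertion order.
final class SortedAttributesSketch {
    static String render(Map<String, String> attributes) {
        StringBuilder builder = new StringBuilder();
        // A TreeMap iterates over its keys in natural (alphabetical) order.
        for (Map.Entry<String, String> attribute : new TreeMap<>(attributes).entrySet()) {
            builder.append(' ')
                .append(attribute.getKey())
                .append("=\"")
                .append(attribute.getValue())
                .append('"');
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("border", "1");
        attributes.put("cellspacing", "0");
        attributes.put("cellpadding", "0");
        attributes.put("width", "100%");
        attributes.put("style", "table-layout: fixed; margin-bottom: 10px");
        System.out.println("<table" + render(attributes) + ">");
    }
}
```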

HTML Image

As mentioned in part 1 of this article, over the years I have used a helper tool that I wrote in Java, for example to replace a picture placeholder (text like 1.jpg) with the actual file URL used by WordPress (https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg), including the width and height.

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

[Image: lameriks 2020 04 7]

Here you can see that the image data is base64-encoded and embedded directly inline in the HTML.
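
As a rough illustration of what such an inline image looks like, the sketch below builds a data URI from a content type and a few bytes. It is a stand-alone example with hypothetical names and sample bytes, not the converter's image handling code.

```java
import java.util.Base64;

// Hypothetical sketch: builds an inline (data URI) image source from raw bytes.
final class InlineImageSketch {
    static String dataUri(String contentType, byte[] imageBytes) {
        // The bytes are base64-encoded and prefixed with the media type.
        return "data:" + contentType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        byte[] sampleBytes = {1, 2, 3}; // placeholder bytes, not a real image
        System.out.println("<img src=\"" + dataUri("image/png", sampleBytes) + "\" />");
    }
}
```

For a real screenshot the encoded string easily runs to many thousands of characters, which makes the generated HTML hard to read and edit.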

With regard to an HTML image, the helper tool already allowed me to automatically replace the picture placeholder (text like 1.jpg) with a tag like:

<img src="https://technology.amis.nl/wp-content/uploads/2020/01/lameriks_2020_01_1.jpg" alt="" width="1443" height="352" />

My plan was to reuse the Java code from my helper tool and therefore skip the HTML converter formatting functionality for an HTML image.

So, I changed the code in method visit(Image image, Context context) of class DocumentToHtml, thereby overriding the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Image image, Context context) {
            // Skip the converter's own image handling; the <img> tags are added afterwards in MyClass1.
            return list();
            /* Original implementation, disabled because statements after the return would be unreachable:
            // TODO: custom image handlers
            // TODO: handle empty content type
            return image.getContentType()
                .map(contentType -> {
                    try {
                        Map<String, String> attributes = new HashMap<>(imageConverter.convert(new org.zwobble.mammoth.images.Image() {
                            @Override
                            public Optional<String> getAltText() {
                                return image.getAltText();
                            }

                            @Override
                            public String getContentType() {
                                return contentType;
                            }

                            @Override
                            public InputStream getInputStream() throws IOException {
                                return image.open();
                            }
                        }));
                        image.getAltText().ifPresent(altText -> attributes.put("alt", altText));
                        return list(Html.element("img", attributes));
                    } catch (IOException exception) {
                        warnings.add(exception.getMessage());
                        return Lists.<HtmlNode>list();
                    }
                })
                .orElse(list());
            */
        }

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

[Image: lameriks 2020 04 8]

In order to reuse the Java code from my helper tool, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Set;

public class MyClass1 {
    private static HashMap<String, String> imagesMap = new HashMap<String, String>();

    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            readImages();
            writer.write(formatHtml(html));
        }
    }

    private static void readImages() {
        String year = "2020";
        String month = "01";
        String stringFolder = "C:\\My\\My Documents\\AMIS\\";
        File folder = new File(stringFolder + "images\\");

        FilenameFilter imagesFileFilter = new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                if ( name.startsWith("lameriks") && (name.endsWith(".jpg") || name.endsWith(".png")) )
                {
                    return true;
                }
                else
                {
                    return false;
                }
            }
        };

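        File[] files = folder.listFiles(imagesFileFilter);
        if (files == null) {
            // listFiles returns null when the folder does not exist or cannot be read
            return;
        }
        for (File file : files)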
        {
            try{
                File imageFile = new File(file.getAbsolutePath());
                BufferedImage image = ImageIO.read(imageFile);
                String name = file.getName();
                String key = name.substring(name.lastIndexOf("_")+1,name.lastIndexOf("."));
                String value = "<img src=\"https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/" + name + "\" alt=\"\" width=\"" + image.getWidth() + "\" height=\"" + image.getHeight() + "\" />";
                imagesMap.put(key , value);
            } catch (Exception ex){
                ex.printStackTrace();
            }
        }
    }

    private static String formatHtml(String html) {
        StringBuilder formattedHtml = new StringBuilder();
        // split by new line
        String[] lines = html.split("\\r\\n");
        for (String line : lines) {
            int position = line.indexOf(".jpg");
            if (position == -1) {
                position = line.indexOf(".png");
            }
            String imageNumber;
            String imageTag;
            if (position != -1) {
                imageNumber = line.substring(0, position);
                imageTag = imagesMap.get(imageNumber);
                if (imageTag != null) {
                    // First delete the last CRLF. That was the place where the original HTML image was located (with the media type data encoded as base64 and embedded directly inline)
                    formattedHtml.delete(formattedHtml.length() - 2, formattedHtml.length());
                    formattedHtml.append(imageTag);
                } else {
                    formattedHtml.append(line);
                }
            } else {
                formattedHtml.append(line);
            }
            formattedHtml.append("\r\n");
        }
        return formattedHtml.toString();
    }
}

For simplicity, I hard-coded the year and month variables in the code above. In real life, I use other directories and pass these values as command-line arguments.
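A minimal sketch of how the hard-coded values could be read from the command line instead. The class name ConverterSettings, the argument order (year, month, folder) and the validation rules are assumptions made for illustration; they are not taken from the actual tool.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder for the values that are hard-coded in the listing above.
class ConverterSettings {
    final String year;
    final String month;
    final Path folder;

    private ConverterSettings(String year, String month, Path folder) {
        this.year = year;
        this.month = month;
        this.folder = folder;
    }

    // Assumed argument order: <year> <month> <folder>
    static ConverterSettings fromArgs(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException("Usage: <year> <month> <folder>");
        }
        // Year must be 4 digits, month must be 01..12 (same format as in the upload URL)
        if (!args[0].matches("\\d{4}") || !args[1].matches("0[1-9]|1[0-2]")) {
            throw new IllegalArgumentException("Expected a 4-digit year and a 2-digit month");
        }
        return new ConverterSettings(args[0], args[1], Paths.get(args[2]));
    }

    // The URL prefix that readImages() builds for every <img> tag.
    String uploadsUrl() {
        return "https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/";
    }

    public static void main(String[] args) {
        ConverterSettings settings = fromArgs(args);
        System.out.println(settings.uploadsUrl());
    }
}
```

With something like this in place, readImages() could receive the settings as a parameter instead of declaring year, month and stringFolder itself.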

When I run the current Java code, the output is as follows (focusing on the part mentioned above):

[Image: lameriks 2020 04 9]

HTML Links – Hyperlinks

As mentioned in part 1 of this article, over the years I have used a helper tool, which I wrote in Java, to help me with, for example, replacing a hyperlink place holder (text like [http://www.a.b.c.html]) with an actual hyperlink. That way, I can skip the extra action otherwise needed in Word.

Below you can see an example of a hyperlink place holder (in Subscript) in my Word document:

[Image: lameriks 2020 04 10]

When I run the current Java code, the output is as follows (focusing on the part mentioned above):

[Image: lameriks 2020 04 11]

Again, my plan was to reuse the Java code from my helper tool.

With the helper tool, I was already able to automatically replace the hyperlink place holder (text like [http://www.a.b.c.html]) with a tag like:

<a href="http://www.a.b.c.html"><sub>[http://www.a.b.c.html]</sub></a>

In order to reuse the Java code from my helper tool, I made the following changes to method formatHtml of class MyClass1: a new variable subText and, in the else branch, the handling of the <sub> tag.
[I highlighted the changes in bold]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Set;

public class MyClass1 {
    private static HashMap<String, String> imagesMap = new HashMap<String, String>();

    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n");
        html = html.replaceAll("\r\n</td>", "</td>\r\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            readImages();
            writer.write(formatHtml(html));
        }
    }

    private static void readImages() {
        String year = "2020";
        String month = "01";
        String stringFolder = "C:\\My\\My Documents\\AMIS\\";
        File folder = new File(stringFolder + "images\\");

        FilenameFilter imagesFileFilter = new FilenameFilter()
        {
            @Override
            public boolean accept(File dir, String name)
            {
                if ( name.startsWith("lameriks") && (name.endsWith(".jpg") || name.endsWith(".png")) )
                {
                    return true;
                }
                else
                {
                    return false;
                }
            }
        };

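        File[] files = folder.listFiles(imagesFileFilter);
        if (files == null)
        {
            // listFiles returns null when the folder does not exist or cannot be read
            return;
        }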
        for (File file : files)
        {
            try{
                File imageFile = new File(file.getAbsolutePath());
                BufferedImage image = ImageIO.read(imageFile);
                String name = file.getName();
                String key = name.substring(name.lastIndexOf("_")+1,name.lastIndexOf("."));
                String value = "<img src=\"https://technology.amis.nl/wp-content/uploads/" + year + "/" + month + "/" + name + "\" alt=\"\" width=\"" + image.getWidth() + "\" height=\"" + image.getHeight() + "\" />";
                imagesMap.put(key , value);
            } catch (Exception ex){
                ex.printStackTrace();
            }
        }
    }

    private static String formatHtml(String html) {
        StringBuilder formattedHtml = new StringBuilder();
        // split by new line
        String[] lines = html.split("\\r\\n");
        for (String line : lines) {
            int position = line.indexOf(".jpg");
            if (position == -1) {
                position = line.indexOf(".png");
            }
            String imageNumber;
            String imageTag;
            String subText;
            if (position != -1) {
                imageNumber = line.substring(0, position);
                imageTag = imagesMap.get(imageNumber);
                if (imageTag != null) {
                    // First delete the last CRLF. That was the place where the original HTML image was located (with the media type data encoded as base64 and embedded directly inline)
                    formattedHtml.delete(formattedHtml.length() - 2, formattedHtml.length());
                    formattedHtml.append(imageTag);
                } else {
                    formattedHtml.append(line);
                }
            } else {
                int startPosition = line.indexOf("<sub>");
                int endPosition = line.indexOf("</sub>");
                if (startPosition != -1 && endPosition > startPosition) {
                    // Take the text between <sub> and </sub> and strip the square brackets of the place holder
                    subText = line.substring(startPosition + "<sub>".length(), endPosition);
                    subText = subText.replace("[", "");
                    subText = subText.replace("]", "");
                    // Keep any text before and after the place holder
                    formattedHtml.append(line, 0, startPosition);
                    formattedHtml.append("<a href=\"" + subText + "\"><sub>[" + subText + "]</sub></a>");
                    formattedHtml.append(line.substring(endPosition + "</sub>".length()));
                } else {
                    formattedHtml.append(line);
                }
            }
            formattedHtml.append("\r\n");
        }
        return formattedHtml.toString();
    }
}
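As an alternative to the indexOf approach in the listing above, the place holder could also be matched with a regular expression. The sketch below is not the code from my helper tool; the class name LinkPlaceholders and the pattern are assumptions for illustration. It only touches subscript text that looks like [http://...] or [https://...], so other subscript text is left alone, and it handles more than one place holder per line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate the regular expression variant.
class LinkPlaceholders {
    // Matches <sub>[url]</sub> where url starts with http:// or https://
    // and contains no whitespace and no closing square bracket.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<sub>\\[\\s*(https?://[^\\]\\s]+)\\s*\\]</sub>");

    static String replaceLinks(String line) {
        Matcher matcher = PLACEHOLDER.matcher(line);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String url = matcher.group(1);
            String anchor = "<a href=\"" + url + "\"><sub>[" + url + "]</sub></a>";
            // quoteReplacement prevents '$' and '\' in the URL from being read as group references
            matcher.appendReplacement(result, Matcher.quoteReplacement(anchor));
        }
        // Copy the rest of the line after the last match
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceLinks("See <sub>[http://www.a.b.c.html]</sub> for details."));
    }
}
```

In formatHtml, a call like formattedHtml.append(LinkPlaceholders.replaceLinks(line)) could then take the place of the whole <sub> branch.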

When I run the current Java code, the output is as follows (focusing on the part mentioned above):

[Image: lameriks 2020 04 12]

After that, I compared the manually changed HTML from WordPress with the HTML converter output and found that the differences were reduced even further and were, overall, good enough for the time being.
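Such a comparison can be done with any diff tool. Purely as an illustration, the sketch below reports the line positions at which two HTML texts differ. The class name HtmlCompare and the report format are assumptions, and it is a simple position-by-position comparison, not a real diff algorithm: one inserted line makes all following lines show up as different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compares two HTML texts line by line, by position.
class HtmlCompare {
    static List<String> differingLines(String expected, String actual) {
        // Accept both CRLF and LF line endings; -1 keeps trailing empty lines
        String[] left = expected.split("\\r?\\n", -1);
        String[] right = actual.split("\\r?\\n", -1);
        List<String> differences = new ArrayList<>();
        int max = Math.max(left.length, right.length);
        for (int i = 0; i < max; i++) {
            String l = i < left.length ? left[i] : "<missing>";
            String r = i < right.length ? right[i] : "<missing>";
            if (!l.equals(r)) {
                differences.add("line " + (i + 1) + ": expected [" + l + "] but got [" + r + "]");
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        String manual = "<h2>Title</h2>\r\n<b>text</b>";
        String converted = "<h2>Title</h2>\r\n<strong>text</strong>";
        for (String difference : differingLines(manual, converted)) {
            System.out.println(difference);
        }
    }
}
```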

So now it’s time to conclude this final part of the article. I shared with you the steps I took to further automate the manual steps I still had to take every time I wrote a blog article. My goal is, of course, to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, when using the Word document of a previous blog article as input.
In part 1, I gave a global overview, via debugging, of how the HTML converter (called Mammoth, created by Michael Williamson) works. In part 2, I focused on text color, text font and using custom style maps for my Word styles “Code”, “Heading” and “Output”. In this final part of the article, I focused on unordered HTML lists (<ul> and <li>), HTML tables (<table>), HTML images (<img>) and hyperlinks.

I have already used my version of the HTML converter to publish articles and am pleased with the (time-saving) results so far. Of course, depending on the content of my future articles, I will need to make further changes to the code.