Writing a blog in Word, automating HTML formatting by using a .docx to HTML converter for Java and publishing the blog via WordPress (part 2)

Marc Lameriks

In part 1 of this article, I shared the steps I took to further automate the manual work I still had to do every time I wrote a blog article. By stepping through the code in a debugger, I gave a global overview of how the HTML converter (Mammoth, created by Michael Williamson) works.
[https://technology.amis.nl/2020/02/26/writing-a-blog-in-word-automating-html-formatting-by-using-a-docx-to-html-converter-for-java-and-publishing-the-blog-via-wordpress-part-1/]

In this article I dive deeper into the code and share the changes I made to tackle some of the differences between the manually changed HTML from WordPress and the HTML converter output, when using the Word document of a previous blog article as input. I focus on text color, text font and using custom style maps for my Word styles “Code”, “Heading” and “Output”.

As I mentioned in part 1 of this article, the manually changed HTML from WordPress is very different from the HTML converter output. For example, in the converter output:

  • There are a lot of paragraph tags (<p>)
  • The <strong> tag is used for bold font weight
  • The <em> tag is used for italic font style
  • Text colors like blue or purple are absent

I described that I tackled some of these differences by making changes to the default style mappings. In order to get rid of the paragraph tags, I used the replaceAll method on the output of the converter.

I created a Java class with a main method.

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u");
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\n");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}
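
Because <p> and </p> are fixed strings rather than patterns, the same post-processing could also be written with String.replace, which does not interpret its first argument as a regular expression. Below is a minimal sketch of that idea. The class ParagraphTagStripper and its method strip are hypothetical names used only for illustration; they are not part of Mammoth or of the code above.

```java
final class ParagraphTagStripper {
    private ParagraphTagStripper() {
    }

    /**
     * Removes every literal "<p>" start tag and turns every "</p>" end tag
     * into a line break. Start tags with attributes (for example
     * <p class="x">) are not matched by this sketch.
     */
    static String strip(String html) {
        return html
            .replace("<p>", "")
            .replace("</p>", "\n");
    }

    public static void main(String[] args) {
        // Hypothetical input, comparable to the converter output for two paragraphs.
        System.out.println(strip("<p>First paragraph</p><p>Second paragraph</p>"));
    }
}
```

Keeping this step in a small method of its own also makes it easy to try it on a short string before running the whole conversion.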

As I mentioned in part 1 of this article, the main steps of the HTML converter are:

  • Read the Word document (including all the children of its w:body element)
  • Convert the document to a list of HTML elements
  • Write the HTML elements to a string via a StringBuilder

And I added to this:

  • Write the string to a file

Text color

One of the differences mentioned above is that text colors like blue or purple are absent.

Below you see an example of text color use in an actual published blog article:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]

With the current Java code, when I run it, and after applying pretty print (in Notepad++), the output is (with the focus on the part mentioned above):

Text font

One of the other differences is that text fonts like Courier New are absent.

Below you see an example of text font use in an actual published blog article:
[https://technology.amis.nl/2020/01/15/rapidly-spinning-up-a-vm-with-ubuntu-and-k3s-with-the-kubernetes-dashboard-on-my-windows-laptop-using-vagrant-and-oracle-virtualbox/]

By inspecting the parameter nodes of method readElements in class StatefulBodyXmlReader in the debugger, I found the XmlElement that contains the font.

With the current Java code, when I run it, the output is (with the focus on the part mentioned above):

Method readRun of class StatefulBodyXmlReader

As I mentioned in part 1 of this article, the Run object is important with regard to properties, such as formatting. For example, it knows if the children are in Bold, Italic, etc.

Here is an example of a Run object being created:

Using text font and color

I changed class Run, in order to handle text font and color:
[in bold, I highlighted the changes]

package org.zwobble.mammoth.internal.documents;

import java.util.List;
import java.util.Optional;

public class Run implements DocumentElement, HasChildren {
    private final boolean isBold;
    private final boolean isItalic;
    private final boolean isUnderline;
    private final boolean isStrikethrough;
    private final boolean isSmallCaps;
    private final VerticalAlignment verticalAlignment;
    private final Optional<Style> style;
    private final Optional<String> font;
    private final Optional<String> color;
    private final List<DocumentElement> children;

    public Run(
        boolean isBold,
        boolean isItalic,
        boolean isUnderline,
        boolean isStrikethrough,
        boolean isSmallCaps,
        VerticalAlignment verticalAlignment,
        Optional<Style> style,
        Optional<String> font,
        Optional<String> color,
        List<DocumentElement> children
    ) {
        this.isBold = isBold;
        this.isItalic = isItalic;
        this.isUnderline = isUnderline;
        this.isStrikethrough = isStrikethrough;
        this.isSmallCaps = isSmallCaps;
        this.verticalAlignment = verticalAlignment;
        this.style = style;
        this.font = font;
        this.color = color;
        this.children = children;
    }

    public boolean isBold() {
        return isBold;
    }

    public boolean isItalic() {
        return isItalic;
    }

    public boolean isUnderline() {
        return isUnderline;
    }

    public boolean isStrikethrough() {
        return isStrikethrough;
    }

    public boolean isSmallCaps() {
        return isSmallCaps;
    }

    public VerticalAlignment getVerticalAlignment() {
        return verticalAlignment;
    }

    public Optional<Style> getStyle() {
        return style;
    }

    public Optional<String> getFont() {
        return font;
    }

    public Optional<String> getColor() {
        return color;
    }
    
    public List<DocumentElement> getChildren() {
        return children;
    }

    @Override
    public <T, U> T accept(DocumentElementVisitor<T, U> visitor, U context) {
        return visitor.visit(this, context);
    }
}

And I changed method readRun from class StatefulBodyXmlReader:
[in bold, I highlighted the changes]

    private ReadResult readRun(XmlElement element) {
        XmlElementLike properties = element.findChildOrEmpty("w:rPr");
        return ReadResult.map(
            readRunStyle(properties),
            readElements(element.getChildren()),
            (style, children) -> {
                Optional<String> hyperlinkHref = currentHyperlinkHref();
                if (hyperlinkHref.isPresent()) {
                    children = list(Hyperlink.href(hyperlinkHref.get(), Optional.empty(), children));
                }

                Optional<String> font = Optional.empty();
                Optional<XmlElement> fontsXmlElement = properties.findChild("w:rFonts");
                if (fontsXmlElement.isPresent()) {
                    font = fontsXmlElement.get().getAttributeOrNone("w:ascii");
                }

                Optional<String> color = readVal(properties, "w:color");
                return new Run(
                    isBold(properties),
                    isItalic(properties),
                    isUnderline(properties),
                    isStrikethrough(properties),
                    isSmallCaps(properties),
                    readVerticalAlignment(properties),
                    style,
                    font,
                    color,
                    children
                );
            }
        );
    }
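
To make it easier to see which parts of the Word XML are read here, the following sketch pulls the same two values (the w:ascii attribute of w:rFonts and the w:val attribute of w:color) out of a run using only the JDK DOM parser. It does not use Mammoth's XmlElement API. The class name RunPropertiesSketch, its methods and the sample XML are assumptions made up for illustration; a real document contains more elements and attributes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RunPropertiesSketch {
    private RunPropertiesSketch() {
    }

    // Hypothetical run: blue text in Courier New.
    static final String SAMPLE_RUN =
        "<w:r xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
        + "<w:rPr>"
        + "<w:rFonts w:ascii=\"Courier New\" w:hAnsi=\"Courier New\"/>"
        + "<w:color w:val=\"0070C0\"/>"
        + "</w:rPr>"
        + "<w:t>kubectl get nodes</w:t>"
        + "</w:r>";

    /** Parses an XML fragment and returns its root element. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns an attribute of the first descendant with the given tag name, if both exist. */
    static Optional<String> readAttribute(Element root, String tagName, String attributeName) {
        NodeList matches = root.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return Optional.empty();
        }
        Element element = (Element) matches.item(0);
        return element.hasAttribute(attributeName)
            ? Optional.of(element.getAttribute(attributeName))
            : Optional.empty();
    }

    public static void main(String[] args) {
        Element run = parse(SAMPLE_RUN);
        System.out.println(readAttribute(run, "w:rFonts", "w:ascii"));
        System.out.println(readAttribute(run, "w:color", "w:val"));
    }
}
```

As in readRun, a run without w:rFonts or w:color simply yields an empty Optional, so the caller can decide on a default.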

Method visit(Run run, Context context) of class DocumentToHtml

A Document object with Paragraph and Run objects is converted to a list of HtmlElement objects.

I wanted to be able to create tags like:

I added code to method visit(Run run, Context context) of class DocumentToHtml:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Run run, Context context) {
            Supplier<List<HtmlNode>> nodes = () -> convertChildrenToHtml(run, context);
            if (run.isSmallCaps()) {
                nodes = styleMap.getSmallCaps().orElse(HtmlPath.EMPTY).wrap(nodes);
            }
            if (run.isStrikethrough()) {
                nodes = styleMap.getStrikethrough().orElse(HtmlPath.collapsibleElement("s")).wrap(nodes);
            }
            if (run.isUnderline()) {
                nodes = styleMap.getUnderline().orElse(HtmlPath.EMPTY).wrap(nodes);
            }
            if (run.getVerticalAlignment() == VerticalAlignment.SUBSCRIPT) {
                nodes = HtmlPath.collapsibleElement("sub").wrap(nodes);
            }
            if (run.getVerticalAlignment() == VerticalAlignment.SUPERSCRIPT) {
                nodes = HtmlPath.collapsibleElement("sup").wrap(nodes);
            }
            if (run.isItalic()) {
                nodes = styleMap.getItalic().orElse(HtmlPath.collapsibleElement("em")).wrap(nodes);
            }
            if (run.isBold()) {
                nodes = styleMap.getBold().orElse(HtmlPath.collapsibleElement("strong")).wrap(nodes);
            }
            if (run.getFont().isPresent()) {
                // Any explicit run font is rendered as Courier New; the color defaults to black.
                String color = "black";
                if (run.getColor().isPresent()) {
                    color = convertColor(run.getColor().get());
                }
                nodes = HtmlPath.collapsibleElement("a", map("style", "font-family: Courier New; color: " + color + ";")).wrap(nodes);
            } else if (run.getColor().isPresent()) {
                // No explicit font: only the text color is rendered.
                String color = convertColor(run.getColor().get());
                nodes = HtmlPath.collapsibleElement("a", map("style", "color: " + color + ";")).wrap(nodes);
            }
            HtmlPath mapping = styleMap.getRunHtmlPath(run)
                .orElseGet(() -> {
                    if (run.getStyle().isPresent()) {
                        warnings.add("Unrecognised run style: " + run.getStyle().get().describe());
                    }
                    return HtmlPath.EMPTY;
                });
            return mapping.wrap(nodes).get();
        }
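
The style attribute built in the two branches above follows one rule: an optional font-family part followed by a color part that falls back to black. The sketch below isolates that rule in a small helper. RunStyleSketch and styleFor are hypothetical names, and unlike the code above the sketch writes the actual font name instead of always writing Courier New; that difference is an assumption about what might be wanted, not something the converter does.

```java
import java.util.Optional;

final class RunStyleSketch {
    private RunStyleSketch() {
    }

    /**
     * Builds the value of a style attribute from an optional font name and an
     * optional CSS color. Returns empty when neither is present, so the caller
     * can skip wrapping the run in an extra element.
     */
    static Optional<String> styleFor(Optional<String> font, Optional<String> cssColor) {
        if (!font.isPresent() && !cssColor.isPresent()) {
            return Optional.empty();
        }
        StringBuilder style = new StringBuilder();
        font.ifPresent(name -> style.append("font-family: ").append(name).append("; "));
        style.append("color: ").append(cssColor.orElse("black")).append(";");
        return Optional.of(style.toString());
    }

    public static void main(String[] args) {
        System.out.println(styleFor(Optional.of("Courier New"), Optional.of("blue")));
        System.out.println(styleFor(Optional.empty(), Optional.of("purple")));
    }
}
```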

And I added a private method:

        private String convertColor(String color) {
            String convertedColor = "#AMIS_color_could_not_be_converted#";
            switch(color)
            {
                case "E32219":
                    convertedColor = "red";
                    break;
                case "0070C0":
                    convertedColor = "blue";
                    break;
                case "7030A0":
                    convertedColor = "purple";
                    break;
                case "00B050":
                    // The CSS color keyword is written without a space.
                    convertedColor = "darkgreen";
                    break;
            }
            return convertedColor;
        }
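
The switch above only knows the four colors I use in my Word documents. A possible variation, sketched below under my own assumptions, keeps the same four names in a lookup table and passes any other six-digit hex value through as a CSS hex color instead of returning the marker string. ColorConverterSketch and convert are hypothetical names; the fallback to black for values that are not six hex digits (Word can, for example, store the value auto) is also an assumption. The sketch uses Map.of and therefore needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

final class ColorConverterSketch {
    // Same Word hex values as in convertColor, mapped to CSS color keywords.
    private static final Map<String, String> NAMED_COLORS = Map.of(
        "E32219", "red",
        "0070C0", "blue",
        "7030A0", "purple",
        "00B050", "darkgreen"
    );

    private ColorConverterSketch() {
    }

    /** Converts the w:val of a w:color element to a CSS color value. */
    static String convert(String wordColor) {
        String key = wordColor.toUpperCase(Locale.ROOT);
        String named = NAMED_COLORS.get(key);
        if (named != null) {
            return named;
        }
        if (key.matches("[0-9A-F]{6}")) {
            // Unknown but well-formed: pass it through as a CSS hex color.
            return "#" + key;
        }
        // Assumption: anything else (such as "auto") falls back to black.
        return "black";
    }

    public static void main(String[] args) {
        System.out.println(convert("0070C0"));
        System.out.println(convert("123abc"));
    }
}
```

The trade-off is that an unexpected color no longer stands out in the generated HTML, which is exactly what the marker string in convertColor is for.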

With the current Java code, when I run it, the output is (with the focus on the part mentioned earlier with regard to text font):

And another example of output is (with the focus on the part mentioned earlier with regard to text color):

Using styles

In the Word document of my blog article, I use styles to format, for example, headings (“Heading”), source code (“Code”) and command output (“Output”).

I wanted to be able to create the following tags, respectively for the styles “Code”, “Heading” and “Output”:

As mentioned in part 1 of this article, by default the HTML converter maps some common .docx styles to HTML elements.

So, I added code to implement the styles mentioned above:
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')");
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\n");
        //html = html.replaceAll("\n</td>", "</td>");
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}

Remark about freshness:
When generating, the HTML converter will only close an HTML element when necessary. Otherwise, elements are reused. For instance, if the HTML converter encounters a .docx paragraph with the style name Heading 1, the .docx paragraph is converted to a h1 element with the same text. If the next .docx paragraph also has the style name Heading 1, then the text of that paragraph will be appended to the existing h1 element, rather than creating a new h1 element. In most cases, you’ll probably want to generate a new h1 element instead. You can specify this by using the :fresh modifier.
[https://github.com/mwilliamson/java-mammoth#freshness]

Remark about separators:
To specify a separator to place between the contents of paragraphs that are collapsed together, use :separator('SEPARATOR STRING').
For instance, suppose a document contains a block of code where each line of code is a paragraph with the style Code Block. We can write a style mapping to map such paragraphs to <pre> elements:
p[style-name='Code Block'] => pre
Since pre isn’t marked as :fresh, consecutive pre elements will be collapsed together. However, this results in the code all being on one line. We can use :separator to insert a newline between each line of code.
[https://github.com/mwilliamson/java-mammoth#separators]
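
The two remarks above can be illustrated with a strongly simplified model: consecutive paragraphs that map to the same tag are merged into one element, joined by the separator, unless the mapping is marked as fresh. The sketch below is not Mammoth's implementation. CollapseSketch, Mapped and render are hypothetical names, and the model ignores attributes and nested elements.

```java
import java.util.Arrays;
import java.util.List;

final class CollapseSketch {
    private CollapseSketch() {
    }

    /** A paragraph together with the tag, freshness and separator of its style mapping. */
    static final class Mapped {
        final String tagName;
        final boolean fresh;
        final String separator;
        final String text;

        Mapped(String tagName, boolean fresh, String separator, String text) {
            this.tagName = tagName;
            this.fresh = fresh;
            this.separator = separator;
            this.text = text;
        }
    }

    /** Renders the paragraphs, reusing the open element when the mapping allows it. */
    static String render(List<Mapped> paragraphs) {
        StringBuilder html = new StringBuilder();
        Mapped open = null; // mapping of the element that is currently open
        for (Mapped paragraph : paragraphs) {
            boolean reuse = open != null
                && !paragraph.fresh
                && open.tagName.equals(paragraph.tagName);
            if (reuse) {
                html.append(paragraph.separator);
            } else {
                if (open != null) {
                    html.append("</").append(open.tagName).append(">");
                }
                html.append("<").append(paragraph.tagName).append(">");
            }
            html.append(paragraph.text);
            open = paragraph;
        }
        if (open != null) {
            html.append("</").append(open.tagName).append(">");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList(
            new Mapped("h2", true, "", "Heading A"),
            new Mapped("pre", false, "\n", "ls -l"),
            new Mapped("pre", false, "\n", "pwd"))));
    }
}
```

In this model the two pre paragraphs end up in a single pre element with a line break between them, while two fresh h2 paragraphs would each get their own element.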

Method readParagraph of class StatefulBodyXmlReader

Here is an example of a Paragraph object being created:

Method visit(Paragraph paragraph, Context context) of class DocumentToHtml

A Document object with Paragraph and Run objects is converted to a list of HtmlElement objects.

I wanted to be able to create tags like:

The default behavior of the HTML converter when using custom style maps didn’t quite fit my needs, because I also wanted to add some attributes to the start tag.

In order to get a feeling for how the converter works, I placed a breakpoint in method visit(Paragraph paragraph, Context context) of class DocumentToHtml, started debugging and, via styleMap and paragraphStyles, navigated through the list of StyleMapping objects.

Below you see the StyleMapping related to style “Code”:

Below you see the StyleMapping related to style “Output”:

As you can see in the examples above, the attributes in the HtmlTag are empty by default, and that’s not what I wanted. I could have chosen to manipulate the attributes value, but to do so I would have had to change code in several classes. For reasons of simplicity, and because of other functionality I wanted (as we will find out later), I chose to create the mapping myself, thereby overriding the default behavior of the HTML converter when using custom style maps.

In order to let the separators functionality work as with the default behavior of the HTML converter when using custom style maps, I first had to change class HtmlPath:
[in bold, I highlighted the changes]

package org.zwobble.mammoth.internal.styles;

import org.zwobble.mammoth.internal.html.HtmlNode;
import org.zwobble.mammoth.internal.html.HtmlTag;

import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

import static java.util.Arrays.asList;
import static org.zwobble.mammoth.internal.util.Lists.list;
import static org.zwobble.mammoth.internal.util.Maps.map;

public interface HtmlPath {
    HtmlPath EMPTY = new HtmlPathElements(list());
    HtmlPath IGNORE = Ignore.INSTANCE;

    static HtmlPath elements(HtmlPathElement... elements) {
        return new HtmlPathElements(asList(elements));
    }

    static HtmlPath element(String tagName) {
        return element(tagName, map());
    }

    static HtmlPath element(String tagName, Map<String, String> attributes) {
        HtmlTag tag = new HtmlTag(list(tagName), attributes, false, "");
        return new HtmlPathElements(list(new HtmlPathElement(tag)));
    }

    static HtmlPath collapsibleElement(String tagName) {
        return collapsibleElement(tagName, map());
    }

    static HtmlPath collapsibleElement(List<String> tagNames) {
        return collapsibleElement(tagNames, map());
    }

    static HtmlPath collapsibleElement(String tagName, Map<String, String> attributes) {
        return collapsibleElement(list(tagName), attributes);
    }

    static HtmlPath collapsibleElement(List<String> tagNames, Map<String, String> attributes) {
        HtmlTag tag = new HtmlTag(tagNames, attributes, true, "");
        return new HtmlPathElements(list(new HtmlPathElement(tag)));
    }

    static HtmlPath collapsibleElement(String tagName, Map<String, String> attributes, String separator) {
        return collapsibleElement(list(tagName), attributes, separator);
    }

    static HtmlPath collapsibleElement(List<String> tagNames, Map<String, String> attributes, String separator) {
        HtmlTag tag = new HtmlTag(tagNames, attributes, true, separator);
        return new HtmlPathElements(list(new HtmlPathElement(tag)));
    }

    Supplier<List<HtmlNode>> wrap(Supplier<List<HtmlNode>> generateNodes);
}

Then, I added code to method visit(Paragraph paragraph, Context context) of class DocumentToHtml overriding in that way the default behavior of the HTML converter:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Paragraph paragraph, Context context) {
            Supplier<List<HtmlNode>> children = () -> {
                List<HtmlNode> content = convertChildrenToHtml(paragraph, context);
                return preserveEmptyParagraphs ? cons(Html.FORCE_WRITE, content) : content;
            };
            HtmlPath mapping = styleMap.getParagraphHtmlPath(paragraph)
                .orElseGet(() -> {
                    if (paragraph.getStyle().isPresent()) {
                        warnings.add("Unrecognised paragraph style: " + paragraph.getStyle().get().describe());
                    }
                    return HtmlPath.element("p");
                });
            if (paragraph.getStyle().isPresent()) {
                System.out.println("****Paragraph style: " + paragraph.getStyle().get().getName().get());
                if (paragraph.getStyle().get().getName().get().equals("Code")) {
                    mapping = HtmlPath.collapsibleElement("pre", map("style", "font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px;"), "\r\n");
                } else if (paragraph.getStyle().get().getName().get().equals("Output")) {
                    mapping = HtmlPath.collapsibleElement("a", map("style", "font-family: Courier New; color: black;"), "\r\n");
                }
            }
            return mapping.wrap(children).get();
        }

With the current Java code, when I run it, the output is (with the focus on a part related to style “Code”), containing the pre tag including attribute style:

With the current Java code, when I run it, the output is (with the focus on a part related to style “Output”), containing the a tag including attribute style:

Minimizing differences

My goal is of course to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, when using the Word document of a previous blog article as input.

Below is an example of the beginning of the manually changed HTML from WordPress, after applying pretty print (in Notepad++):

Based on the example above, and also with an eye on, for example, the unordered HTML list (<ul> and <li>), I wanted to add some formatting functionality (pretty print) to the HTML converter.

With the current Java code, when I run it, the output is with regard to an unordered HTML list:

So, when constructing the HTML, I wanted to be able to:

  • Replace a whitespace with &nbsp;
  • Have a begin tag followed by a new line
  • Set the begin tag indent level
  • Have an end tag followed by a new line
  • Have an end tag preceded by a new line

So, I changed class HtmlTag, in order to handle formatting:
[in bold, I highlighted the changes]

package org.zwobble.mammoth.internal.html;

import java.util.List;
import java.util.Map;

public class HtmlTag {
    private final List<String> tagNames;
    private final Map<String, String> attributes;
    private final boolean isCollapsible;
    private final String separator;
    private final boolean replaceWhitespaces;
    private final boolean beginTagFollowedByNewLine;
    private final int beginTagIndentLevel;
    private final boolean endTagFollowedByNewLine;
    private final boolean endTagPrecededByNewLine;

    public HtmlTag(List<String> tagNames, Map<String, String> attributes, boolean isCollapsible, String separator) {
        this.tagNames = tagNames;
        this.attributes = attributes;
        this.isCollapsible = isCollapsible;
        this.separator = separator;
        this.replaceWhitespaces = false;
        this.beginTagFollowedByNewLine = false;
        this.beginTagIndentLevel = 0;
        this.endTagFollowedByNewLine = false;
        this.endTagPrecededByNewLine = false;
    }

    public HtmlTag(List<String> tagNames, Map<String, String> attributes, boolean isCollapsible, String separator, boolean replaceWhitespaces, boolean beginTagFollowedByNewLine, int beginTagIndentLevel, boolean endTagFollowedByNewLine, boolean endTagPrecededByNewLine) {
        this.tagNames = tagNames;
        this.attributes = attributes;
        this.isCollapsible = isCollapsible;
        this.separator = separator;
        this.replaceWhitespaces = replaceWhitespaces;
        this.beginTagFollowedByNewLine = beginTagFollowedByNewLine;
        this.beginTagIndentLevel = beginTagIndentLevel;
        this.endTagFollowedByNewLine = endTagFollowedByNewLine;
        this.endTagPrecededByNewLine = endTagPrecededByNewLine;
    }

    public List<String> getTagNames() {
        return tagNames;
    }

    public Map<String, String> getAttributes() {
        return attributes;
    }

    public boolean isCollapsible() {
        return isCollapsible;
    }

    public String getSeparator() {
        return separator;
    }

    public boolean doReplaceWhitespaces() {
        return replaceWhitespaces;
    }

    public boolean isBeginTagFollowedByNewLine() {
        return beginTagFollowedByNewLine;
    }

    public int getBeginTagIndentLevel() {
        return beginTagIndentLevel;
    }

    public boolean isEndTagFollowedByNewLine() {
        return endTagFollowedByNewLine;
    }

    public boolean isEndTagPrecededByNewLine() {
        return endTagPrecededByNewLine;
    }
}

I also added the following methods in class HtmlElement:

    public boolean doReplaceWhitespaces() {
        return tag.doReplaceWhitespaces();
    }

    public boolean isBeginTagFollowedByNewLine() {
        return tag.isBeginTagFollowedByNewLine();
    }

    public int getBeginTagIndentLevel() {
        return tag.getBeginTagIndentLevel();
    }

    public boolean isEndTagFollowedByNewLine() {
        return tag.isEndTagFollowedByNewLine();
    }

    public boolean isEndTagPrecededByNewLine() {
        return tag.isEndTagPrecededByNewLine();
    }

In order to let the formatting functionality work, I changed the recently added collapsibleElement methods in class HtmlPath:
[in bold, I highlighted the changes]

    static HtmlPath collapsibleElement(String tagName, Map<String, String> attributes, String separator, boolean replaceWhitespaces, boolean beginTagFollowedByNewLine, int beginTagIndentLevel, boolean endTagFollowedByNewLine, boolean endTagPrecededByNewLine) {
        return collapsibleElement(list(tagName), attributes, separator, replaceWhitespaces, beginTagFollowedByNewLine, beginTagIndentLevel, endTagFollowedByNewLine, endTagPrecededByNewLine);
    }

    static HtmlPath collapsibleElement(List<String> tagNames, Map<String, String> attributes, String separator, boolean replaceWhitespaces, boolean beginTagFollowedByNewLine, int beginTagIndentLevel, boolean endTagFollowedByNewLine, boolean endTagPrecededByNewLine) {
        HtmlTag tag = new HtmlTag(tagNames, attributes, true, separator, replaceWhitespaces, beginTagFollowedByNewLine, beginTagIndentLevel, endTagFollowedByNewLine, endTagPrecededByNewLine);
        return new HtmlPathElements(list(new HtmlPathElement(tag)));
    }
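
With eight parameters, five of them formatting flags, a call such as collapsibleElement("pre", attributes, "\r\n", true, true, 0, true, true) is easy to get wrong. One possible alternative, shown here only as a sketch, is to group the five formatting settings in a small immutable holder with named methods. TagFormatting and its method names are hypothetical; nothing like this exists in Mammoth or in the changes described in this article.

```java
final class TagFormatting {
    /** No formatting at all, comparable to the defaults of the four-argument HtmlTag constructor. */
    static final TagFormatting NONE = new TagFormatting(false, false, 0, false, false);

    final boolean replaceWhitespaces;
    final boolean beginTagFollowedByNewLine;
    final int beginTagIndentLevel;
    final boolean endTagFollowedByNewLine;
    final boolean endTagPrecededByNewLine;

    private TagFormatting(boolean replaceWhitespaces, boolean beginTagFollowedByNewLine,
                          int beginTagIndentLevel, boolean endTagFollowedByNewLine,
                          boolean endTagPrecededByNewLine) {
        this.replaceWhitespaces = replaceWhitespaces;
        this.beginTagFollowedByNewLine = beginTagFollowedByNewLine;
        this.beginTagIndentLevel = beginTagIndentLevel;
        this.endTagFollowedByNewLine = endTagFollowedByNewLine;
        this.endTagPrecededByNewLine = endTagPrecededByNewLine;
    }

    // Each method returns a copy with one setting changed; the original is untouched.
    TagFormatting withReplacedWhitespaces() {
        return new TagFormatting(true, beginTagFollowedByNewLine, beginTagIndentLevel,
            endTagFollowedByNewLine, endTagPrecededByNewLine);
    }

    TagFormatting withNewLineAfterBeginTag() {
        return new TagFormatting(replaceWhitespaces, true, beginTagIndentLevel,
            endTagFollowedByNewLine, endTagPrecededByNewLine);
    }

    TagFormatting withBeginTagIndentLevel(int level) {
        return new TagFormatting(replaceWhitespaces, beginTagFollowedByNewLine, level,
            endTagFollowedByNewLine, endTagPrecededByNewLine);
    }

    TagFormatting withNewLineAfterEndTag() {
        return new TagFormatting(replaceWhitespaces, beginTagFollowedByNewLine, beginTagIndentLevel,
            true, endTagPrecededByNewLine);
    }

    TagFormatting withNewLineBeforeEndTag() {
        return new TagFormatting(replaceWhitespaces, beginTagFollowedByNewLine, beginTagIndentLevel,
            endTagFollowedByNewLine, true);
    }

    public static void main(String[] args) {
        // The same settings as the positional arguments true, true, 0, true, true.
        TagFormatting codeBlock = TagFormatting.NONE
            .withReplacedWhitespaces()
            .withNewLineAfterBeginTag()
            .withNewLineAfterEndTag()
            .withNewLineBeforeEndTag();
        System.out.println(codeBlock.replaceWhitespaces + " " + codeBlock.beginTagIndentLevel);
    }
}
```

A holder like this could then be passed as a single extra parameter, at the cost of one more class to maintain in the modified converter.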

I also changed the visit method in class HtmlWriter:
[in bold, I highlighted the changes]

            @Override
            public void visit(HtmlElement element) {
                String beginTag = "<";
                String endTag = "</";
                builder.append(beginTag).append(element.getTagName());

                HtmlWriter.generateAttributes(element.getAttributes(), builder);

                if (element.isVoid()) {
                    builder.append(" />");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                } else {
                    builder.append(">");
                    if (element.isBeginTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }

                    if (element.doReplaceWhitespaces()) {
                        StringBuilder childrenBuilder = new StringBuilder();
                        element.getChildren().forEach(child -> write(child, childrenBuilder));
                        builder.append(childrenBuilder.toString().replace(" ", "&nbsp;"));
                    } else {
                        element.getChildren().forEach(child -> write(child, builder));
                    }

                    if (element.isEndTagPrecededByNewLine()) {
                        endTag = "\r\n" + endTag;
                    }
                    builder
                        .append(endTag)
                        .append(element.getTagName())
                        .append(">");

                    if (element.isEndTagFollowedByNewLine()) {
                        builder.append("\r\n");
                    }
                }
            }
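
One thing to keep in mind with the whitespace replacement above: it runs over the already serialized children, so as far as I can tell a space inside a nested start tag (for example between a tag name and its style attribute) would be replaced as well. A possible refinement, sketched below, only replaces spaces that are outside of tags. WhitespaceSketch and replaceSpacesOutsideTags are hypothetical names, and the sketch assumes that a literal < or > only occurs as part of a tag, which should hold when text and attribute values are escaped by the writer.

```java
final class WhitespaceSketch {
    private WhitespaceSketch() {
    }

    /** Replaces each space in text content with &nbsp; and leaves tag markup untouched. */
    static String replaceSpacesOutsideTags(String html) {
        StringBuilder result = new StringBuilder(html.length());
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;
            } else if (c == '>') {
                insideTag = false;
            }
            if (c == ' ' && !insideTag) {
                result.append("&nbsp;");
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Hypothetical serialized children of a "Code" paragraph with one colored run.
        System.out.println(replaceSpacesOutsideTags(
            "kubectl get <a style=\"color: blue;\">pods -A</a>"));
    }
}
```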

In order to use the formatting functionality, I changed the code in method visit(Paragraph paragraph, Context context) of class DocumentToHtml:
[in bold, I highlighted the changes]

        @Override
        public List<HtmlNode> visit(Paragraph paragraph, Context context) {
            Supplier<List<HtmlNode>> children = () -> {
                List<HtmlNode> content = convertChildrenToHtml(paragraph, context);
                return preserveEmptyParagraphs ? cons(Html.FORCE_WRITE, content) : content;
            };
            HtmlPath mapping = styleMap.getParagraphHtmlPath(paragraph)
                .orElseGet(() -> {
                    if (paragraph.getStyle().isPresent()) {
                        warnings.add("Unrecognised paragraph style: " + paragraph.getStyle().get().describe());
                    }
                    return HtmlPath.element("p");
                });
            if (paragraph.getStyle().isPresent()) {
                System.out.println("****Paragraph style: " + paragraph.getStyle().get().getName().get());
                if (paragraph.getStyle().get().getName().get().equals("Code")) {
                    mapping = HtmlPath.collapsibleElement("pre", map("style", "font-family: Courier New; color: black; border: 1px solid gray; padding: 5px; margin-top: 10px; margin-bottom: 10px;"), "\r\n", true, true, 0, true, true);
                } else if (paragraph.getStyle().get().getName().get().equals("Output")) {
                    mapping = HtmlPath.collapsibleElement("a", map("style", "font-family: Courier New; color: black;"), "\r\n", true, true, 0, true, true);
                }
            }
            return mapping.wrap(children).get();
        }

Empty paragraphs and New Line

In order to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, I made the following changes to method main of class MyClass1.
[in bold, I highlighted the changes]

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

public class MyClass1 {
    public static void main(String[] args) throws IOException {

        DocumentConverter converter = new DocumentConverter()
                .addStyleMap("b => b")
                .addStyleMap("i => i")
                .addStyleMap("u => u")
                .addStyleMap("p[style-name='Heading'] => h2:fresh")
                .addStyleMap("p[style-name='Code'] => pre:separator('\\n')")
                .addStyleMap("p[style-name='Output'] => a:separator('\\n')")
                .preserveEmptyParagraphs();
        Result<String> result = converter.convertToHtml(new File("C:\\My\\My Documents\\AMIS\\myarticle.docx"));
        String html = result.getValue(); // The generated HTML
        html = html.replaceAll("</h2>", "</h2>\r\n");
        html = html.replaceAll("<p>", "");
        html = html.replaceAll("</p>", "\r\n"); 
        Set<String> warnings = result.getWarnings(); // Any warnings during conversion

        Path path = Paths.get("C:\\My\\My Documents\\AMIS\\myarticle.html");
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write(html);
        }
    }
}

Remark about empty paragraphs:
DocumentConverter preserveEmptyParagraphs(): by default, empty paragraphs are ignored. Call this to preserve empty paragraphs in the output.
[https://github.com/mwilliamson/java-mammoth#documentconverter]

With the current Java code, when I run it, the output is (with the focus on a part related to style “Code”):

With the current Java code, when I run it, the output is (with the focus on a part related to style “Output”):

So now it’s time to conclude this article. I shared the steps I took to further automate the manual work I still had to do every time I wrote a blog article. My goal is of course to minimize the differences between the manually changed HTML from WordPress and the HTML converter output, when using the Word document of a previous blog article as input.
In this article I focused on text color, text font and using custom style maps for my Word styles “Code”, “Heading” and “Output”.

In part 3 of this article I will focus on unordered HTML lists (<ul> and <li>), HTML tables (<table>), HTML images (<img>) and hyperlinks.
