Apache NiFi: Having fun with Jolt transformations

Maarten Smeets

Jolt is a Java library which can be used to transform JSON to JSON. A Jolt transformation specification itself is also a JSON file. You can use it in products such as Apache NiFi and Apache Camel. In this blog post I’ll describe my first experiences with Jolt transformations. 

For me personally Jolt transformations are not intuitive and not that powerful when for example comparing to the capabilities of XSLT for transforming XML. It is available in Apache NiFi though and can be used without the ‘execute code’ permission. Due to an apparent lack of a suitable alternative, I decided to use it to transform JSON files. I’m a beginner with Jolt transformations so I might misunderstand basic Jolt concepts which can cause suggested solutions to be overly complex or in other ways suboptimal.

Introduction

JSON transformations

Transforming JSON can be done in multiple ways. For example programmatically, by using JSONPath expressions or by using a templating solution like Select JSON. When you need to transform a complex JSON, it can be helpful to have a file containing solely the transformation logic. This creates a separation of concerns and allows for optimization of the transformation logic/library. An example of a JSON transformation library which uses a specific transformation file is Jolt. Similar to XSLT which is an XML to transform XML to XML, Jolt is a JSON to transform JSON to JSON.

A simple example of a Jolt transformation

Source and target

When you’ve worked with transformation solutions like XSLT, you might be confused when you start to work with Jolt transformations. For example, XSLT transformations describe the target structure and how it is created from the source structure. Jolt transformations on the other hand describe the source structure and you need to indicate where in the target structure those source parts should go. In the target part of the transformation, you can refer to what is selected in the source part and you have some options to go up levels in the JSON tree and reference element names or values. For certain Jolt operations (more on that later) you can use a limited set of functions which cannot be nested and which are not very configurable.

With XSLT you can take a look at the transformation and most of the time you can more or less understand what is going on. You even have graphical mappers available for XSLT which make it even easier to drag a source to a target field. With Jolt transformations, this is more complex. It seems unlikely there will ever be a GUI for it since it is less suitable for graphical representation. 

An additional challenge is that once you have used a certain source value to map it to a target value, you cannot easily use that same source value again in a different part of the transformation. A workaround is to copy the source to several temporary structures and use the temporary structures for partial transformations if they require the use of the same values. 

Chaining

When using XSLT, functionality might be split using templates for specific parts of the transformation but usually the input is processed once and you obtain your output. When using Jolt transformations, it is usual to execute multiple transformations in a chain. The output of the first transformation is input for the next.

Operations

Jolt can execute different operations; shift, default, remove, sort, cardinality, modify-default-beta, modify-overwrite-beta. Shift is what you use to map source fields to target fields. default allows you to add new keys to the target JSON to for example assign defaults. Using remove you can remove elements. Using sort you can sort entire messages (every array, and key values of objects). Sorting is limited to alphabetic ascending and you cannot specify what needs to be sorted. Modify-default-beta will assign a value to a field if it does not exist, modify-overwrite-beta will overwrite any value even if the field already exists. The modify operations can use a limited set of functions which cannot be nested such as toUpper. These are relatively limited.

Functionality

XSLT is extensive and even the specification already contains many functions to process data. Jolt is a lot more limited and only contains a small set of functions you have to work with, although you can extend it with custom Java transformations. The out of the box Jolt transformation functions are suitable for structural changes to a JSON file but not so much for complex logic such as calculations.

When to use Jolt transformations in NiFi?

JSONPath does not suffice

Suppose you have a source JSON file and you need to put specific parts of that JSON file into attributes to base logic on but you are unable to get to those elements easily using JSONPath expressions. Using Jolt transformations you can prepare your JSON so you are able to get to those elements more easily.

You need your JSON in a different form

Suppose you receive input from a source and need to create a message to call an API to retrieve additional information based on that input? Using a transformation you can create JSON in the correct format.

Keep information in the flowfile content and not in attributes

You can create flowfile attributes which contain information on which routing can take place, for example from JSONPath expressions. Suppose your source JSON contains a lot of information which could potentially be used for routing. You might not want to individually assign each field to an attribute but you might want to expand your source flowfile with those attributes in a suitable form to allow easy routing based on them.

There are certain processors which can remove attributes as a byproduct of what it does, such as the MergeContent processor. In such a case, saving relevant information in the flowfile content might be a way to not lose it.

How to use Jolt transformations in NiFi

There are 2 processors in NiFi which allow the usage of Jolt transformations; JoltTransformJSON and JoltTransformRecord. The JoltTransformRecord uses a record reader and a record writer while the JoltTransformJSON expects to find a JSON flowfile content to work with. In both cases the output ends up in the flowfile contents. The configuration of the transformation is the same.

You have 2 options. Specify a Jolt Specification by pasting the specification in text into the property. This has no external dependencies. The second option allows you to refer to a custom Java class and additional modules on the NiFi filesystem to provide a transformation. See for example here. This requires you to place the relevant files on the NiFi server, which might be challenging in CI/CD setups to automate. When however you want to do more complex transformations, a custom transformation can be considered.

Usually you would want to use the Chain Jolt transformation DSL. Jolt is very limited in what it can do and even little things like a key value lookup can be challenging. By chaining transformations, you can do complex things in small steps.

An example

For example we have the following input sample message:

{
  "eventInstanceId": "a3f639bd-62e1-fe0c-9036-460fdae17abc",
  "eventName": "ObjectStateUpdated",
  "eventTime": "2021-04-16T11:41:27Z",
  "eventSource": "SourceSystem",
  "eventSubject": {
    "objectIdentifier": "ABC"
  },
  "data": {
    "objectIdentifier": "ABC",
    "setpoints": [
      {
        "from": "2021-04-16T10:00:00Z",
        "to": "2021-04-16T10:15:00Z",
        "state": "ACTIVE"
      },
      {
        "from": "2021-04-16T10:15:00Z",
        "to": "2021-04-16T10:30:00Z",
        "state": "ACTIVE"
      },
      {
        "from": "2021-04-16T10:30:00Z",
        "to": "2021-04-16T10:45:00Z",
        "state": "INACTIVE"
      },
      {
        "from": "2021-04-16T10:15:00Z",
        "to": "2021-04-16T10:30:00Z",
        "state": "INACTIVE"
      }
    ]
  }
}

We want to validate this message. First we want to know the first from datetime and the last from datetime so we can check the complete duration which is covered in the message. In addition, we want to know if there are no 2 setpoints which cover the same period (in this example, this is the case). This seems like it would not be difficult to do, however consider that using scripts to achieve this in NiFi requires the ‘execute code’ permission which basically gives you access to everything the NiFi user can do. This more or less bypasses the fine grained access model NiFi offers and is a security risk.

A challenges in the above validation is that the setpoints are not sorted and we have no easy way to sort the setpoints based on the from time.

I created the following Jolt transformation chain to achieve the above;

[{
        "operation": "shift",
        "spec": {
            "*": {
                "@": ["tmp1.&", "tmp2.&"]
            }
        }
    }, {
        "operation": "shift",
        "spec": {
            "tmp1": {
                "data": {
                    "setpoints": {
                        "*": {
                            "@from": "newsetpoints.@from.from",
                            "@to": "newsetpoints.@from.to",
                            "@state": "newsetpoints.@from.state"
                        }
                    }
                }
            },
            "tmp2": "tmp2"
        }
    }, {
        "operation": "sort",
        "spec": {
            "*": ""
        }
    }, {
        "operation": "shift",
        "spec": {
            "tmp2": "tmp2",
            "newsetpoints": {
                "*": {
                    "to": "orderedsetpoints[#2].to",
                    "from": "orderedsetpoints[#2].from",
                    "state": "orderedsetpoints[#2].state"
                }
            }
        }
    }, {
        "operation": "remove",
        "spec": {
            "tmp2": {
                "data": {
                    "setpoints": ""
                }
            }
        }
    }, {
        "operation": "shift",
        "spec": {
            "tmp2": {
                "*": "&"
            },
            "orderedsetpoints": {
                "*": "data.setpoints"
            }
        }
    }
]

What does this Jolt transformation chain do? First I create to temporary copies of my message; tmp1 and tmp2. tmp2 will eventually become my output and I use tmp1 to sort the setpoints. I create a setpoints object which has the from time as the key and the rest of the fields as an object under that key (newsetpoints). This way I can sort the message and the keys will be sorted. After having sorted the keys, I transform the setpoints to an array again (orderedsetpoints). Next I remove the original setpoints. In the last step I create the output by combining tmp2 and the orderedsetpoints. The output is;

{
  "data" : {
    "objectIdentifier" : "ABC",
    "setpoints" : [ {
      "to" : "2021-04-16T10:15:00Z",
      "from" : "2021-04-16T10:00:00Z",
      "state" : "ACTIVE"
    }, {
      "to" : [ "2021-04-16T10:30:00Z", "2021-04-16T10:30:00Z" ],
      "from" : [ "2021-04-16T10:15:00Z", "2021-04-16T10:15:00Z" ],
      "state" : [ "ACTIVE", "INACTIVE" ]
    }, {
      "to" : "2021-04-16T10:45:00Z",
      "from" : "2021-04-16T10:30:00Z",
      "state" : "INACTIVE"
    } ]
  },
  "eventInstanceId" : "a3f639bd-62e1-fe0c-9036-460fdae17abc",
  "eventName" : "ObjectStateUpdated",
  "eventSource" : "SourceSystem",
  "eventSubject" : {
    "objectIdentifier" : "ABC"
  },
  "eventTime" : "2021-04-16T11:41:27Z"
}

As you can see, the setpoints are sorted based on the from field. Now it is easy to pick the first and last element of the array to get the first from datetime and last to datetime. Also it is now easy to identify the setpoints which have the same from field since the values of the fields have become arrays. A thing I don’t like about this transformation however is that all the keys in the message have been sorted. This can give issues with for example AVRO schema. I could explicitly put them in the correct order by adding another transformation to the chain if I wanted to.

Some tips

  • Jolt transformations are not streaming transformations. JSON files are kept in memory. This can give challenges with very large JSON. In those cases first split it up and process individual parts.
  • Try out your transformations online first. For example here. This is a lot quicker than changing the transformation in NiFi every time and trying it out.
  • Don’t try to do everything at once but split up the transformation in small steps since even the small steps can already become difficult to create.
  • Some things are difficult (maybe even impossible when not using custom transformers). In those cases creativity is required to make things work.
    • Key/value lookups are not straightforward. You need to use multiple sources to achieve this; the field which contains a key to be looked up and a reference object containing a value. These sources need to be combined and used to assign to a target value. It can probably be done but I did not manage to achieve it or find a simple example for it.
    • Sorting specific elements in a message. The sort operation is not specific and sorts everything (array contents, keys in objects). The modify-default-beta and modify-overwrite-beta allow using a sort function (see here). This function can sort arrays but cannot easily sort for example objects based on a key inside the object. The sort function also does not recursively sort. Thus you only have an all or nothing sort available or a sort specific to arrays of values.
  • You can only use an input source with many nested objects once in the same (shift) Jolt transformation. If you want to use your input data multiple times to do very different things with, you can also copy it to multiple elements and use one of those elements per transformation.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Next Post

Creating a minimal container image for a Go application

How hard can it be – to create the smallest possible container image to run a Go application? It is not hard to create a container image that contains an binary executable file – and that is what a Go application turns into during the build process. However, that does […]
%d bloggers like this: