Merge AVRO schema and generate random data or Java classes

Maarten Smeets

Previously I wrote about generating random data which conforms to an AVRO schema (here). In a recent use case, I encountered a situation where several separate schema files contained different AVRO types, and the message used types from those different files. To generate random data, I first needed to merge the different files into a single schema. In addition, I wanted to generate Java classes for the complete message, which required importing dependent types in the pom.xml. In this blog post I’ll describe how I did that.

Merge the schema

There are several solutions available for merging schemas. Several can be found here. I found the approach below an easy way to load the AVRO schema files into a single Map. The Apache AVRO libraries and related projects contain AvroStorageUtils.mergeSchema, but this seems to have some serious limitations. The Kite SDK contains a SchemaUtil class which does a much better job, so I decided to use that. In the example below I merge two schemas.

First I added the following dependencies to my pom.xml:

<dependency>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-data-core</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-annotations</artifactId>
    <version>2.13.1</version>
</dependency>

Below I used a list of schema files to merge (Arrays.asList). When the number of files becomes larger, you will probably want to populate this list from a listing of the directory which contains the schema files. After loading the schemas, I used org.kitesdk.data.spi.SchemaUtil to merge them.

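import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.kitesdk.data.spi.SchemaUtil;

// AvroTest is the class containing this code; its class loader finds the schema files on the classpath
ClassLoader classLoader = AvroTest.class.getClassLoader();

List<String> schemaResourceNames = Arrays.asList("file1.avsc", "file2.avsc");
// A single parser remembers the named types of every file it has parsed
Schema.Parser parser = new Schema.Parser();

for (String schemaResourceName : schemaResourceNames) {
    try (InputStream schemaInputStream = classLoader.getResourceAsStream(schemaResourceName)) {
        if (schemaInputStream == null) {
            throw new RuntimeException("Resource not found " + schemaResourceName);
        }
        parser.parse(schemaInputStream);
    }
}

// parser.getTypes() is a Map of type name to Schema; merge all of its values
Schema mergedSchema = SchemaUtil.mergeOrUnion(parser.getTypes().values());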
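
Populating the list of schema files from a directory listing, as suggested above, could look roughly like the sketch below. It is only an illustration under a few assumptions: the class name SchemaFileLister and the method listSchemaFiles are hypothetical, and the schema files are assumed to live in a plain directory on the file system rather than being loaded as classpath resources like in the code above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of AVRO or the Kite SDK
public class SchemaFileLister {

    // Returns the names of all *.avsc files directly inside schemaDir,
    // sorted alphabetically so the order is the same on every run.
    public static List<String> listSchemaFiles(Path schemaDir) throws IOException {
        try (Stream<Path> entries = Files.list(schemaDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".avsc"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

The returned list can take the place of the Arrays.asList call. Keep in mind that alphabetical order is not necessarily the order the parser needs: depending on the AVRO version, a file which defines a type may have to be parsed before the files which use that type.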

Generate random data

The merged schema contains all the different types. If you pass the merged schema to the generator without selecting a specific type, it will generate random data for a randomly chosen type. This is usually not what you want, so you first need to look up the schema of your message type in the merged schema.

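// Generate random JSON which conforms to the AVRO schema of one specific type.
// getTypes() is only valid when mergedSchema is a union.
for (Schema myType : mergedSchema.getTypes()) {
    if (myType.getName().equals("YOURMESSAGETYPE")) {
        // The second argument is the number of records to generate
        RandomData rd = new RandomData(myType, 1);
        System.out.println(rd.iterator().next());
        break;
    }
}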

Generate Java classes

Since a single AVRO file does not contain the complete AVRO definition of what needs to be generated, the generation of Java classes will fail without additional actions. To fix this, you can import additional types in your pom.xml file. The snippet below shows the avro-maven-plugin configuration from my pom.xml file.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>${avro.version}</version>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                <fieldVisibility>PRIVATE</fieldVisibility>
                <imports>
                    <import>${project.basedir}/src/main/resources/file1.avsc</import>
                </imports>
                <includes>
                    <include>**/*.avsc</include>
                </includes>
            </configuration>
            <executions>
                <execution>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

file1.avsc defines a type which is used in another schema in the resources folder. Without the import line, the generation will fail. You can also specify a directory here, which is more flexible and does not require you to change your pom.xml file for every schema that is added. A limitation is that imported files cannot reference each other (see the source code of the avro-maven-plugin here and look at the note above the ‘imports’ parameter). This limits code generation from nested AVRO schemas. If you do need to generate Java code from nested AVRO schemas, you can look up the schema of your message type in the merged schema generated by SchemaUtil.mergeOrUnion, as shown above. Next, call the org.apache.avro.Schema toString method on that schema, save the result as an .avsc file and use that file for Java class generation.
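
Saving the schema as a file is the only step of this workaround which is not shown above. A minimal sketch, assuming you pass in the String returned by the schema's toString method (toString(true) gives pretty-printed JSON); the class name AvscWriter and the method writeAvsc are hypothetical names used for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of AVRO or the Kite SDK
public class AvscWriter {

    // Writes the schema JSON (for example the result of Schema.toString(true))
    // to targetFile as UTF-8, creating parent directories when needed.
    public static Path writeAvsc(String schemaJson, Path targetFile) throws IOException {
        Path parent = targetFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(targetFile, schemaJson.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the loop from the random data section, this could be called as AvscWriter.writeAvsc(myType.toString(true), Paths.get("src/main/resources/merged/message.avsc")), where the target path is only an example. Writing the file to a directory other than the one containing the original schema files avoids the same types being defined twice during class generation.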
