
Apache Avro Serialize Enum Values Example

Apache Avro is a popular data serialization system that supports rich data structures, compact encoding, and schema evolution. When working with Avro in Java, especially when using enums, handling null values properly is critical to maintaining forward and backward compatibility. Let us delve into understanding how Java and Apache Avro work together to serialize enum values effectively.

1. Understanding Avro and Avro Enum Serialization

Apache Avro is a compact, fast, binary data serialization system commonly used with big data tools such as Apache Hadoop. It allows for rich data structures defined using JSON-based schemas and supports schema evolution, making it ideal for storing structured data.

In Java, Avro serialization is widely used to persist and transmit data efficiently across systems. One of the powerful features of Avro is its support for enums. In the binary encoding, an enum value is stored compactly as the index of its symbol in the schema, and it must conform to the predefined set of symbols defined there. This allows for safe and consistent use of controlled vocabulary values across different systems.

Moreover, it’s common to allow null values alongside enums to represent optional or missing values. Avro supports this through the use of union types, which can combine null with other types, like enum.

{
  "type": "record",
  "name": "Employee",
  "namespace": "com.example.avro",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "role",
      "type": ["null", { "type": "enum", "name": "Role", "symbols": ["ENGINEER", "MANAGER", "HR"] }],
      "default": null
    }
  ]
}

In the schema above:

  • The name field is a simple string.
  • The role field uses a union of null and an enum named Role.
  • Valid enum values include: ENGINEER, MANAGER, and HR.
  • The default value is set to null, meaning the field is optional during serialization.

Using this setup, Java applications can serialize and deserialize enum values in Avro format using the Avro Java API, which handles encoding and decoding under the hood.
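When generated classes are not available, the same schema can also be used through Avro's generic API. The snippet below is a minimal sketch rather than part of the original example: it assumes the schema above has been saved as employee.avsc, picks the enum branch out of the union, and encodes a single record to an in-memory byte array.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.File;

public class GenericEnumSketch {

    public static void main(String[] args) throws Exception {
        // Parse the schema shown above (assumed to be saved as employee.avsc)
        Schema schema = new Schema.Parser().parse(new File("employee.avsc"));

        // The "role" field is a union; its second branch is the Role enum
        Schema roleSchema = schema.getField("role").schema().getTypes().get(1);

        // Build a generic record and set the enum value via GenericData.EnumSymbol
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("name", "Alice");
        employee.put("role", new GenericData.EnumSymbol(roleSchema, "ENGINEER"));

        // Encode the record to Avro binary in memory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(employee, encoder);
        encoder.flush();

        System.out.println("Encoded " + out.size() + " bytes");
    }
}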

1.1 Schema evolution

Avro supports schema evolution, which allows readers to consume data written with older versions of a schema. When working with enums, however, certain rules apply:

  • Adding enum symbols is backward-compatible.
  • Removing enum symbols is not backward-compatible.
  • Reordering symbols is risky, because enum values are encoded by their position in the symbol list; a reader that does not resolve the data against the writer’s schema can map values to the wrong symbols.

For example, when adding a new enum symbol, the updated schema might look like this:

{ "type": "record", "name": "Employee", "fields": [ { "name": "name", "type": "string" }, { "name": "role", "type": ["null", { "type": "enum", "name": "Role", "symbols": ["ENGINEER", "MANAGER", "HR", "INTERN"] }], "default": null } ] }

This updated schema remains backward-compatible: a reader that uses it can still consume data written with the original schema. However, an older consumer that does not recognize the newly added INTERN symbol may fail to deserialize newer data unless a schema resolution strategy is in place. Likewise, if a writer uses a symbol such as HR that the reader’s schema has removed, the data becomes unreadable unless fallback logic or default values are provided.
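Since Avro 1.9, the schema language also lets an enum declare its own default symbol, which a reader falls back to when it encounters an unknown value. Incompatibilities between two schema versions can additionally be detected programmatically. The following is a minimal sketch, assuming the two versions are stored in hypothetical files employee-v1.avsc and employee-v2.avsc and an Avro release that ships org.apache.avro.SchemaCompatibility:

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

import java.io.File;

public class CompatibilityCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical file names: the original schema and the evolved schema
        Schema writerSchema = new Schema.Parser().parse(new File("employee-v1.avsc"));
        Schema readerSchema = new Schema.Parser().parse(new File("employee-v2.avsc"));

        // Can data written with writerSchema be read with readerSchema?
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);

        System.out.println("Compatibility: " + result.getType());
    }
}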

2. Code Example

In Java, you typically generate classes from an Avro schema using tools like the Apache Avro Tools command-line utility or through Maven build plugins. This process takes a .avsc schema file and generates corresponding Java classes, which include the records and enums defined in the schema.

This code generation is crucial because it ensures type safety and simplifies serialization and deserialization when working with Avro data. Once the schema is compiled, developers can use the generated classes directly in their applications for reading and writing Avro-encoded data.

For the Employee schema above, code generation produces the classes shown in the following sections.

2.1 How to Generate Java Classes from Avro Schema

2.1.1 Using Avro Tools (Command Line)

java -jar avro-tools-1.11.1.jar compile schema employee.avsc ./output-directory

This will generate Java files (e.g., Employee.java and Role.java) into the specified output directory.

2.1.2 Using Maven Plugin

Add the following plugin configuration to your pom.xml:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.11.1</version>
      <executions>
        <execution>
          <phase>generate-sources</phase>
          <goals>
            <goal>schema</goal>
          </goals>
          <configuration>
            <sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
            <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

Place your schema file (e.g., employee.avsc) in the src/main/avro directory. When you build the project, the Java classes will be auto-generated.
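The generated classes also require the Avro runtime library on the classpath. If it is not already declared, a dependency matching the plugin version is typically added to pom.xml:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.11.1</version>
</dependency>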

2.2 Generated Java Enum

From the Avro schema’s enum definition, a Java enum like the following is generated (simplified here; the generated version also carries the Avro schema definition):

public enum Role {
    ENGINEER, MANAGER, HR;
}

2.3 Generated Java Record

The record definition in the Avro schema is converted into a Java class that extends SpecificRecordBase, which is part of the Avro Java library. In simplified form:

public class Employee extends SpecificRecordBase {
    private CharSequence name;
    private Role role;

    // getters, setters, getSchema(), and a builder are also generated
}
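The generated class additionally exposes a builder, which is often a cleaner way to construct records than calling setters one by one. A brief sketch using the generated Employee and Role:

// Construct an Employee via the generated builder
Employee employee = Employee.newBuilder()
        .setName("Alice")
        .setRole(Role.ENGINEER)   // may also be left unset, since the field is nullable with a null default
        .build();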

2.4 Serializing Enum Values

// AvroEnumExample.java

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import com.example.avro.Employee;
import com.example.avro.Role;

import java.io.File;

public class AvroEnumExample {

    public static void main(String[] args) throws Exception {
        // Create an Employee with null role
        Employee emp1 = new Employee();
        emp1.setName("Alice");
        emp1.setRole(null);

        // Create an Employee with non-null role
        Employee emp2 = new Employee();
        emp2.setName("Bob");
        emp2.setRole(Role.ENGINEER);

        // Serialize to file
        File file = new File("employees.avro");
        SpecificDatumWriter<Employee> writer = new SpecificDatumWriter<>(Employee.class);
        DataFileWriter<Employee> dataFileWriter = new DataFileWriter<>(writer);
        dataFileWriter.create(emp1.getSchema(), file);
        dataFileWriter.append(emp1);
        dataFileWriter.append(emp2);
        dataFileWriter.close();

        // Deserialize from file
        SpecificDatumReader<Employee> reader = new SpecificDatumReader<>(Employee.class);
        DataFileReader<Employee> dataFileReader = new DataFileReader<>(file, reader);

        while (dataFileReader.hasNext()) {
            Employee emp = dataFileReader.next();
            System.out.println("Name: " + emp.getName() + ", Role: " + emp.getRole());
        }
        dataFileReader.close();
    }
}

These generated classes are then used for serializing and deserializing Avro data. With tools like the SpecificDatumWriter and SpecificDatumReader, Java applications can write to or read from binary Avro files using these types safely and efficiently.

2.4.1 Code Example

The AvroEnumExample.java program demonstrates how to use Apache Avro in Java to serialize and deserialize records containing enum fields, including support for null values. It begins by creating two Employee instances — one with a null role and another with a Role.ENGINEER. These records are then serialized into a file named employees.avro using SpecificDatumWriter and DataFileWriter, which handle the writing of strongly-typed Avro data. The schema used during serialization is obtained directly from the Employee instance. After writing, the program reads the same file using SpecificDatumReader and DataFileReader, iterating through the records and printing each employee’s name and role to the console. This example highlights how Avro supports nullable enums via union types, and how generated Java classes can be used to handle Avro data seamlessly in both writing and reading operations.

2.4.2 Code Output

When the AvroEnumExample.java program is executed, it creates and writes two Employee records to an Avro file — one with a null role and the other with the role set to ENGINEER. Upon reading the records back from the file and printing them, the following output is displayed:

Name: Alice, Role: null
Name: Bob, Role: ENGINEER

This output confirms that Apache Avro successfully handles both nullable enum values and valid enum entries during serialization and deserialization, preserving the structure and data integrity defined in the schema.
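For a quick sanity check outside the program, the same avro-tools jar used earlier for code generation can dump the data file as JSON (the exact rendering of the union field depends on the Avro version):

java -jar avro-tools-1.11.1.jar tojson employees.avro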

3. Conclusion

Storing null values in Avro, especially for enums, is straightforward using a union type such as ["null", { "type": "enum", ... }]. Proper serialization logic, combined with careful schema evolution strategies, ensures data integrity across versions.
Enum fields should be declared with sensible defaults and nullable types to avoid compatibility pitfalls. This makes Avro a powerful and flexible format for Java applications dealing with evolving data models.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, and Kubernetes).