Introduction to Apache Beam Using Java: A Beginner-Friendly Guide to Unified Data Processing
Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines. It allows developers to write data processing jobs that can run on various execution engines like Apache Flink, Apache Spark, and Google Cloud Dataflow. In this article, we’ll explore Apache Beam through the lens of Java, its most mature and widely supported SDK.
📌 What is Apache Beam?
Apache Beam provides a high-level programming model that abstracts away the complexities of data processing engines. Instead of writing platform-specific code, you can write your pipeline once and execute it anywhere supported by the Beam runners.
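For example, the runner is just a pipeline option. Here is a minimal sketch (the class name is ours, but PipelineOptionsFactory and the --runner flag are standard Beam):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PortablePipeline {
  public static void main(String[] args) {
    // Pass e.g. --runner=FlinkRunner, --runner=SparkRunner, or
    // --runner=DataflowRunner at launch time; the pipeline code itself
    // does not change. The matching beam-runners-* artifact must be
    // on the classpath.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... apply transforms here ...
    pipeline.run().waitUntilFinish();
  }
}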
🔧 Core Features:
- Unified API for batch and streaming
- Portable across multiple runners (Flink, Spark, Dataflow)
- Support for windowing, event time, triggers, and watermarks (see the windowing sketch just after this list)
- SDKs available in Java, Python, and Go (Java being the most complete)
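To make the windowing bullet concrete, here is a small self-contained sketch of our own that assigns timestamped events to fixed one-minute windows and counts the elements in each window:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class WindowedCounts {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        // A tiny bounded stand-in for an event stream; each element
        // carries an explicit event-time timestamp (milliseconds).
        .apply(Create.timestamped(
            TimestampedValue.of("click", new Instant(0L)),
            TimestampedValue.of("click", new Instant(30_000L)),
            TimestampedValue.of("click", new Instant(90_000L))))
        // Assign elements to fixed one-minute windows by event time.
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        // Count per window: 2 in the first minute, 1 in the second.
        .apply(Count.<String>globally().withoutDefaults());
    pipeline.run().waitUntilFinish();
  }
}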
☕ Why Use Java for Apache Beam?
The Java SDK is Apache Beam's original and most complete SDK: it typically receives new features first and has the broadest runner and I/O connector support.
Benefits:
- Mature API and documentation
- Better performance tuning options
- Widely used in enterprise systems
- Compatible with Maven and Gradle for dependency management (a Gradle snippet follows below)
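If you prefer Gradle over the Maven setup shown in Step 1 below, the equivalent dependency block should look roughly like this (same artifact coordinates and version as the pom.xml):

dependencies {
    implementation 'org.apache.beam:beam-sdks-java-core:2.52.0'
    runtimeOnly 'org.apache.beam:beam-runners-direct-java:2.52.0'
}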
🚀 Getting Started with Apache Beam (Java)
📦 Step 1: Set Up the Maven Project
Add the following dependencies to your pom.xml:
<dependencies>
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>2.52.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>2.52.0</version>
  </dependency>
</dependencies>
💡 Tip: The Direct Runner is useful for local development and testing.
🛠️ Step 2: Create a Simple Beam Pipeline
Here’s a basic example that reads lines from a text file, upper-cases them, and writes the result to another text file:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SimplePipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Read each line of input.txt into a PCollection of strings.
    PCollection<String> lines = pipeline.apply(
        "ReadLines", TextIO.read().from("input.txt"));

    // Upper-case every line.
    PCollection<String> output = lines.apply(
        MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.toUpperCase()));

    // Write the result to a single output.txt file.
    output.apply(
        TextIO.write().to("output").withSuffix(".txt").withoutSharding());

    pipeline.run().waitUntilFinish();
  }
}
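As written, the pipeline expects an input.txt file in the working directory and, because of withoutSharding(), writes a single output.txt file instead of the usual sharded output-00000-of-0000N files. With only the Direct Runner dependency on the classpath, Pipeline.create() runs locally by default; if you have the exec-maven-plugin configured, something like mvn compile exec:java -Dexec.mainClass=SimplePipeline will launch it.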
⏱️ Batch vs Streaming in Beam
In Beam, there’s no need to choose between batch and streaming early on. The API is unified. You define your pipeline, then configure it for either batch or streaming at runtime.
- Batch: Processes bounded datasets
- Streaming: Processes unbounded data in near real-time
Example:
// The streaming flag lives on the StreamingOptions interface,
// not on the base PipelineOptions type.
StreamingOptions options =
    PipelineOptionsFactory.create().as(StreamingOptions.class);
options.setStreaming(true); // set to false (the default) for batch
Pipeline pipeline = Pipeline.create(options);
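In practice, a pipeline becomes streaming when it consumes an unbounded source. As a self-contained sketch of our own, GenerateSequence from the core SDK produces an unbounded PCollection when given a rate:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.joda.time.Duration;

public class StreamingSketch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);

    // An unbounded source: emits one Long per second, indefinitely.
    pipeline.apply(GenerateSequence.from(0)
        .withRate(1, Duration.standardSeconds(1)));
    // Windowing and aggregation transforms would go here.

    pipeline.run(); // a streaming job runs until cancelled
  }
}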
🔍 Real-World Use Cases
- ETL Pipelines: Ingest, transform, and load data into warehouses
- Log Processing: Stream and analyze server logs in near-real-time (see the sketch after this list)
- IoT Analytics: Process telemetry from connected devices
- Clickstream Analysis: Understand user behavior across platforms
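As a taste of the log-processing case, here is a short sketch of our own; the server.log path and the "LEVEL message" line format are made-up assumptions, while the transforms are standard core-SDK ones:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class LogLevelCounts {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        // Hypothetical input: lines such as "ERROR Disk full".
        .apply(TextIO.read().from("server.log"))
        // Take the first token of each line as its log level.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.split(" ")[0]))
        // Count occurrences of each level, e.g. KV("ERROR", 42L).
        .apply(Count.perElement())
        // Render each count as a line of text.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("log-level-counts").withSuffix(".txt"));
    pipeline.run().waitUntilFinish();
  }
}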
🌐 Learn More
- Official Apache Beam Documentation
- Beam Examples on GitHub
- Google Cloud Dataflow (Managed Beam Runner)
- Beam Programming Guide (Java)
Optional Visual: Pipeline Architecture
The diagram below illustrates the architecture of an Apache Beam pipeline. It highlights the core flow from data input, through transformations, and finally to output, showcasing Beam’s unified model for both batch and streaming processing. At the bottom, the supported runners—such as Flink, Spark, and Google Cloud Dataflow—demonstrate the portability of Beam pipelines across different execution environments.