Apache Apex Meetup
Big Data File Ingestion using Apex
Sandeep Deshmukh, PhD
sandeep@apache.org
Contents
● What is Big Data Ingestion
● Challenges in File copy @ scale
● Ingestion using Apex
○ Input
○ Output
○ Key features
● Demo
● Summary
Apex: Application Programming Model
Directed Acyclic Graph (DAG)
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations, and emits one or more output streams
• Each Operator is your custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: tuples flow on streams through a chain of operators, labeled Filtered Stream, Enriched Stream, and Output Stream]
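The model above can be sketched in a few lines. This is plain Python, not the Apex Java API: each operator is written as a generator that consumes one stream of tuples and emits another. The operator names `filter_op` and `enrich_op`, the tuple fields, and the sample data are assumptions made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]  # hypothetical tuple type: one record on a stream


def filter_op(stream: Iterable[Record]) -> Iterator[Record]:
    """Operator: pass on only non-empty files (produces the filtered stream)."""
    for record in stream:
        if record["size"] > 0:
            yield record


def enrich_op(stream: Iterable[Record]) -> Iterator[Record]:
    """Operator: add a derived field to each tuple (produces the enriched stream)."""
    for record in stream:
        yield {**record, "size_kb": record["size"] / 1024}


def run_dag(source: Iterable[Record]) -> List[Record]:
    """Wire the operators into a linear DAG: source -> filter -> enrich -> output."""
    return list(enrich_op(filter_op(source)))


output_stream = run_dag([
    {"name": "a.txt", "size": 2048},
    {"name": "empty.txt", "size": 0},
])
print(output_stream)
```

In Apex itself the platform runs the operator instances in parallel; the generators here only illustrate how tuples move from one operator to the next.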
What is Ingestion
Data ingestion
● The process of obtaining, importing, and processing data for later use or storage in a database
Big Data Ingestion
● Discovering the data sources
● Importing the data
● Processing data to produce intermediate data
● Sending data out to durable data stores
Challenges in File copy @ scale
● Failure Recovery
● Copying big files in parallel
● Copying large number of small files
● Processing
○ Encryption
○ Compression
○ Compaction
DAG - Components
[Figure: pipeline stages Read Data, Process, Write Data]
DAG - Read Data : Requirements
● Independent of input file system type
○ HDFS
○ S3
○ FTP
○ NFS
● Scale to large data
○ Large files
○ Large number of small files
● Configurable Bandwidth usage
DAG - Read Data
● Operator 1: File Splitter
○ Purpose: Break the whole task into smaller sub-tasks
○ Steps: Connect to input and scan for available data; assign smaller tasks for downstream operators
● Operator 2: Block Reader
○ Purpose: Work on the sub-tasks given by Operator 1, one at a time
○ Steps: Connect to source and read data as smaller tasks one-by-one; pass on the read data to downstream operator
● Operator 3: File Writer
○ Purpose: Write File
○ Steps: Save the data read by Operator 2
DAG - Simple Design
[Figure: File Splitter sends BlockMetaData to Block Reader, which sends Data to File Writer]
Challenges
● A single file cannot be read in parallel
○ Multiple Block Readers and File Writers can handle multiple files in parallel, but a single file can't be read by two Block Readers
● Failure recovery is hard
DAG - Read Data
● Operator 1: File Splitter
○ Purpose: Break the whole task into smaller sub-tasks
○ Steps: Connect to input and scan for available data; assign smaller tasks for downstream operators
● Operator 2: Block Reader
○ Purpose: Work on the sub-tasks given by Operator 1, one at a time
○ Steps: Connect to source and read data as smaller tasks one-by-one; pass on the read data to downstream operator
● Operator 3: File Writer
○ Purpose: Write File
○ Steps: Save the data read by Operator 2
● Operator 4: Synchronizer
○ Purpose: Check for completeness
○ Steps: Make sure all smaller tasks for a file are completed by upstream operators & send file merger trigger
DAG - Input
[Figure: File Splitter sends BlockMetaData to parallel Block Readers; each Block Reader sends Data and BlockMetaData to a Block Writer; the Block Writers send BlockMetaData, and the File Splitter sends FileMetaData, to the Synchronizer]
Input DAG - FileSplitter
● Scans input files/directories
● Creates smaller sub-tasks
● Outputs: FileMetaData, BlockMetaData
● Parameters
○ input files/directories to copy data from
○ recursive - Yes / No
○ polling - Yes / No
○ bandwidth - MB / sec
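As a rough illustration of the scan step and the `recursive` parameter, here is a minimal sketch in plain Python against a local directory. It is not the Apex operator: the function name `scan_input` is hypothetical, and polling and bandwidth control are left out.

```python
from pathlib import Path
from typing import List


def scan_input(root: str, recursive: bool) -> List[str]:
    """Return the relative paths of all files under root, in sorted order."""
    base = Path(root)
    # rglob descends into sub-directories; glob looks at the top level only.
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(p.relative_to(base).as_posix() for p in candidates if p.is_file())
```

With polling enabled, a scanner along these lines would be called repeatedly and would need to remember which paths it has already reported.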
Input DAG - FileSplitter
● For each file in the directory:
■ [output] FileMetaData - file information
● Name: InputFile.txt
● Size: 1073741824 (1GB)
● Relative path: input/data/InputFile.txt
● Block IDs into which the file is virtually split: [0,1,2,3,4,5,6,7,8]
■ [output] BlockMetaData - block information
● BlockID: 1
● Start position: 134217728
● End position: 268435456 (block length: 128MB)
● File URL: hdfs://node18:8020/user/sandeep/input
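The virtual split can be sketched as simple offset arithmetic. This is a minimal Python illustration, not the Apex implementation. It assumes zero-based block IDs and an exclusive end offset, which is consistent with the example block (ID 1, start 134217728, end 268435456). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockMetaData:
    block_id: int
    start: int      # offset of the first byte in the block
    end: int        # offset just past the last byte (assumed exclusive)
    file_url: str


def split_into_blocks(file_url: str, file_size: int, block_size: int) -> List[BlockMetaData]:
    """Virtually split a file into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    for block_id, start in enumerate(range(0, file_size, block_size)):
        end = min(start + block_size, file_size)
        blocks.append(BlockMetaData(block_id, start, end, file_url))
    return blocks


blocks = split_into_blocks(
    "hdfs://node18:8020/user/sandeep/input", 1073741824, 134217728)
print(blocks[1])
```

No data is read at this stage. Only offsets are computed, so the blocks can be handed to several readers at once.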
Input DAG - BlockReader
● Reads a block from the remote location and emits Data
● Input: BlockMetaData
● Outputs: Data, BlockMetaData
● Parameters
○ Input URL, e.g. hdfs://node18:8020/user/hduser/input
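Reading one block amounts to a seek followed by a bounded read. The sketch below uses a local file as a stand-in for the remote source, and the function name `read_block` is hypothetical. A real reader would go through the file system client for HDFS, S3, FTP, or NFS instead of `open`.

```python
def read_block(path: str, start: int, end: int) -> bytes:
    """Read the bytes in [start, end) from a file, where end is exclusive."""
    with open(path, "rb") as f:
        f.seek(start)                 # jump straight to the block's first byte
        return f.read(end - start)    # read at most one block's worth of data
```

Because each call touches only its own byte range, several readers can work on different blocks of the same file independently.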
Input DAG - BlockWriter
● Writes block data on local HDFS
● Inputs: BlockMetaData, Data
● Output: BlockMetaData
● Saves data in apps directory
Input DAG - Synchronizer
● Tracks blocks for each file and sends a trigger once all the blocks for that file are available
● Inputs: FileMetaData, BlockMetaData
● Output: FileMetaData
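The bookkeeping behind the Synchronizer can be sketched as two sets per file: the block IDs the file consists of and the block IDs seen so far. This is an illustrative Python sketch, not the Apex operator. The class layout and method names are assumptions, and checkpointing for failure recovery is not shown.

```python
from typing import Dict, List, Optional, Set


class Synchronizer:
    def __init__(self) -> None:
        self._expected: Dict[str, Set[int]] = {}  # file -> block IDs it consists of
        self._written: Dict[str, Set[int]] = {}   # file -> block IDs seen so far

    def on_file_metadata(self, file_name: str, block_ids: List[int]) -> Optional[str]:
        """Record which blocks make up a file; return the file name if it is complete."""
        self._expected[file_name] = set(block_ids)
        return self._check(file_name)

    def on_block_written(self, file_name: str, block_id: int) -> Optional[str]:
        """Record one written block; return the file name if the file is now complete."""
        self._written.setdefault(file_name, set()).add(block_id)
        return self._check(file_name)

    def _check(self, file_name: str) -> Optional[str]:
        expected = self._expected.get(file_name)
        if expected is None or not expected <= self._written.get(file_name, set()):
            return None
        # All blocks are available: forget the file and emit the trigger once.
        del self._expected[file_name]
        self._written.pop(file_name, None)
        return file_name
```

The two inputs arrive on separate streams, so the sketch accepts block notifications before the file metadata as well as after it.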
DAG - Input
[Figure: File Splitter sends BlockMetaData to Block Reader; Block Reader sends Data and BlockMetaData to Block Writer; Block Writer sends BlockMetaData, and File Splitter sends FileMetaData, to the Synchronizer, which emits FileMetaData]
Output DAG - FileMerger
● Merges blocks to recreate the original file
● Input: FileMetaData
● Parameters
○ Output directory to copy data to
○ Overwrite - Yes/No
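A plain merge concatenates the block files in block-ID order. The sketch below works on local files and is not the Apex FileMerger. The function name, the `.tmp` suffix, and the choice to raise an error when overwrite is off are assumptions for illustration.

```python
import os
import shutil
from typing import List


def merge_blocks(block_paths: List[str], output_path: str, overwrite: bool) -> None:
    """Concatenate block files, given in block-ID order, into output_path."""
    if os.path.exists(output_path) and not overwrite:
        raise FileExistsError(output_path)
    tmp_path = output_path + ".tmp"  # hypothetical temp-name convention
    with open(tmp_path, "wb") as out:
        for block_path in block_paths:
            with open(block_path, "rb") as block:
                shutil.copyfileobj(block, out)  # stream the block, no full load in memory
    os.replace(tmp_path, output_path)  # publish the finished file in one step
```

Writing to a temporary name first means a failure midway never leaves a half-written file at the final path.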
Output DAG - FileMerger - FastMerge Magic
[Figure: the different blocks B1, B2, ..., Bn, replicated across DataNode1 to DataNode4, map to blocks 1, 2, ..., n of the File]
Output DAG - FileMerger - FastMerge Magic
● Same replication factor
● On same HDFS cluster
● Same block size for all files
● Size of all files (except last) : multiple of block size
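The size condition is plain arithmetic. The sketch below reads "files" in that bullet as the per-block files being merged and treats the bullets as preconditions, which is an interpretation of the slide. It does not model the replication factor, cluster, or block size checks, and the function name is hypothetical.

```python
from typing import List


def sizes_allow_fast_merge(part_sizes: List[int], block_size: int) -> bool:
    """True if every part except the last is a whole multiple of block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return all(size % block_size == 0 for size in part_sizes[:-1])
```

For example, parts of 128, 256, and 5 bytes pass for a block size of 128, while 128, 100, and 5 do not.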
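DAG - Complete
[Figure: File Splitter, Block Reader, Block Writer, Synchronizer, and FileMerger in sequence; BlockMetaData and Data flow between the first three operators, BlockMetaData and FileMetaData flow into the Synchronizer, and FileMetaData flows from the Synchronizer to the FileMerger]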
Other features: Optional processing
● Compression
○ Gzip and LZO
● Encryption
○ PKI & AES
● Compaction
○ Size based
● Dedup
● Dimension Computation & Aggregation
Summary
● Easy to use
○ Configure and run
● Unified for batch and continuous ingestion
● Handles
○ Large files
○ Large number of small files
Resources
• Apache Apex website - http://apex.incubator.apache.org/
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Startup Program - Free Enterprise License for startups, Universities, Non-Profits
