Ingestion file copy using apex

Apache Apex Meetup
Big Data File Ingestion using Apex
Sandeep Deshmukh, PhD
sandeep@apache.org

Contents
● What is Big Data Ingestion
● Challenges in File copy @ scale
● Ingestion using Apex
○ Input
○ Output
○ Key features
● Demo
● Summary

Directed Acyclic Graph (DAG)
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations, and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator can have many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
Apex: Application Programming Model
[Figure: a DAG of four Operators connected by streams of Tuples; stream labels: Filtered Stream, Enriched Stream, Output Stream]
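The programming model above can be sketched in a few lines. This is not the Apex API: the `Operator` class, `run_dag`, and the two example operators are hypothetical names, and the sketch only covers a linear chain in which each operator handles one tuple at a time.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class Operator:
    """Hypothetical stand-in for an operator: a function applied to each tuple.

    Returning None means the tuple is dropped (filtered out).
    """

    def __init__(self, name: str, process: Callable[[dict], Optional[dict]]):
        self.name = name
        self.process = process


def run_dag(source: Iterable[dict], operators: List[Operator]) -> Iterator[dict]:
    """Push each tuple of the input stream through a linear chain of operators."""
    for tup in source:
        current: Optional[dict] = tup
        for op in operators:
            current = op.process(current)
            if current is None:
                break  # tuple was filtered out, stop processing it
        if current is not None:
            yield current


# Two example operators: one filters, one enriches.
keep_large = Operator("filter", lambda t: t if t["size"] >= 100 else None)
add_kind = Operator("enrich", lambda t: {**t, "kind": "big"})

input_stream = [{"name": "a", "size": 50}, {"name": "b", "size": 200}]
output_stream = list(run_dag(input_stream, [keep_large, add_kind]))
print(output_stream)
```

A real DAG can branch and merge, and each operator can run as several parallel instances; the chain here only shows how tuples move from one operator to the next.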

What is Ingestion
Data ingestion
● The process of obtaining, importing, and processing data for later use or storage in a database
Big Data Ingestion
● discovering the data sources
● importing the data
● processing data to produce intermediate data
● sending data out to durable data stores

Challenges in File copy @ scale
● Failure Recovery
● Copying big files in parallel
● Copying large number of small files
● Processing
○ Encryption
○ Compression
○ Compaction

DAG - Components
Read Data → Process → Write Data

DAG - Read Data : Requirements
● Independent of input source type
○ HDFS
○ S3
○ FTP
○ NFS
● Scale to large data
○ Large files
○ Large number of small files
● Configurable Bandwidth usage

DAG - Read Data
● Operator 1: File Splitter
○ Purpose: Break the whole task into smaller sub-tasks
○ Steps: Connect to input and scan for available data; assign smaller tasks for downstream operators
● Operator 2: Block Reader
○ Purpose: Work on the sub-tasks given by Operator 1, one at a time
○ Steps: Connect to source and read data as smaller tasks one-by-one; pass on the read data to downstream operator
● Operator 3: File Writer
○ Purpose: Write File
○ Steps: Save the data read by Operator 2

DAG - Simple Design
File Splitter → [BlockMetaData] → Block Reader → [Data] → File Writer
Challenges
● Reading a single file in parallel is not possible
○ Multiple Block Readers and File Writers can handle different files in parallel, but a single file can't be read by two Block Readers
● Failure recovery is hard

DAG - Read Data
● Operator 1: File Splitter
○ Purpose: Break the whole task into smaller sub-tasks
○ Steps: Connect to input and scan for available data; assign smaller tasks for downstream operators
● Operator 2: Block Reader
○ Purpose: Work on the sub-tasks given by Operator 1, one at a time
○ Steps: Connect to source and read data as smaller tasks one-by-one; pass on the read data to downstream operator
● Operator 3: File Writer
○ Purpose: Write File
○ Steps: Save the data read by Operator 2
● Operator 4: Synchronizer
○ Purpose: Check for completeness
○ Steps: Make sure all smaller tasks for a file are completed by upstream operators & send file merger trigger

DAG - Input
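[Figure: input DAG, with Block Reader and Block Writer each shown as two parallel instances]
● File Splitter → Block Reader: BlockMetaData
● File Splitter → Synchronizer: FileMetaData
● Block Reader → Block Writer: Data, BlockMetaData
● Block Writer → Synchronizer: BlockMetaData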

Input DAG - FileSplitter
File Splitter
● Scans input files/directories
● Creates smaller sub-tasks
● Outputs: FileMetaData, BlockMetaData
● Parameters
○ input files/directories to copy data from
○ recursive - Yes / No
○ polling - Yes / No
○ bandwidth - MB / sec
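The scanning step and its "recursive" parameter can be illustrated with a small sketch. It walks a local directory rather than HDFS, S3, FTP, or NFS, and the function name `scan_input` is hypothetical. Polling and bandwidth limiting are left out.

```python
import os
import tempfile
from typing import Iterator


def scan_input(root: str, recursive: bool = True) -> Iterator[str]:
    """Yield the relative paths of the files under root, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        if recursive:
            dirnames.sort()   # visit sub-directories in a stable order
        else:
            dirnames.clear()  # do not descend into sub-directories
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, "/")


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data"))
    for rel_path in ("a.txt", os.path.join("data", "b.txt")):
        with open(os.path.join(tmp, rel_path), "w") as f:
            f.write("x")
    all_files = list(scan_input(tmp, recursive=True))
    top_level = list(scan_input(tmp, recursive=False))

print(all_files, top_level)
```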

Input DAG - FileSplitter
● For each file in the directory:
■ [output] FileMetaData - file information
● Name: InputFile.txt
● Size: 1073741824 (1GB)
● Relative path: input/data/InputFile.txt
● Block IDs into which the file is virtually split: [0,1,2,3,4,5,6,7]
■ [output] BlockMetaData - block information
● BlockID: 1
● Start position: 134217728
● End position: 268435456 (block length 128MB)
● File URL: hdfs://node18:8020/user/sandeep/input
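The virtual split can be sketched as follows. The class and function names are illustrative rather than the operator's actual code, and the sketch assumes 0-based block IDs, a fixed 128MB block size, and an exclusive end position. Those assumptions reproduce the start and end positions of block 1 in the example above.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 128 * 1024 * 1024  # 134217728 bytes, the block size in the example


@dataclass(frozen=True)
class BlockMetaData:
    block_id: int
    start: int      # first byte of the block (inclusive)
    end: int        # byte after the last byte of the block (exclusive, assumed)
    file_url: str


def split_into_blocks(file_url: str, file_size: int,
                      block_size: int = BLOCK_SIZE) -> List[BlockMetaData]:
    """Describe a file as consecutive fixed-size blocks; the last may be shorter."""
    blocks = []
    offset = 0
    block_id = 0
    while offset < file_size:
        end = min(offset + block_size, file_size)
        blocks.append(BlockMetaData(block_id, offset, end, file_url))
        offset = end
        block_id += 1
    return blocks


blocks = split_into_blocks("hdfs://node18:8020/user/sandeep/input", 1073741824)
print(len(blocks), blocks[1])
```

No data is read at this stage. The splitter only emits descriptions of byte ranges, which is what lets several readers work on one file at the same time.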

Input DAG - BlockReader
Block Reader
● Reads a block from the remote location and emits Data
● Input: BlockMetaData
● Outputs: Data, BlockMetaData
● Parameters
○ Input URL, e.g. hdfs://node18:8020/user/hduser/input
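Reading one block comes down to seeking to the block's start position and reading up to its end position. The sketch below uses a local file as a stand-in for the remote source and assumes an exclusive end position; `read_block` is a hypothetical helper, not part of any Apex library.

```python
import os
import tempfile


def read_block(path: str, start: int, end: int) -> bytes:
    """Return the bytes in the range [start, end) of the file at path."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


# Demonstration: a 10-byte file read as blocks of at most 4 bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "InputFile.txt")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    first = read_block(path, 0, 4)
    second = read_block(path, 4, 8)
    last = read_block(path, 8, 10)

print(first, second, last)
```

Because each call depends only on its own byte range, the three reads could run in different reader instances and in any order.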

Input DAG - BlockWriter
Block Writer
● Writes block data to the local HDFS
● Inputs: BlockMetaData, Data
● Output: BlockMetaData
● Saves data in the application's directory

Input DAG - Synchronizer
Synchronizer
● Tracks the blocks of each file and sends a trigger once all the blocks of that file are available
● Inputs: FileMetaData, BlockMetaData
● Output: FileMetaData
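The bookkeeping described above can be sketched as a small class. The names and the choice to key files by path are assumptions made for illustration. The sketch also allows a block notification to arrive before the file's own metadata, which can happen when the two travel on different streams.

```python
from typing import Dict, Iterable, Optional, Set


class Synchronizer:
    """Hypothetical tracker: reports a file once all of its blocks have arrived."""

    def __init__(self) -> None:
        self._pending: Dict[str, Set[int]] = {}  # file -> block IDs still missing
        self._early: Dict[str, Set[int]] = {}    # blocks seen before file metadata

    def on_file(self, path: str, block_ids: Iterable[int]) -> Optional[str]:
        """Register a file; return its path if it is already complete."""
        missing = set(block_ids) - self._early.pop(path, set())
        if not missing:
            return path
        self._pending[path] = missing
        return None

    def on_block(self, path: str, block_id: int) -> Optional[str]:
        """Record a written block; return the file path if this completes it."""
        missing = self._pending.get(path)
        if missing is None:
            self._early.setdefault(path, set()).add(block_id)
            return None
        missing.discard(block_id)
        if not missing:
            del self._pending[path]
            return path  # trigger: all blocks of this file are available
        return None


sync = Synchronizer()
events = [
    sync.on_file("a.txt", [0, 1, 2]),
    sync.on_block("a.txt", 0),
    sync.on_block("a.txt", 2),
    sync.on_block("a.txt", 1),   # last missing block, completes a.txt
    sync.on_block("b.txt", 0),   # block arrives before its file metadata
    sync.on_file("b.txt", [0]),  # already complete on registration
]
print(events)
```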

DAG - Input
● File Splitter → Block Reader: BlockMetaData
● File Splitter → Synchronizer: FileMetaData
● Block Reader → Block Writer: Data, BlockMetaData
● Block Writer → Synchronizer: BlockMetaData
● Synchronizer → output: FileMetaData

Output DAG - FileMerger
FileMerger
● Merges blocks to recreate the original file
● Input: FileMetaData
● Parameters
○ Output directory to copy data to
○ Overwrite - Yes/No
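A plain merge, without any fast path, amounts to appending the block files to the output in block order. The sketch below works on local files and uses a hypothetical `merge_blocks` helper; the `overwrite` flag mirrors the Overwrite parameter above.

```python
import os
import shutil
import tempfile
from typing import List


def merge_blocks(block_paths: List[str], output_path: str,
                 overwrite: bool = False) -> None:
    """Concatenate block files, in the given order, into output_path."""
    if os.path.exists(output_path) and not overwrite:
        raise FileExistsError(output_path)
    with open(output_path, "wb") as out:
        for block_path in block_paths:
            with open(block_path, "rb") as block:
                shutil.copyfileobj(block, out)


# Demonstration with three small block files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, payload in enumerate([b"hello", b" ", b"world"]):
        p = os.path.join(tmp, f"block_{i}")
        with open(p, "wb") as f:
            f.write(payload)
        paths.append(p)
    target = os.path.join(tmp, "merged.txt")
    merge_blocks(paths, target)
    with open(target, "rb") as f:
        merged = f.read()
    try:
        merge_blocks(paths, target)  # output exists and overwrite is off
        refused = False
    except FileExistsError:
        refused = True

print(merged, refused)
```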

Output DAG - FileMerger - FastMerge Magic
[Figure: the different blocks B1, B2, ..., Bn, with replicas spread across DataNode1 to DataNode4, shown alongside the merged file made up of blocks 1, 2, ..., n]

Output DAG - FileMerger - FastMerge Magic
● Same replication factor
● On same HDFS cluster
● Same block size for all files
● Size of all files (except last) : multiple of block size
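Reading the four bullets as preconditions for the fast merge (an interpretation, since the slide does not say so explicitly), an eligibility check could look like the sketch below. The `BlockFile` record and `can_fast_merge` function are hypothetical and do not call any HDFS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockFile:
    cluster: str          # identifier of the HDFS cluster holding the file
    replication: int      # replication factor
    hdfs_block_size: int  # HDFS block size in bytes (assumed > 0)
    length: int           # file length in bytes


def can_fast_merge(files: List[BlockFile]) -> bool:
    """Check the four listed conditions for a sequence of block files."""
    if not files:
        return False
    first = files[0]
    uniform = all(
        f.cluster == first.cluster
        and f.replication == first.replication
        and f.hdfs_block_size == first.hdfs_block_size
        for f in files
    )
    # Every file except the last must be a whole number of HDFS blocks.
    aligned = all(f.length % f.hdfs_block_size == 0 for f in files[:-1])
    return uniform and aligned


MB = 1024 * 1024
ok = [BlockFile("c1", 3, 128 * MB, 128 * MB), BlockFile("c1", 3, 128 * MB, 5 * MB)]
misaligned = [BlockFile("c1", 3, 128 * MB, 5 * MB), BlockFile("c1", 3, 128 * MB, 128 * MB)]
mixed = [BlockFile("c1", 3, 128 * MB, 128 * MB), BlockFile("c1", 2, 128 * MB, 5 * MB)]
print(can_fast_merge(ok), can_fast_merge(misaligned), can_fast_merge(mixed))
```

When the check fails, a merger can still fall back to copying bytes block by block.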

DAG - Complete
● File Splitter → Block Reader: BlockMetaData
● File Splitter → Synchronizer: FileMetaData
● Block Reader → Block Writer: Data, BlockMetaData
● Block Writer → Synchronizer: BlockMetaData
● Synchronizer → FileMerger: FileMetaData

Other features: Optional processing
● Compression
○ Gzip and LZO
● Encryption
○ PKI & AES
● Compaction
○ Size based
● Dedup
● Dimension Computation & Aggregation

Summary
● Easy to use
○ Configure and run
● Unified for batch and continuous ingestion
● Handles
○ Large files
○ Large number of small files

Resources
• Apache Apex website - http://apex.incubator.apache.org/
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Startup Program – Free Enterprise License for startups, Universities, Non-Profits
