Introduction to the
Hadoop ecosystem
About me
About us
Why Hadoop?
How to scale data?
w1 w2 w3
r1 r2 r3
But…
What is Hadoop?
The Hadoop App Store
HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra
Chukwa
Intel
Sync
Flume Hana HyperT Impala Mahout Nutch Oozie Scoop
Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC
IBM Talend TeraData Pivotal Informat Microsoft. Pentaho Jasper
Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat
Data Storage
Hadoop Distributed File System
HDFS Architecture
Data Processing
MapReduce
Typical large-data problem
MapReduce Flow
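Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
Map: (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)
Combine: (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)
Shuffle and sort: a → [1, 3], b → [2, 7], c → [2, 8, 9]
Reduce: (a, 4) (b, 9) (c, 19)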
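The flow above can be traced in plain Java without a cluster. This is only an in-memory sketch, not Hadoop code: the class name `MapReduceFlowSketch`, its method names, and the split of the eight map outputs into four mappers are assumptions made for illustration. The values are the ones from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of the MapReduce flow; no Hadoop classes involved.
class MapReduceFlowSketch {

    // Combine: pre-aggregate the pairs emitted by ONE mapper by summing values per key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: gather the values of each key across all mappers, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            output.forEach((key, value) ->
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return grouped;
    }

    // Reduce: sum the value list of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
            result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    // Runs combine, shuffle and reduce over the outputs of several mappers.
    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffle(combined));
    }

    // The map outputs from the slide, grouped into four mappers (the grouping is assumed).
    static List<List<Map.Entry<String, Integer>>> slideExample() {
        return List.of(
            List.of(Map.entry("a", 1), Map.entry("b", 2)),
            List.of(Map.entry("c", 3), Map.entry("c", 6)),
            List.of(Map.entry("a", 3), Map.entry("c", 2)),
            List.of(Map.entry("b", 7), Map.entry("c", 8)));
    }

    public static void main(String[] args) {
        System.out.println(run(slideExample())); // prints {a=4, b=9, c=19}
    }
}
```

The combiner is what turns (c, 3) and (c, 6) into (c, 9) before anything crosses the network. It works here because summing is associative and commutative, so partial sums do not change the final result.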
Combined Hadoop Architecture
Word Count Mapper in Java
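import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across calls to avoid allocating a new object per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}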
Word Count Reducer in Java
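import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts emitted for one word and writes (word, total).
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        // Typed iterator: no cast needed, unlike the raw Iterator.
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}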
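To see what the mapper and reducer compute without a Hadoop installation, the same logic can be run locally. This is a hedged sketch: `LocalWordCount` and its methods are hypothetical names, and the sorted map only stands in for Hadoop's shuffle phase. It illustrates the data flow, not the job's runtime behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local harness mirroring the word count mapper and reducer; no Hadoop classes.
class LocalWordCount {

    // Map step: emit (word, 1) for each whitespace-separated token, as the mapper does.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum all counts collected for one word, as the reducer does.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map over every line, groups pairs by word (stand-in for shuffle), then reduces.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, values) -> counts.put(word, reduce(values)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or not", "to be")));
    }
}
```

Like the mapper above, this treats tokens as case-sensitive and keeps punctuation attached to words, since `StringTokenizer` splits on whitespace only.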
Scripting for Hadoop
Apache Pig
Pig in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
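To make the semantics of each Pig Latin statement concrete, here is the same pipeline as a small in-memory Java sketch. `TopSitesSketch`, `User`, `Page`, and `topSites` are hypothetical names. The join is simplified to a set lookup, which assumes user names are unique, and ties are broken by URL only to make the output deterministic. A real MapReduce version would need several chained jobs rather than one method.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of the Pig Latin script; not how Pig executes it.
class TopSitesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    static List<Map.Entry<String, Long>> topSites(List<User> users, List<Page> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = users.stream()
            .filter(u -> u.age >= 18 && u.age <= 50)
            .map(u -> u.name)
            .collect(Collectors.toSet());

        // JOIN filteredUsers BY name, pages BY user; GROUP BY url; COUNT
        // (set lookup stands in for the join, assuming unique user names)
        Map<String, Long> clicks = pages.stream()
            .filter(p -> filteredUsers.contains(p.user))
            .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER summed BY clicks DESC; LIMIT sorted <limit>
        return clicks.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(
            new User("alice", 25), new User("bob", 17), new User("carol", 40));
        List<Page> pages = List.of(
            new Page("alice", "a.com"), new Page("carol", "a.com"),
            new Page("bob", "b.com"), new Page("carol", "b.com"));
        System.out.println(topSites(users, pages, 10)); // prints [a.com=2, b.com=1]
    }
}
```

In the sample data, bob is filtered out by the age condition, so his visit to b.com is not counted.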
Pig Execution Plan
Try that with Java…
SQL for Hadoop
Apache Hive
Hive in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting Query
Hive Architecture
Hive Example
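-- Input files are comma-delimited, matching the Pig example.
-- Note: on newer Hive versions `user` may need backtick quoting as a reserved word.
CREATE TABLE users(name STRING, age INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE pages(user STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
-- ORDER BY gives a global ordering; SORT BY only orders rows within each reducer.
ORDER BY clicks DESC
LIMIT 10;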
Bringing it all together…
Online AdServing
AdServing Architecture
Getting started…
Hortonworks Sandbox
Hadoop Training


Introduction to the Hadoop Ecosystem (SEACON Edition)