Map-Reduce Programming
with Hadoop
CS5225 Parallel and Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
HDFS
 HDFS – Hadoop Distributed File System
 File system supported by Hadoop
 Based on ideas presented in “The Google File System” paper
 Highly scalable file system for handling large data sets
HDFS Architecture
(architecture diagram on the original slide)
HDFS Architecture (Cont.)
 HDFS has a master-slave architecture
 Name Node – Master node
 Manages the file system namespace
 Regulates access to files by clients
 Data Node
 Manages storage attached to the nodes
 Responsible for serving read & write requests from the file system’s clients
 Performs block creation, deletion, & replication upon instruction from the Name Node
HDFS Architecture (Cont.)
(data flow diagram on the original slide)
HDFS in Production
 Yahoo! Search Webmap is a Hadoop application
 Webmap starts with every web page crawled by Yahoo! & produces a database of all known web pages
 This derived data feeds into Machine-Learned Ranking algorithms
 Runs on 10,000+ core Linux clusters & produces data that is used in every Yahoo! Web search query
 1 trillion links
 Produces over 300 TB of output, compressed!
 Over 5 petabytes of raw disk used in the production cluster
HDFS Java Client
// Load the cluster configuration
Configuration conf = new Configuration(false);
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/core-site.xml"));
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/hdfs-site.xml"));
// Connect to the file system
FileSystem fs = FileSystem.get(conf);
Path filenamePath = new Path(filename);
if (fs.exists(filenamePath)) {
  // Remove any previous copy of the file first
  fs.delete(filenamePath, false);
}
// Write a value to the file
FSDataOutputStream out = fs.create(filenamePath);
out.writeUTF(String.valueOf(currentSystemTime));
out.close();
// Read the value back
FSDataInputStream in = fs.open(filenamePath);
String messageIn = in.readUTF();
System.out.print(messageIn);
in.close();
// Print a summary (size & file/directory counts) of the path
System.out.println(fs.getContentSummary(filenamePath).toString());
Install Hadoop
 3 different options
1. Local
 Single-JVM installation
 Just unzip & run
2. Pseudo distributed
 One JVM, but behaves like a distributed installation
3. Distributed installation
More General Map/Reduce
 Typically, Map-Reduce implementations are a bit more general, consisting of 5 pluggable functions
1. Formatters
2. Partition Function
 Distributes map output across the reduce function instances
3. Map Function
4. Combine Function
 Combines each map task’s output locally before it is handed to Reduce
5. Reduce Function
Example – Word Count
 Find the words in a collection of documents & their frequency of occurrence
Map(docId, text):
  for all terms t in text
    emit(t, 1);
Reduce(t, values[])
  int sum = 0;
  for all values v
    sum += v;
  emit(t, sum);
 As addition is associative & commutative, the same Reduce can also serve as the Combine function (see the client code later)
Example – Mean
 Compute the mean value associated with each key
Map(k, value):
  emit(k, value);
Reduce(k, values[])
  int sum = 0;
  int count = 0;
  for all values v
    sum += v;
    count += 1;
  emit(k, sum/count);
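A minimal Java version of this reducer, as a sketch using the org.apache.hadoop.mapreduce API (the class name is illustrative). Note that, unlike in word count, this Reduce cannot double as the Combine function: a mean of partial means is not the overall mean, so a combiner would have to emit (sum, count) pairs instead.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: emits the mean of all values observed for each key
public class MeanReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (IntWritable v : values) {
      sum += v.get();
      count++;
    }
    // Hadoop only calls reduce() for keys with at least one value,
    // so count > 0 here
    context.write(key, new DoubleWritable((double) sum / count));
  }
}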
Example – Sorting
 How to sort an array of 1 million integers using Map-Reduce?
 Partial sorts at mappers & a final sort by reducers
 Use a locality-preserving hash function to partition keys
 If k1 < k2 then hash(k1) < hash(k2)
Map(k, v):
  int val = read value from v
  emit(val, val);
Reduce(k, values[])
  emit(k, k);
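Hadoop already sorts keys within each partition, so the locality-preserving hash is really the partition function: if lower key ranges go to lower-numbered reducers, concatenating the reducer outputs in partition order yields a globally sorted result. A minimal sketch, assuming keys are uniformly distributed 32-bit integers (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: a range (order-preserving) partitioner; keys in lower ranges
// go to lower-numbered reducers
public class RangePartitioner extends Partitioner<IntWritable, IntWritable> {
  @Override
  public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
    // Map the full signed range [Integer.MIN_VALUE, Integer.MAX_VALUE]
    // onto [0, numPartitions) while preserving order
    long shifted = (long) key.get() - Integer.MIN_VALUE; // now in [0, 2^32)
    return (int) (shifted * numPartitions / (1L << 32));
  }
}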
Example – Inverted Index
 A normal index is a mapping from documents to terms
 An inverted index is a mapping from terms to documents
 If we have a million documents, how do we build an inverted index using Map-Reduce?
Map(docId, text):
  for all words w in text
    emit(w, docId)
Reduce(w, docIds[])
  emit(w, docIds[]);
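A hedged Java sketch of the two phases (class names are illustrative), assuming an input format that supplies (docId, text) pairs, e.g., via a custom reader like the Mbox example later:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text word = new Text();
    @Override
    public void map(Text docId, Text text, Context context)
        throws IOException, InterruptedException {
      StringTokenizer st = new StringTokenizer(text.toString());
      while (st.hasMoreTokens()) {
        word.set(st.nextToken());
        context.write(word, docId); // emit (term, docId)
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      // Concatenate the document IDs into a posting list
      StringBuilder postings = new StringBuilder();
      for (Text id : docIds) {
        if (postings.length() > 0) postings.append(',');
        postings.append(id.toString());
      }
      context.write(word, new Text(postings.toString()));
    }
  }
}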
Example – Distributed Grep
Map(k, v):
  docId = … (read from the input file name)
  if (v matches the grep pattern)
    emit(k, (pattern, docId))
Reduce(k, values[])
  emit(k, values);
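A hedged Java sketch of the map side (the class & configuration-key names are illustrative): the file name serves as the document ID, & matching lines are emitted so that a reducer — even an identity reducer — groups matches per file.

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: emits (fileName, matchingLine) for every line matching the pattern
public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
  private Pattern pattern;
  private final Text fileName = new Text();

  @Override
  protected void setup(Context context) {
    // The pattern is passed through the job configuration
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
    fileName.set(((FileSplit) context.getInputSplit()).getPath().getName());
  }

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(line.toString()).find()) {
      context.write(fileName, line);
    }
  }
}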
Composition with Map-Reduce
 Map/Reduce is not a tool to be used as a fixed template
 It should be combined with Fork/Join, etc., to build solutions
 A solution may have more than one Map/Reduce step
Composition with Map-Reduce – Example
 Calculate the following for a list of a million integers
 (formula shown on the original slide)
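A sketch of such a composition, with two chained jobs where the second reads the first’s output directory — e.g., a statistic like the variance needs the mean from a first pass. Class & path names below are illustrative, & the mapper/reducer wiring is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Step 1: first Map/Reduce pass (e.g., compute the mean)
    Job step1 = new Job(conf, "step-1");
    step1.setJarByClass(TwoStepDriver.class);
    // step1.setMapperClass(...); step1.setReducerClass(...);
    FileInputFormat.setInputPaths(step1, new Path("/input"));
    FileOutputFormat.setOutputPath(step1, new Path("/tmp/step1-out"));
    if (!step1.waitForCompletion(true)) System.exit(1);

    // Step 2: second pass consumes step 1's output
    Job step2 = new Job(conf, "step-2");
    step2.setJarByClass(TwoStepDriver.class);
    // step2.setMapperClass(...); step2.setReducerClass(...);
    FileInputFormat.setInputPaths(step2, new Path("/tmp/step1-out"));
    FileOutputFormat.setOutputPath(step2, new Path("/output"));
    System.exit(step2.waitForCompletion(true) ? 0 : 1);
  }
}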
Map Reduce Client
// Imports (java.util.*, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*) omitted on the slide
public class WordCountSample {
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Tokenize the line & emit (word, 1) for every word
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Sum the partial counts produced by the maps & combiners
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountSample.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    // Reduce doubles as the combiner, as summing is associative & commutative
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/output/" + System.currentTimeMillis()));
    JobClient.runJob(conf);
  }
}
Example: http://wiki.apache.org/hadoop/WordCount
Format to Parse Custom Data
// Add the following to the main method
Job job = new Job(conf, "LogProcessingHitsByLink");
….
job.setInputFormatClass(MboxFileFormat.class);
..
System.exit(job.waitForCompletion(true) ? 0 : 1);

// Write a formatter
public class MboxFileFormat extends FileInputFormat<Text, Text> {
  public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit,
      TaskAttemptContext attempt) throws IOException, InterruptedException {
    // Create & initialize a fresh reader for each input split
    MBoxFileReader boxFileReader = new MBoxFileReader();
    boxFileReader.initialize(inputSplit, attempt);
    return boxFileReader;
  }
}

// Write a reader
public class MBoxFileReader extends RecordReader<Text, Text> {
  public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
      throws IOException, InterruptedException { .. }
  public boolean nextKeyValue() throws IOException, InterruptedException { .. }
  // getCurrentKey(), getCurrentValue(), getProgress(), & close()
  // overrides are also required, but elided here
}
Your Own Partitioner
public class IPBasedPartitioner extends Partitioner<Text, IntWritable> {
  public int getPartition(Text ipAddress, IntWritable value, int numPartitions) {
    // getGeoLocation() is a user-defined helper that maps an IP address to a region
    String region = getGeoLocation(ipAddress);
    if (region != null) {
      // Mask the sign bit so the modulo result is non-negative
      return (region.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    return 0;
  }
}
Set the Partitioner class parameter in the job object:
Job job = new Job(getConf(), "log-analysis");
……
job.setPartitionerClass(IPBasedPartitioner.class);
Using Distributed File Cache
 Gives a job access to a static file
Job job = new Job(conf, "word count");
FileSystem fs = FileSystem.get(conf);
// First copy the file into HDFS (scriptFileLocation is defined elsewhere)
fs.copyFromLocalFile(new Path(scriptFileLocation),
  new Path("/debug/fail-script"));
// mapUri (defined elsewhere) points to the HDFS copy; the symlink makes
// cached files visible in each task’s working directory
DistributedCache.addCacheFile(mapUri, conf);
DistributedCache.createSymlink(conf);
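Inside a task, the cached file can be read back from the local file system, e.g., when the task is configured. A sketch using the old-style mapred API from the word-count client (the class name is illustrative & the record-processing logic is elided):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private Path[] cachedFiles;

  @Override
  public void configure(JobConf job) {
    try {
      // Local paths of the cached files shipped to this node
      cachedFiles = DistributedCache.getLocalCacheFiles(job);
      if (cachedFiles != null && cachedFiles.length > 0) {
        BufferedReader reader =
            new BufferedReader(new FileReader(cachedFiles[0].toString()));
        // ... parse the static lookup data here ...
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to read distributed cache", e);
    }
  }

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // ... use the lookup data while processing each record ...
  }
}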