Map-Reduce Programming
with Hadoop
CS5225 Parallel and Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
HDFS
• HDFS – Hadoop Distributed File System
• The file system supported by Hadoop
• Based on ideas presented in the paper “The Google File System”
• Highly scalable file system for handling large data sets
HDFS Architecture
HDFS Architecture (Cont.)
• HDFS has a master-slave architecture
• Name Node – master node
  - Manages the file system namespace
  - Regulates access to files by clients
• Data Node
  - Manages the storage attached to the node it runs on
  - Responsible for serving read & write requests from the file system's clients
  - Performs block creation, deletion, & replication upon instruction from the Name Node
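The division of labour above can be illustrated with a toy model in plain Java. This is a sketch under stated assumptions, not HDFS code: the class names `ToyNameNode` and `ToyDataNode`, the 4-byte block size, the replication factor of 2, and the round-robin placement are all invented for illustration. In real HDFS the client fetches block locations from the Name Node and then exchanges data directly with the Data Nodes; here both steps are folded into one method for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy data node: stores block contents keyed by block id. */
class ToyDataNode {
    final Map<String, byte[]> blocks = new HashMap<>();
}

/** Toy name node: keeps only metadata (file -> block ids, block id -> data node indexes). */
class ToyNameNode {
    static final int BLOCK_SIZE = 4;   // hypothetical, tiny so the example is easy to trace
    static final int REPLICATION = 2;  // hypothetical replication factor

    final List<ToyDataNode> dataNodes = new ArrayList<>();
    final Map<String, List<String>> fileToBlocks = new HashMap<>();
    final Map<String, List<Integer>> blockToNodes = new HashMap<>();

    ToyNameNode(int numDataNodes) {
        for (int i = 0; i < numDataNodes; i++) dataNodes.add(new ToyDataNode());
    }

    /** Splits the data into blocks and stores each block on REPLICATION data nodes. */
    void write(String file, byte[] data) {
        List<String> blockIds = new ArrayList<>();
        for (int off = 0, n = 0; off < data.length; off += BLOCK_SIZE, n++) {
            String id = file + "#" + n;
            byte[] block = Arrays.copyOfRange(data, off, Math.min(off + BLOCK_SIZE, data.length));
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                int node = (n + r) % dataNodes.size();  // round-robin placement (assumption)
                dataNodes.get(node).blocks.put(id, block);
                holders.add(node);
            }
            blockIds.add(id);
            blockToNodes.put(id, holders);
        }
        fileToBlocks.put(file, blockIds);
    }

    /** Looks up block locations in the metadata, then reads each block from its first replica. */
    byte[] read(String file) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String id : fileToBlocks.get(file)) {
            int node = blockToNodes.get(id).get(0);
            byte[] block = dataNodes.get(node).blocks.get(id);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(3);
        nameNode.write("/demo.txt", "hello hdfs".getBytes(StandardCharsets.UTF_8));
        System.out.println(nameNode.blockToNodes);
        System.out.println(new String(nameNode.read("/demo.txt"), StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is that the name node object never stores file contents, only the two metadata maps, while the data node objects hold the bytes.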
HDFS Architecture (Cont.)
HDFS in Production
• Yahoo! Search Webmap is a Hadoop application
  - Webmap starts with every web page crawled by Yahoo! & produces a database of all known web pages
  - This derived data feeds the Machine Learned Ranking algorithms
• Runs on 10,000+ core Linux clusters & produces data that is used in every Yahoo! Web search query
  - 1 trillion links
  - Produces over 300 TB, compressed!
  - Over 5 petabytes of raw disk used in the production cluster
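HDFS Java Client
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Connect to HDFS using the cluster's configuration files
Configuration conf = new Configuration(false);
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/core-site.xml"));
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);

// filename & currentSystemTime are supplied by the surrounding code
Path filenamePath = new Path(filename);
if (fs.exists(filenamePath)) {
    // remove the file first (false = do not delete recursively)
    fs.delete(filenamePath, false);
}

// Write a value to the file
FSDataOutputStream out = fs.create(filenamePath);
out.writeUTF(String.valueOf(currentSystemTime));
out.close();

// Read the value back
FSDataInputStream in = fs.open(filenamePath);
String messageIn = in.readUTF();
System.out.print(messageIn);
in.close();

// Print the content summary of the path
System.out.println(fs.getContentSummary(filenamePath).toString());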
Install Hadoop
• 3 different options
1. Local
  - Runs in a single JVM
  - Just unzip the distribution
2. Pseudo-distributed
  - One machine, with each Hadoop daemon in its own JVM; behaves like a distributed installation
3. Distributed installation
More General Map/Reduce
• Typically, Map-Reduce implementations are a bit more general
1. Formatters
2. Partition Function
  - Splits the map output across many reduce function instances
3. Map Function
4. Combine Function
  - If there are many map steps, this step combines their results before passing them to Reduce
5. Reduce Function
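How the map, combine, partition, & reduce functions fit together can be sketched in memory with plain Java. This is a minimal illustration, not Hadoop's implementation: the class `MiniPipeline`, its method names, and the choice of word counting as the workload are assumptions, and input formatters are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the stages: map -> combine (per map task) -> partition -> reduce. */
class MiniPipeline {
    /** Map function: emits (word, 1) for every word of one input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    /** Combine function: sums the counts emitted by a single map task. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            local.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return local;
    }

    /** Partition function: picks the reduce instance responsible for a key. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs the pipeline; element i of the result is the output of reduce instance i. */
    static List<Map<String, Integer>> run(List<String> splits, int numReducers) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) reducers.add(new TreeMap<>());
        for (String split : splits) {
            // The reduce function (a sum) is applied as combined records arrive.
            combine(map(split)).forEach((k, v) ->
                    reducers.get(partition(k, numReducers)).merge(k, v, Integer::sum));
        }
        return reducers;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"), 2));
    }
}
```

Because the combine step runs once per split, the first split sends two records (`a=2`, `b=1`) to the reducers instead of three.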
Example – Word Count
• Find the words in a collection of documents & their frequency of occurrence
Map(docId, text):
    for all terms t in text
        emit(t, 1);
Reduce(t, values[]):
    int sum = 0;
    for all values v
        sum += v;
    emit(t, sum);
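The word-count pseudocode can be tried out without a cluster. The following is a minimal single-process sketch in plain Java; the class name `WordCountSketch`, the lower-casing, and the split on non-word characters are assumptions made for illustration, since the slide does not say how terms are tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> documents) {
        // Map + shuffle: emit (t, 1) for every term and group the 1s by term.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String text : documents) {
            for (String t : text.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) {
                    grouped.computeIfAbsent(t, k -> new ArrayList<>()).add(1);
                }
            }
        }
        // Reduce: sum the values collected for each term.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((t, values) -> {
            int sum = 0;
            for (int v : values) sum += v;
            counts.put(t, sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the cat", "The dog the")));
    }
}
```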
Example – Mean
• Compute the mean of the values associated with the same key
Map(k, value):
    emit(k, value);
Reduce(k, values[]):
    double sum = 0;    // floating point, so the division below is not truncated
    int count = 0;
    for all values v
        sum += v;
        count += 1;
    emit(k, sum / count);
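One caveat with the mean is that the reduce function cannot be reused as a combine function, because a mean of partial means is in general not the overall mean. A common workaround is to have each map task emit partial (sum, count) pairs and divide only at the end. The sketch below shows this in plain Java; the names `MeanSketch` and `SumCount` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MeanSketch {
    /** Partial aggregate that can be merged safely across map tasks. */
    static final class SumCount {
        final double sum;
        final long count;
        SumCount(double sum, long count) { this.sum = sum; this.count = count; }
        SumCount merge(SumCount other) { return new SumCount(sum + other.sum, count + other.count); }
    }

    /** Combine: collapses one map task's (key, value) records into (key, SumCount). */
    static Map<String, SumCount> combine(List<Map.Entry<String, Double>> records) {
        Map<String, SumCount> partial = new TreeMap<>();
        for (Map.Entry<String, Double> r : records) {
            partial.merge(r.getKey(), new SumCount(r.getValue(), 1), SumCount::merge);
        }
        return partial;
    }

    /** Reduce: merges the partial aggregates of all map tasks and emits the mean per key. */
    static Map<String, Double> reduce(List<Map<String, SumCount>> partials) {
        Map<String, SumCount> total = new TreeMap<>();
        for (Map<String, SumCount> p : partials) {
            p.forEach((k, sc) -> total.merge(k, sc, SumCount::merge));
        }
        Map<String, Double> means = new TreeMap<>();
        total.forEach((k, sc) -> means.put(k, sc.sum / sc.count));
        return means;
    }

    public static void main(String[] args) {
        Map<String, SumCount> task1 = combine(List.of(Map.entry("x", 1.0), Map.entry("x", 2.0)));
        Map<String, SumCount> task2 = combine(List.of(Map.entry("x", 6.0), Map.entry("y", 4.0)));
        System.out.println(reduce(List.of(task1, task2)));
    }
}
```

For key `x` the two tasks hold means of 1.5 and 6.0; averaging those would give 3.75, whereas merging the (sum, count) pairs gives the correct 9 / 3 = 3.0.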
Example – Sorting
• How to sort an array of 1 million integers using Map-Reduce?
• Partial sorts at the mappers & final sort by the reducers
• Use a locality-preserving hash function to assign keys to reducers
  - If k1 < k2 then hash(k1) < hash(k2)
Map(k, v):
    int val = read value from v
    emit(val, val);
Reduce(k, values[]):
    for all values v        // one output per input, so duplicates are kept
        emit(k, v);
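The role of the order-preserving partition function can be seen in a small in-memory sketch. It assumes the minimum & maximum keys are known up front and splits the key range evenly across reducers; the class `RangeSortSketch` and its methods are hypothetical. Since several keys share a reducer, the guarantee used here is the weaker form: k1 < k2 implies partition(k1) <= partition(k2).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RangeSortSketch {
    /** Order-preserving partition: maps [min, max] evenly onto reducers 0..numReducers-1. */
    static int partition(int key, int min, int max, int numReducers) {
        long range = (long) max - min + 1;
        return (int) (((long) key - min) * numReducers / range);
    }

    static List<Integer> sort(List<Integer> input, int numReducers) {
        if (input.isEmpty()) return new ArrayList<>();
        int min = Collections.min(input);
        int max = Collections.max(input);

        // Map + shuffle: route every value to the reducer that owns its key range.
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) buckets.add(new ArrayList<>());
        for (int v : input) {
            buckets.get(partition(v, min, max, numReducers)).add(v);
        }

        // Reduce: each reducer sorts its own range; concatenating the
        // reducer outputs in partition order yields a globally sorted list.
        List<Integer> sorted = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
            sorted.addAll(bucket);
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(5, 3, 9, 1, 3, 7), 3));
    }
}
```

An even split only balances the load when keys are roughly uniform; Hadoop's `TotalOrderPartitioner` addresses this by taking partition boundaries from a sample of the input.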
Example – Inverted Index
• A normal index is a mapping from documents to terms
• An inverted index is a mapping from terms to documents
• If we have a million documents, how do we build an inverted index using Map-Reduce?
Map(docid, text):
    for all words w in text
        emit(w, docid);
Reduce(w, docids[]):
    emit(w, docids[]);
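As a rough single-process illustration of the inverted index, the sketch below builds the term-to-documents mapping in plain Java. The class name `InvertedIndexSketch`, the tokenization, and the use of a sorted set (which removes repeated document ids for a term) are assumptions, not part of the slide.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    /** Takes docId -> text and returns word -> ids of the documents containing it. */
    static Map<String, SortedSet<String>> build(Map<String, String> docs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        docs.forEach((docId, text) -> {
            for (String w : text.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    // Map: emit(w, docid); grouping by w plays the role of the shuffle.
                    index.computeIfAbsent(w, k -> new TreeSet<>()).add(docId);
                }
            }
        });
        // Reduce: emit(w, docids[]) is the identity on each group.
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(Map.of("d1", "big data", "d2", "big cluster")));
    }
}
```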
Example – Distributed Grep
Map(k, v):
    Id docId = .. (read file name)
    if (v matches pattern)
        emit(k, (pattern, docId));
Reduce(k, values[]):
    emit(k, values);
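The grep job is essentially a filter in the map step followed by an identity reduce. A minimal in-memory sketch in plain Java follows; it keys the output by document id rather than by the input key k, which is a simplification chosen here, and the class name `GrepSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class GrepSketch {
    /** Takes docId -> lines and returns docId -> matching lines for documents with a match. */
    static Map<String, List<String>> grep(Map<String, List<String>> docs, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, List<String>> matches = new TreeMap<>();
        docs.forEach((docId, lines) -> {
            for (String line : lines) {
                // Map: emit only the lines in which the pattern occurs.
                if (pattern.matcher(line).find()) {
                    matches.computeIfAbsent(docId, k -> new ArrayList<>()).add(line);
                }
            }
        });
        // Reduce: the identity function, each group is emitted unchanged.
        return matches;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
                "a.log", List.of("ERROR disk full", "ok"),
                "b.log", List.of("all fine"));
        System.out.println(grep(docs, "ERROR"));
    }
}
```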
Composition with Map-Reduce
• Map/Reduce should not be treated as a fixed template
• It should be used with Fork/Join, etc., to build solutions
• A solution may have more than one Map/Reduce step
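To make the idea of multiple steps concrete, here is a small in-memory sketch in plain Java in which the output of one Map/Reduce-style step is the input of the next. The task (count words, then group the words by their count) and all names in `ChainedStepsSketch` are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class ChainedStepsSketch {
    /** Step 1: map emits (word, 1); reduce sums the values, giving word -> count. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Step 2: map emits (count, word) from step 1's output; reduce collects words per count. */
    static TreeMap<Integer, SortedSet<String>> groupByCount(Map<String, Integer> counts) {
        TreeMap<Integer, SortedSet<String>> byCount = new TreeMap<>();
        counts.forEach((word, n) -> byCount.computeIfAbsent(n, k -> new TreeSet<>()).add(word));
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> step1 = countWords(List.of("a b a", "c b a"));
        TreeMap<Integer, SortedSet<String>> step2 = groupByCount(step1);
        // The last entry holds the highest count and the words that reach it.
        System.out.println(step2.lastEntry());
    }
}
```

Neither step alone answers "which words are most frequent"; the second step needs the totals that only exist once the first has finished, which is why the two are chained rather than merged.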
Composition with Map-Reduce – Example
• Calculate the following for a list of a million integers
Map Reduce Client
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountSample {
    // Mapper: input (offset, line) -> output (word, count)
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException { ….. }
    }

    // Reducer: also registered as the combiner below
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException { .. }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountSample.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("/input"));
        // a time-stamped output directory avoids clashes with earlier runs
        FileOutputFormat.setOutputPath(conf, new Path("/output/" + System.currentTimeMillis()));
        JobClient.runJob(conf);
    }
}
Example: http://wiki.apache.org/hadoop/WordCount
Format to Parse Custom Data
// uses the new API: org.apache.hadoop.mapreduce.{Job, InputSplit, RecordReader, TaskAttemptContext}
// & org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// add the following to the main method
Job job = new Job(conf, "LogProcessingHitsByLink");
….
job.setInputFormatClass(MboxFileFormat.class);
..
System.exit(job.waitForCompletion(true) ? 0 : 1);

// write a formatter
public class MboxFileFormat extends FileInputFormat<Text, Text> {
    private MBoxFileReader boxFileReader = null;
    @Override
    public RecordReader<Text, Text> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext attempt) throws IOException, InterruptedException {
        boxFileReader = new MBoxFileReader();
        boxFileReader.initialize(inputSplit, attempt);
        return boxFileReader;
    }
}

// write a reader
public class MBoxFileReader extends RecordReader<Text, Text> {
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
            throws IOException, InterruptedException { .. }
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException { .. }
    // getCurrentKey(), getCurrentValue(), getProgress(), & close() must also be implemented
}
Your Own Partitioner
public class IPBasedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text ipAddress, IntWritable value, int numPartitions) {
        // getGeoLocation() is a helper (not shown) that maps an IP address to a region
        String region = getGeoLocation(ipAddress);
        if (region != null) {
            // mask the sign bit so that the partition index is never negative
            return ((region.hashCode() & Integer.MAX_VALUE) % numPartitions);
        }
        return 0;
    }
}
Set the Partitioner class parameter in the job object:
Job job = new Job(getConf(), "log-analysis");
……
job.setPartitionerClass(IPBasedPartitioner.class);
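The `& Integer.MAX_VALUE` in the partitioner is there because `hashCode()` may be negative, and Java's `%` keeps the sign of the dividend. The plain-Java sketch below compares three ways of turning a hash into a partition index; the class `PartitionIndexSketch` and its method names are hypothetical, and the methods take the hash value directly so that the arithmetic is easy to check.

```java
class PartitionIndexSketch {
    /** Plain remainder: negative for negative hashes, so not a valid partition index. */
    static int naive(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    /** Math.abs first: still breaks for Integer.MIN_VALUE, whose absolute value overflows. */
    static int absolute(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }

    /** Clears the sign bit first: always in the range 0..numPartitions-1. */
    static int masked(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int[] hashes = {42, -7, Integer.MIN_VALUE};
        for (int h : hashes) {
            System.out.println(h + ": naive=" + naive(h, 3)
                    + " absolute=" + absolute(h, 3)
                    + " masked=" + masked(h, 3));
        }
    }
}
```

Only the masked variant stays within the valid range for all three inputs, which is why the partitioner above uses it.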
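Using Distributed File Cache
• Gives a job access to a static file
FileSystem fs = FileSystem.get(conf);
// copy the file to HDFS so that every node can fetch it
fs.copyFromLocalFile(new Path(scriptFileLocation),
        new Path("/debug/fail-script"));
// mapUri (defined elsewhere) is the URI of the file to cache
DistributedCache.addCacheFile(mapUri, conf);
DistributedCache.createSymlink(conf);
// create the job last: Job takes a copy of conf, so later changes to conf are not seen by the job
Job job = new Job(conf, "word count");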
