Tutorial Meeting #2
Data Science and Machine Learning
Map-Reduce and the New Software Stack
Περικλής Ανδρίτσος
Introduction
2
• Once upon a time….
• There used to be just a single computer
• It performed calculations all by itself
• In the years that followed….
• Data sources started proliferating
• The size of the data used in calculations exploded
• Advanced programming techniques started to appear
• That’s when computer science stepped in and gave us “distributed
computation”
• MapReduce is a representative example of this technology
MapReduce (Section 2)
3
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
• Map-reduce addresses all of the above
• Google’s computational/data manipulation model
• Elegant way to work with big data
Single Node Architecture (Section 2.1)
4
[Diagram: a single node with CPU, Memory, and Disk; Machine Learning, Statistics, and “Classical” Data Mining all run on this one machine]
Main needs: a storage infrastructure and a programming paradigm (for analysis)
Motivation: Google Example
5
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• It takes even longer to do something useful with the data, both in terms of
storing it and the time needed to analyze it!
• Today, a standard architecture for such problems is emerging:
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
Big picture of the new paradigm
6
• Split the data into chunks
• Have multiple disks and CPUs
• Distribute the data chunks across multiple disks
• Process the data in parallel across CPUs
• Example:
• Time to read the data: 4 million secs (46+ days)
• If we had 1,000 CPUs we could do the task in
• 4 million / 1,000 = 4,000 seconds (about 1 hour)
Cluster Architecture (Section 2.1.1)
7
[Diagram: racks of nodes, each node with CPU, memory, and disk, connected through switches]
Each rack contains 16-64 nodes
1 Gbps bandwidth between any pair of nodes in a rack
2-10 Gbps backbone between racks
It has been estimated that Google has 1M machines,
http://bit.ly/Shh0RO
8
Large-scale Computing
9
• Large-scale computing for data mining
problems on commodity hardware
• Challenges:
• How do you distribute computation?
• How can we make it easy to write distributed programs?
• Machines fail:
• One server may stay up 3 years (1,000 days)
• If you have 1,000 servers, expect to lose 1/day
• If Google has ~1M machines, then
• 1,000 machines fail every day!
Large-scale Computing (2)
10
• Problem #1: If nodes fail at such a rate, how do we make sure data is
stored PERSISTENTLY?
• i.e., we are guaranteed to have the data (or copies of it) even if the machines that store it
fail
• What if nodes fail during a long-running process? What do we do?
• E.g., do we restart the process all over again?
• Solution: a new infrastructure that “hides” all these failures
Large-scale Computing (3)
11
• Problem #2: Network Bottleneck
• If bandwidth = 1 Gbps, moving 10TB takes ~1 day
• In typical applications data moves around to be analyzed
• In our case we should be moving enormous amounts of data into thousands
of servers and that can be prohibitive
• Solution: a new framework that doesn’t move data around !
• In my opinion: the big beauty of MapReduce !!
Map-Reduce (Section 2.2)
12
• It addresses the previously mentioned problems in the following ways:
• Stores data redundantly on multiple nodes (computers) for persistence and
availability
• Moves computation close to the data to minimize data movement. This is a
simple yet powerful idea.
• Offers a simple programming model that even less experienced programmers can
use
Idea and Solution
13
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce
Storage Infrastructure
14
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System (redundant storage):
• Provides global file namespace
• Implementations: Google GFS; Hadoop HDFS;
• Stores each data piece multiple times across servers
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place (e.g. storing URLs)
• Reads and appends are common
Distributed File System
15
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure
[Diagram: chunks C0-C5 and D0-D1 replicated across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N]
Bring computation directly to the data!
Chunk servers also serve as compute servers
Distributed File System
16
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores information about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
Coordination: Master
17
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its
R intermediate files, one for each reducer
• Master pushes this info to reducers
• Master pings workers periodically to detect failures
Dealing with Failures
18
• Map worker failure
• Map tasks completed or in-progress at
worker are reset to idle
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified
Fault tolerance: Handled via re-execution
• On worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Task completion committed through master
• Robust: Google lost 1600 of 1800 machines, but kept running fine
Programming Model: MapReduce
20
Warm-up task:
• We have a huge text document
• Maybe 10-20 TB in size
• Count the number of times each
distinct word appears in the file
• Sample application:
• Analyze web server logs to find popular URLs
• Keyword statistics for search
Task: Word Count
21
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory →
a simple program suffices!
Case 2:
• Count occurrences of words:
• words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
• Great thing is that it is naturally parallelizable
MapReduce: Overview
22
words(doc.txt) | sort
• Map
• Scan input file one record at a time
• Extract something you care about from each record (keys)
• Group by key
• Groups all the values with the same key
• Sort and shuffle
• Reduce
• Run a function
• Aggregate, summarize, filter or transform
• Write the answer
MapReduce: The Map Step
24
[Diagram: each input key-value pair (k, v) is passed to a map call, which emits a set of intermediate key-value pairs (k, v)]
MapReduce: The Reduce Step
25
[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, ...]); each reduce call processes one group and emits output key-value pairs]
More Specifically
26
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k’, v’>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k,v) pair
• Reduce(k’, <v’>*) → <k’, v’’>*
• All values v’ with same key k’ are reduced together
and processed in v’ order
• There is one Reduce function call per unique key k’
• Note: * means a set
Map-Reduce: Environment
27
Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program’s execution across a
set of machines
• Performing the group by key step
• Handling machine failures
• Managing required inter-machine
communication
Map-Reduce: A diagram (centralized)
28
[Diagram: Big document → Map function → Group by key → Reduce function]
MAP: Reads the input and produces a set of key-value pairs
Group by key: Collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: Collects all values belonging to the key and outputs the result
Map-Reduce: In Parallel
29
[Diagram: several map nodes and reduce nodes working side by side]
The shuffle ensures all intermediate pairs with the same key are sent to the same reduce node
All phases are distributed, with many tasks doing the work
Map-Reduce
30
• Programmer specifies:
• Map and Reduce and input files
• Workflow:
• Read inputs as a set of key-value-pairs
• Map transforms input kv-pairs into a new set
of k'v'-pairs
• Sorts & Shuffles the k'v'-pairs to output
nodes
• All k’v’-pairs with a given k’ are sent to the
same reduce
• Reduce processes all k'v'-pairs grouped by key
into new k''v''-pairs
• Write the resulting pairs to files
• All phases are distributed with many tasks
doing the work
[Diagram: Input 0/1/2 → Map 0/1/2 → Shuffle → Reduce 0/1 → Out 0/1]
Data Flow
31
• Input and final output are stored on a distributed file system (FS):
• Scheduler tries to schedule map tasks “close” to physical storage location of
input data
• Intermediate results are stored on local FS of Map and Reduce
workers
• Output is often input to another MapReduce task
MapReduce: Word Counting
32
Big document (input text):
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/mache partnership. '"The work we're doing now -- the robotics we're doing -
(The, 1)
(crew, 1)
(of, 1)
(the, 1)
(space, 1)
(shuttle, 1)
(Endeavor, 1)
(recently, 1)
….
(crew, 1)
(crew, 1)
(space, 1)
(the, 1)
(the, 1)
(the, 1)
(shuttle, 1)
(recently, 1)
…
(crew, 2)
(space, 1)
(the, 3)
(shuttle, 1)
(recently, 1)
…
MAP: Reads the input and produces a set of (key, value) pairs
Group by key: Collects all pairs with the same key
Reduce: Collects all values belonging to the key and outputs the result
[Figure annotations: MAP and Reduce are provided by the programmer; the data is read sequentially (only sequential reads); each emitted (key, 1) pair means the key appears 1 time]
Word Count Using MapReduce
33
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
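The pseudocode above translates directly into runnable code. Below is a minimal in-memory Python sketch (not part of the original slides): the shuffle is simulated with a dictionary, and the sample document is just an illustration.

from collections import defaultdict

def map_fn(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    yield (word, sum(counts))

def run_mapreduce(documents):
    # Map phase, with the group-by-key (shuffle) simulated by a dictionary.
    intermediate = defaultdict(list)
    for doc_name, text in documents.items():
        for key, value in map_fn(doc_name, text):
            intermediate[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    output = {}
    for key, values in intermediate.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output

docs = {"doc.txt": "the crew of the space shuttle"}
print(run_mapreduce(docs))  # {'the': 2, 'crew': 1, 'of': 1, 'space': 1, 'shuttle': 1}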
Refinement: Partition Function
36
• Want to control how keys get partitioned
• Inputs to map tasks are created by contiguous splits of input file
• The reduce phase needs all records with the same intermediate key to end up
at the same worker
• System uses a default partition function:
• hash(key) mod R
• Sometimes useful to override the hash function:
• E.g., hash(hostname(URL)) mod R ensures URLs from a host
end up in the same output file
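As a rough sketch of this idea (not from the slides; R, the key format, and the use of Python's urlparse are assumptions made for the example):

from urllib.parse import urlparse

R = 10  # number of reduce tasks, chosen arbitrarily for the example

def default_partition(key, R):
    # Default behaviour: spread keys uniformly over the R reducers.
    return hash(key) % R

def host_partition(url, R):
    # Override: all URLs from the same host land on the same reducer,
    # so they end up in the same output file.
    return hash(urlparse(url).netloc) % R

print(host_partition("http://example.com/a", R) ==
      host_partition("http://example.com/b", R))  # True: same host, same reducer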
Selection by Map-Reduce
• Given a table R, compute the operation σ_condition(R), where
condition is a logical expression. Selection chooses the tuples from R
that satisfy the condition.
• Alternatively, in SQL, the operation is
• SELECT attributes
• FROM R
• WHERE “condition is true”
37
R:
A B
a1 b1
a2 b1
a3 b2
a4 b3

σ_{B=b1}(R):
A B
a1 b1
a2 b1
Selection by Map-Reduce
38
• Mapper
• Emit only tuples that satisfy the selection condition
• The (key, value) pairs are of the form (t, t), where t is a tuple satisfying the
condition.
• In the previous example
((a1,b1),(a1,b1))
((a2,b1),(a2,b1))
• Reducer
• No reducer is required
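A small Python sketch of this map-only selection (illustrative; the tuple layout and the condition B = b1 follow the earlier example):

def selection_map(t, condition):
    # Emit (t, t) only for tuples that satisfy the selection condition.
    if condition(t):
        yield (t, t)

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
condition = lambda t: t[1] == "b1"  # sigma_{B=b1}

for t in R:
    for pair in selection_map(t, condition):
        print(pair)  # (('a1', 'b1'), ('a1', 'b1')) and (('a2', 'b1'), ('a2', 'b1'))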
GROUP BY and AGGREGATION by Map-Reduce
39
• Compute the query, Q:
SELECT A,B,SUM(C)
FROM R
GROUP BY A,B
• R(A,B,C) is stored in a file

R:
A B C
A1 b1 1
A1 b1 2
A3 b2 3
A3 b2 4

γ_{A,B,SUM(C)}(R):
A B C
A1 b1 3
A3 b2 7
GROUP BY and AGGREGATION by Map-Reduce
40
• Mapper
• The (Key, Value) pairs are of the form:
• Key = <attributes in GROUP BY>
• Value = <attributes in AGGREGATION>
• In the previous example
((A1,b1),1)
((A1,b1),2)
((A3,b2),3)
((A3,b2),4)
• Reducer
• Applies the aggregation operation on the list of values and outputs the results
• In the previous example
(A1,b1,3)
(A3,b2,7)
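A Python sketch of the same grouping and aggregation (illustrative; the shuffle is simulated in memory and SUM is the aggregate, as in the query above):

from collections import defaultdict

def groupby_map(t):
    # Key = (A, B), Value = C
    a, b, c = t
    yield ((a, b), c)

def groupby_reduce(key, values):
    # Apply the aggregation (here SUM) to the grouped values.
    a, b = key
    yield (a, b, sum(values))

R = [("A1", "b1", 1), ("A1", "b1", 2), ("A3", "b2", 3), ("A3", "b2", 4)]
groups = defaultdict(list)
for t in R:
    for key, value in groupby_map(t):
        groups[key].append(value)
for key, values in groups.items():
    print(next(groupby_reduce(key, values)))  # ('A1', 'b1', 3) and ('A3', 'b2', 7)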
Natural Join By Map-Reduce
41
• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)
R:
A B
a1 b1
a2 b1
a3 b2
a4 b3

S:
B C
b2 c1
b2 c2
b3 c3

R ⋈ S:
A B C
a3 b2 c1
a3 b2 c2
a4 b3 c3
Map-Reduce Join
42
• Use a hash function h from B-values to 1...k
• A Map process turns:
• Each input tuple R(a,b) into key-value pair (b,(a,R))
• Each input tuple S(b,c) into (b,(c,S))
• Map processes send each key-value pair with key b to Reduce process h(b)
• Hadoop does this automatically; just tell it what k is.
• Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and
outputs (a,b,c).
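A Python sketch of this join (illustrative; the hash-and-route step is simulated by simply grouping on b, so k stays implicit):

from collections import defaultdict

def join_map(t, relation):
    # R(a,b) -> (b, (a, 'R'));  S(b,c) -> (b, (c, 'S')), as on the slide.
    if relation == "R":
        a, b = t
        yield (b, (a, "R"))
    else:
        b, c = t
        yield (b, (c, "S"))

def join_reduce(b, values):
    # Match every (a, 'R') with every (c, 'S') sharing this b and output (a, b, c).
    r_side = [v for v, tag in values if tag == "R"]
    s_side = [v for v, tag in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield (a, b, c)

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
groups = defaultdict(list)
for t in R:
    for key, value in join_map(t, "R"):
        groups[key].append(value)
for t in S:
    for key, value in join_map(t, "S"):
        groups[key].append(value)
for b, values in groups.items():
    for row in join_reduce(b, values):
        print(row)  # ('a3','b2','c1'), ('a3','b2','c2'), ('a4','b3','c3')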
Matrix-Vector Multiplication with Map/Reduce
Task: Compute the product C = A·B

A = [[1, 2, 0], [0, 3, 4], [5, 0, 6]] (3x3 matrix, rows listed in order)
B = (0, 10, 20) (column vector)
C = A·B = (20, 110, 120) (column vector)
Computing Sparse Matrix-Vector Product
● Represent the matrix as a list of nonzero entries 〈row, col, value, matrixID〉
● Strategy
○ Phase 1: Compute all products a_{i,k} · b_{k,j}
○ Phase 2: Sum the products for each entry i,j
○ Each phase involves a Map/Reduce

Nonzero entries of A: 〈1, 1, 1, A〉, 〈1, 2, 2, A〉, 〈2, 2, 3, A〉, 〈2, 3, 4, A〉, 〈3, 1, 5, A〉, 〈3, 3, 6, A〉
Nonzero entries of B: 〈2, 1, 10, B〉, 〈3, 1, 20, B〉
Group by - Map of Matrix-Vector Multiply
Group values a_{i,k} and b_{k,j} according to key k

Mapping of the initial representation to intermediate keys:
〈row, col, value, matrixID〉 → (col, (matrixID, row, value))

For A (key = col):
〈1, 1, 1, A〉 → (1, (A, 1, 1))
〈1, 2, 2, A〉 → (2, (A, 1, 2))
〈2, 2, 3, A〉 → (2, (A, 2, 3))
〈2, 3, 4, A〉 → (3, (A, 2, 4))
〈3, 1, 5, A〉 → (1, (A, 3, 5))
〈3, 3, 6, A〉 → (3, (A, 3, 6))

For B (key = row):
〈2, 1, 10, B〉 → (2, (B, 1, 10))
〈3, 1, 20, B〉 → (3, (B, 1, 20))

Group by intermediate keys:
Key = 1: (1, (A, 1, 1)), (1, (A, 3, 5))
Key = 2: (2, (A, 1, 2)), (2, (A, 2, 3)), (2, (B, 1, 10))
Key = 3: (3, (A, 2, 4)), (3, (A, 3, 6)), (3, (B, 1, 20))
Group by - Reduce of Matrix-Vector Multiply
Generate all products a_{i,k} · b_{k,j}

Key = 1: (1, (A, 1, 1)), (1, (A, 3, 5)) (no matching B entry, so no products)
Key = 2: (2, (A, 1, 2)), (2, (A, 2, 3)) x (2, (B, 1, 10))
Key = 3: (3, (A, 2, 4)), (3, (A, 3, 6)) x (3, (B, 1, 20))

Products produced:
〈1, 1, 2x10=20, C〉
〈2, 1, 3x10=30, C〉
〈2, 1, 4x20=80, C〉
〈3, 1, 6x20=120, C〉
Aggregate - Map of Matrix-Vector Multiply
Group products a_{i,k} · b_{k,j} with matching values of i and j

Mapping of the intermediate representation to intermediate keys:
〈row, col, value, matrixID〉 → ((row, col), (matrixID, value))

〈1, 1, 20, C〉 → ((1, 1), (C, 20))
〈2, 1, 30, C〉 → ((2, 1), (C, 30))
〈2, 1, 80, C〉 → ((2, 1), (C, 80))
〈3, 1, 120, C〉 → ((3, 1), (C, 120))

Group by intermediate keys (key = (row, col)):
Key = 1,1: ((1, 1), (C, 20))
Key = 2,1: ((2, 1), (C, 30)), ((2, 1), (C, 80))
Key = 3,1: ((3, 1), (C, 120))
Aggregate - Reduce of Matrix-Vector Multiply
Sum the products to get the final entries

Key = 1,1: ((1, 1), (C, 20)) → 〈1, 1, 20, C〉
Key = 2,1: ((2, 1), (C, 30)) + ((2, 1), (C, 80)) → 〈2, 1, 110, C〉
Key = 3,1: ((3, 1), (C, 120)) → 〈3, 1, 120, C〉

Result: C = (20, 110, 120) (column vector)
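Putting the two phases together, here is a compact Python sketch of the whole computation for this example (illustrative; the shuffles are simulated with dictionaries and entries use the 〈row, col, value, matrixID〉 format from the slides):

from collections import defaultdict

A = [(1, 1, 1, "A"), (1, 2, 2, "A"), (2, 2, 3, "A"),
     (2, 3, 4, "A"), (3, 1, 5, "A"), (3, 3, 6, "A")]
B = [(2, 1, 10, "B"), (3, 1, 20, "B")]

# Phase 1 Map: key on k (A's column, B's row); Reduce: form all products a_ik * b_kj.
phase1 = defaultdict(lambda: {"A": [], "B": []})
for row, col, value, mid in A:
    phase1[col]["A"].append((row, value))   # key = col, keep (row, value) of A
for row, col, value, mid in B:
    phase1[row]["B"].append((col, value))   # key = row, keep (col, value) of B

products = []                               # entries <i, j, a_ik*b_kj, C>
for k, sides in phase1.items():
    for i, a in sides["A"]:
        for j, b in sides["B"]:
            products.append((i, j, a * b, "C"))

# Phase 2 Map: key on (row, col); Reduce: sum the partial products.
phase2 = defaultdict(int)
for i, j, value, mid in products:
    phase2[(i, j)] += value

print(sorted(phase2.items()))  # [((1, 1), 20), ((2, 1), 110), ((3, 1), 120)]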
Cost Measures for Algorithms
49
• In MapReduce we quantify the cost of an algorithm using
1. Communication cost = total I/O of all processes
2. Computation cost = analogous, but counting only the running time of the
processes
Note that here big-O notation is not the most
useful measure (adding more machines is always an option)
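For example, in the natural join of R(A,B) and S(B,C) described earlier, every input tuple is read once and shipped once as a key-value pair to a Reduce process, so the communication cost is proportional to |R| + |S|; the size of the join result itself is not counted here, since it typically becomes the input (and hence the cost) of a subsequent MapReduce job.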
Example: Hosting size
50
• Suppose we have a large web corpus
• Look at the metadata file
• Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
• That is, the sum of the page sizes for all URLs from that particular host
• Map
• For each record, output (hostname(URL), size)
• Reduce
• Sum up the sizes of each host
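A Python sketch of this job (illustrative; the metadata record layout and hostname extraction via urlparse are assumptions for the example):

from collections import defaultdict
from urllib.parse import urlparse

def host_map(record):
    # record: (URL, size, date, ...) -> emit (hostname(URL), size)
    url, size = record[0], record[1]
    yield (urlparse(url).netloc, size)

def host_reduce(host, sizes):
    # Total number of bytes for all URLs from this host.
    yield (host, sum(sizes))

metadata = [("http://example.com/a", 1200, "2024-01-01"),
            ("http://example.com/b", 800, "2024-01-02"),
            ("http://other.org/x", 500, "2024-01-01")]
groups = defaultdict(list)
for rec in metadata:
    for host, size in host_map(rec):
        groups[host].append(size)
for host, sizes in groups.items():
    print(next(host_reduce(host, sizes)))  # ('example.com', 2000) and ('other.org', 500)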
Example: Language Model
51
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents
• Very easy with MapReduce:
• Map:
• Extract (5-word sequence, count) from document
• Reduce:
• Combine the counts
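A Python sketch of the 5-word sequence count (illustrative; whitespace tokenization is an assumption for the example):

from collections import defaultdict

def ngram_map(doc_text, n=5):
    # Emit (5-word sequence, 1) for every position in the document.
    words = doc_text.split()
    for i in range(len(words) - n + 1):
        yield (tuple(words[i:i + n]), 1)

def ngram_reduce(seq, counts):
    # Combine the partial counts for one 5-word sequence.
    yield (seq, sum(counts))

corpus = ["the crew of the space shuttle endeavor recently returned to earth"]
groups = defaultdict(list)
for doc in corpus:
    for seq, one in ngram_map(doc):
        groups[seq].append(one)
for seq, counts in groups.items():
    print(next(ngram_reduce(seq, counts)))  # each distinct 5-word sequence with its count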