Lecture2-MapReduce - An introductory lecture to Map Reduce
1. Tutorial Meeting #2
Data Science and Machine Learning
Map-Reduce and the New Software Stack
Περικλής Ανδρίτσος
2. Introduction
• Once upon a time….
• There used to be just a single computer
• It used to perform calculations all by itself
• Later, in the years that followed….
• Data sources started proliferating
• The size of the data sources used in calculations exploded
• Advanced programming techniques started to appear
• That’s when advances in computer science came in and gave us “distributed
computation”
• MapReduce is a representative of such technology
3. MapReduce (Section 2)
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
• Map-reduce addresses all of the above
• Google’s computational/data manipulation model
• Elegant way to work with big data
4. Single Node Architecture (Section 2.1)
[Diagram: a single node with CPU, memory, and disk, running Machine Learning, Statistics, and “Classical” Data Mining]
Main needs: a storage infrastructure and a programming paradigm (for analysis)
5. Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• It takes even more to do something useful with the data, both in terms of
storing it and the time needed to analyze it!
• Today, a standard architecture for such problems is emerging:
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
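A rough sanity check of the read-time figure above (an illustrative back-of-the-envelope calculation, assuming ~35 MB/sec sequential reads over 400 TB):
400 TB ÷ 35 MB/sec ≈ 400×10^12 B ÷ 35×10^6 B/sec ≈ 1.1×10^7 sec ≈ 130+ days, i.e., roughly four months on a single machine.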
6. Big picture of the new paradigm
• Split the data into chunks
• Have multiple disks and CPUs
• Distribute the data chunks across multiple disks
• Process the data in parallel across CPUs
• Example:
• Time to read the data: 4 million secs (46+ days)
• If we had 1,000 CPUs we could do the task in
• 4 million / 1,000 = 4,000 seconds (about 1 hour)
7. Cluster Architecture (Section 2.1.1)
[Diagram: racks of commodity nodes, each with CPU, memory, and disk; nodes in a rack are connected by an in-rack switch, and racks are connected to each other through backbone switches]
• Each rack contains 16-64 nodes
• 1 Gbps bandwidth between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
• It has been estimated that Google has 1M machines, http://bit.ly/Shh0RO
9. Large-scale Computing
• Large-scale computing for data mining
problems on commodity hardware
• Challenges:
• How do you distribute computation?
• How can we make it easy to write distributed programs?
• Machines fail:
• One server may stay up 3 years (1,000 days)
• If you have 1,000 servers, expect to lose 1/day
• If Google has ~1M machines, then
• 1,000 machines fail every day!
10. Large-scale Computing (2)
• Problem #1: If nodes fail at such a rate, how do we make sure data is
stored PERSISTENTLY
• i.e., we are guaranteed to have data (or its copies) if machines that store it
fail.
• What do we do if nodes fail during a long-running process?
• E.g., do we restart the process all over again?
• Solution: a new infrastructure that “hides” all these failures
11. Large-scale Computing (3)
• Problem #2: Network Bottleneck
• If bandwidth = 1 Gbps, moving 10TB takes ~1 day
• In typical applications, data moves around to be analyzed
• In our case we would have to move enormous amounts of data to thousands
of servers, and that can be prohibitive
• Solution: a new framework that doesn’t move data around !
• In my opinion: the big beauty of MapReduce !!
12. Map-Reduce (Section 2.2)
• It addresses previously mentioned problems in the following ways:
• Stores data redundantly on multiple nodes (computers) for persistence and
availability
• Moves computation close to the data to minimize data movement. This is a
simple yet powerful idea.
• Offers a simple programming model that even non-expert programmers can
use
13. Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• Google’s computational/data manipulation model
• Elegant way to work with big data
• Storage Infrastructure – File system
• Google: GFS. Hadoop: HDFS
• Programming model
• Map-Reduce
14. Storage Infrastructure
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System (redundant storage):
• Provides global file namespace
• Implementations: Google GFS; Hadoop HDFS;
• Stores each data piece multiple times across servers
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place (e.g. storing URLs)
• Reads and appends are common
15. Distributed File System
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure
[Diagram: chunks C0, C1, C2, C3, C5, D0, D1 replicated across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]
• Chunk servers also serve as compute servers
• Bring computation directly to the data!
16. Distributed File System
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores information about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
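A minimal toy sketch of this client/master/chunk-server interaction (all class and method names here are illustrative assumptions, not the real GFS or HDFS API):

# Toy sketch: the master stores only metadata; clients fetch data
# directly from chunk servers.

class Master:
    """Stores only metadata: which chunk servers hold each chunk of a file."""
    def __init__(self, chunk_locations):
        # e.g. {("big.txt", 0): ["server-3", "server-7", "server-12"], ...}
        self.chunk_locations = chunk_locations

    def locate(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

class Client:
    """Client library: asks the master where a chunk lives, then reads the
    data directly from a chunk server (the master never handles file data)."""
    def __init__(self, master, chunk_servers):
        self.master = master
        self.chunk_servers = chunk_servers   # server name -> {(file, idx): bytes}

    def read_chunk(self, filename, chunk_index):
        for server in self.master.locate(filename, chunk_index):
            data = self.chunk_servers[server].get((filename, chunk_index))
            if data is not None:             # use the first live replica found
                return data
        raise IOError("all replicas unavailable")

master = Master({("big.txt", 0): ["server-1", "server-2"]})
client = Client(master, {"server-1": {("big.txt", 0): b"chunk data"},
                         "server-2": {}})
print(client.read_chunk("big.txt", 0))       # b'chunk data'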
17. Coordination: Master
• Master node takes care of coordination:
• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its
R intermediate files, one for each reducer
• Master pushes this info to reducers
• Master pings workers periodically to detect failures
18. Dealing with Failures
• Map worker failure
• Map tasks completed or in-progress at
worker are reset to idle
• Reduce workers are notified when task is rescheduled on another worker
• Reduce worker failure
• Only in-progress tasks are reset to idle
• Reduce task is restarted
• Master failure
• MapReduce task is aborted and client is notified
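A small Python sketch (toy bookkeeping, not actual Hadoop/Google code) of the task states and the failure rules on the last two slides:

# Toy sketch of the master's task bookkeeping and failure handling.
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

task_state  = {"map-0": COMPLETED, "map-1": IN_PROGRESS, "reduce-0": IN_PROGRESS}
task_worker = {"map-0": "w1", "map-1": "w1", "reduce-0": "w2"}
task_kind   = {"map-0": "map", "map-1": "map", "reduce-0": "reduce"}

def on_worker_failure(worker):
    for task, w in task_worker.items():
        if w != worker:
            continue
        if task_kind[task] == "map":
            # Map output sits on the failed worker's local disk, so even
            # completed map tasks must be reset to idle and redone.
            task_state[task] = IDLE
        elif task_state[task] == IN_PROGRESS:
            # Reduce output goes to the distributed FS, so only in-progress
            # reduce tasks are reset.
            task_state[task] = IDLE

on_worker_failure("w1")
print(task_state)   # {'map-0': 'idle', 'map-1': 'idle', 'reduce-0': 'in-progress'}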
19. Fault tolerance: Handled via re-execution
• On worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Task completion committed through master
• Robust: Google lost 1600 of 1800 machines, but kept running fine
20. Programming Model: MapReduce
Warm-up task:
• We have a huge text document
• Maybe 10-20TB of size
• Count the number of times each
distinct word appears in the file
• Sample application:
• Analyze web server logs to find popular URLs
• Keyword statistics for search
21. Task: Word Count
Case 1:
• File too large for memory, but all <word, count> pairs fit in memory ←
a simple program!
Case 2:
• Count occurrences of words:
• words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
• Case 2 captures the essence of MapReduce
• Great thing is that it is naturally parallelizable
22. MapReduce: Overview
The Unix pipeline words(doc.txt) | sort | uniq -c from the previous slide maps onto the steps below:
• Map
• Scan input file one record at a time
• Extract something you care about from each record (keys)
• Group by key
• Groups all the values with the same key
• Sort and shuffle
• Reduce
• Run a function
• Aggregate, summarize, filter or transform
• Write the answer
24. MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs (k, v) are grouped by key into key-value groups (k, [v, v, …]); a reduce call runs on each group and emits the output key-value pairs]
25. More Specifically
• Input: a set of key-value pairs
• Programmer specifies two methods:
• Map(k, v) → <k’, v’>*
• Takes a key-value pair and outputs a set of key-value pairs
• E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k,v) pair
• Reduce(k’, <v’>*) → <k’, v’’>*
• All values v’ with same key k’ are reduced together
and processed in v’ order
• There is one Reduce function call per unique key k’
• Note: * means a set
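A minimal, purely local Python sketch of this programming model (an illustrative simulator, not a real MapReduce runtime); the word-count slides later in the deck are one instance of this pattern:

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """map_fn(k, v) yields (k', v') pairs; reduce_fn(k', [v', ...]) yields (k', v'') pairs."""
    groups = defaultdict(list)
    for k, v in records:                     # one Map call per input (k, v) pair
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)            # group by intermediate key k'
    output = []
    for k2 in sorted(groups):                # one Reduce call per unique key k'
        output.extend(reduce_fn(k2, groups[k2]))
    return output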
26. Map-Reduce: Environment
Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program’s execution across a
set of machines
• Performing the group by key step
• Handling machine failures
• Managing required inter-machine
communication
27. Map-Reduce: A diagram (centralized)
[Diagram: Big document → MAP (Map function): read input and produce a set of key-value pairs → Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition) → Reduce (Reduce function): collect all values belonging to the key and output]
29. Map-Reduce
• Programmer specifies:
• Map and Reduce and input files
• Workflow:
• Read inputs as a set of key-value-pairs
• Map transforms input kv-pairs into a new set
of k'v'-pairs
• Sorts & Shuffles the k'v'-pairs to output
nodes
• All k’v’-pairs with a given k’ are sent to the
same reduce
• Reduce processes all k'v'-pairs grouped by key
into new k''v''-pairs
• Write the resulting pairs to files
• All phases are distributed with many tasks
doing the work
[Diagram: Input 0/1/2 → Map 0/1/2 → Shuffle → Reduce 0/1 → Out 0/1]
30. Data Flow
• Input and final output are stored on a distributed file system (FS):
• Scheduler tries to schedule map tasks “close” to physical storage location of
input data
• Intermediate results are stored on local FS of Map and Reduce
workers
• Output is often input to another MapReduce task
31. MapReduce: Word Counting
Big document (excerpt):
“The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. '"The work we're doing now -- the robotics we're doing -- …”
(The document is read sequentially; only sequential reads are needed.)

MAP (provided by the programmer): read the input and produce a set of (key, value) pairs; each key appears 1 time:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …

Group by key: collect all pairs with the same key:
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), …

Reduce (provided by the programmer): collect all values belonging to the key and output:
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
32. Word Count Using MapReduce
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
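A runnable Python version of the pseudocode above (a local, single-process sketch in which a dictionary plays the role of the group-by-key / shuffle step):

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1                  # emit(w, 1)

def reduce_fn(word, counts):
    yield word, sum(counts)            # emit(key, result)

documents = [("doc.txt", "the crew of the space shuttle the crew")]

groups = defaultdict(list)             # stands in for the shuffle / group-by-key
for name, text in documents:
    for word, one in map_fn(name, text):
        groups[word].append(one)

for word in sorted(groups):
    for w, total in reduce_fn(word, groups[word]):
        print(w, total)                # e.g. "crew 2", ..., "the 3"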
33. Refinement: Partition Function
• Want to control how keys get partitioned
• Inputs to map tasks are created by contiguous splits of input file
• The system needs to ensure that records with the same intermediate key end up
at the same worker
• System uses a default partition function:
• hash(key) mod R
• Sometimes useful to override the hash function:
• E.g., hash(hostname(URL)) mod R ensures URLs from a host
end up in the same output file
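A small sketch of the default partitioner and the URL-host override (R and the helper names are illustrative assumptions; real frameworks let you plug in a custom partitioner):

from urllib.parse import urlparse

R = 4  # number of reduce tasks

def default_partition(key):
    # Python's built-in hash is fine for illustration; real systems use a stable hash
    return hash(key) % R

def host_partition(url):
    # All URLs from the same host go to the same reduce task / output file
    return hash(urlparse(url).hostname) % R

print(host_partition("http://example.com/a"),
      host_partition("http://example.com/b"))   # same partition for both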
34. Selection by Map-Reduce
• Given a table R, compute the operation σ_condition(R), where condition is a
logical expression. Selection chooses the tuples from R that satisfy the
condition.
• Alternatively, in SQL, the operation is
• SELECT attributes
• FROM R
• WHERE “condition is true”
R:
A   B
a1  b1
a2  b1
a3  b2
a4  b3

σ_{B=b1}(R):
A   B
a1  b1
a2  b1
35. Selection by Map-Reduce
• Mapper
• Emit only tuples that satisfy the selection condition
• The (key, value) pairs are of the form (t, t), where t is a tuple satisfying the
condition.
• In the previous example
((a1,b1),(a1,b1))
((a2,b1),(a2,b1))
• Reducer
• No reducer is required
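A toy sketch of this map-only selection (the relation and condition below are the running example; helper names are assumptions):

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]

def selection_map(t, condition):
    if condition(t):
        yield t, t          # key and value are both the tuple itself

selected = [kv for t in R for kv in selection_map(t, lambda t: t[1] == "b1")]
print(selected)  # [(('a1','b1'), ('a1','b1')), (('a2','b1'), ('a2','b1'))]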
36. GROUP BY and AGGREGATION by Map-Reduce
• Compute the query, Q:
SELECT A,B,SUM(C)
FROM R
GROUP BY A,B
• R(A,B,C) is stored in a
file
R:
A   B   C
A1  b1  1
A1  b1  2
A3  b2  3
A3  b2  4

γ_{A,B,SUM(C)}(R):
A   B   SUM(C)
A1  b1  3
A3  b2  7
37. GROUP BY and AGGREGATION by Map-Reduce
• Mapper
• The (Key, Value) pairs are of the form:
• Key = <the attributes in the GROUP BY clause>
• Value = <the attributes in the aggregation>
• In the previous example
((A1,B1),1)
((A1,B1),2)
((A3,B2),3)
((A3,B2),4)
• Reducer
• Applies the aggregation operation on the list of values and outputs results
• In the previous example
(A1,B1,3)
(A3,B2,7)
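A toy, single-process sketch of this GROUP BY / SUM(C) job (helper names are assumptions):

from collections import defaultdict

R = [("A1", "b1", 1), ("A1", "b1", 2), ("A3", "b2", 3), ("A3", "b2", 4)]

def group_map(a, b, c):
    yield (a, b), c                    # key = GROUP BY attributes, value = C

def sum_reduce(key, values):
    a, b = key
    yield a, b, sum(values)            # apply the aggregation per group

groups = defaultdict(list)
for a, b, c in R:
    for key, value in group_map(a, b, c):
        groups[key].append(value)

for key in sorted(groups):
    print(next(sum_reduce(key, groups[key])))   # ('A1', 'b1', 3) then ('A3', 'b2', 7)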
38. Natural Join By Map-Reduce
• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)
R:
A   B
a1  b1
a2  b1
a3  b2
a4  b3

S:
B   C
b2  c1
b2  c2
b3  c3

R ⋈ S:
A   B   C
a3  b2  c1
a3  b2  c2
a4  b3  c3
39. Map-Reduce Join
• Use a hash function h from B-values to 1...k
• A Map process turns:
• Each input tuple R(a,b) into key-value pair (b,(a,R))
• Each input tuple S(b,c) into (b,(c,S))
• Map processes send each key-value pair with key b to Reduce process h(b)
• Hadoop does this automatically; just tell it what k is.
• Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and
outputs (a,b,c).
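A toy, in-memory sketch of this hash join (here a single dictionary stands in for the k Reduce processes; names are assumptions):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

groups = defaultdict(list)
for a, b in R:
    groups[b].append(("R", a))        # map: R(a,b) -> (b, (a, R))
for b, c in S:
    groups[b].append(("S", c))        # map: S(b,c) -> (b, (c, S))

joined = []
for b, tagged in groups.items():      # reduce for key b: match R- and S-tuples
    r_values = [x for tag, x in tagged if tag == "R"]
    s_values = [x for tag, x in tagged if tag == "S"]
    joined += [(a, b, c) for a in r_values for c in s_values]

print(sorted(joined))  # [('a3','b2','c1'), ('a3','b2','c2'), ('a4','b3','c3')]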
41. Computing Sparse Matrix-Vector Product
● Example: compute A × B, where
A = | 1 2 0 |      B = |  0 |
    | 0 3 4 |          | 10 |
    | 5 0 6 |          | 20 |
● Represent each matrix as a list of its nonzero entries ⟨row, col, value, matrixID⟩:
A: ⟨1, 1, 1, A⟩, ⟨1, 2, 2, A⟩, ⟨2, 2, 3, A⟩, ⟨2, 3, 4, A⟩, ⟨3, 1, 5, A⟩, ⟨3, 3, 6, A⟩
B: ⟨2, 1, 10, B⟩, ⟨3, 1, 20, B⟩
● Strategy
○ Phase 1: Compute all products a_{i,k} · b_{k,j}
○ Phase 2: Sum the products for each output entry (i, j)
○ Each phase involves a Map/Reduce
42. Group by - Map of Matrix-Vector Multiply
Mapping of the initial representation to intermediate keys: group values a_{i,k} and b_{k,j} according to key k.
For A, Key = col: ⟨row, col, value, matrixID⟩ → (col, (matrixID, row, value))
⟨1, 1, 1, A⟩ → (1, (A, 1, 1))
⟨1, 2, 2, A⟩ → (2, (A, 1, 2))
⟨2, 2, 3, A⟩ → (2, (A, 2, 3))
⟨2, 3, 4, A⟩ → (3, (A, 2, 4))
⟨3, 1, 5, A⟩ → (1, (A, 3, 5))
⟨3, 3, 6, A⟩ → (3, (A, 3, 6))
For B, Key = row: ⟨row, col, value, matrixID⟩ → (row, (matrixID, col, value))
⟨2, 1, 10, B⟩ → (2, (B, 1, 10))
⟨3, 1, 20, B⟩ → (3, (B, 1, 20))
Group by intermediate keys:
Key = 1: (1, (A, 1, 1)), (1, (A, 3, 5))
Key = 2: (2, (A, 1, 2)), (2, (A, 2, 3)), (2, (B, 1, 10))
Key = 3: (3, (A, 2, 4)), (3, (A, 3, 6)), (3, (B, 1, 20))
43. Group by - Reduce of Matrix-Vector Multiply
Generate all products a_{i,k} · b_{k,j} within each key group:
Key = 1: (1, (A, 1, 1)), (1, (A, 3, 5)) (no matching B value, so no products)
Key = 2: (2, (A, 1, 2)), (2, (A, 2, 3)) × (2, (B, 1, 10)) → ⟨1, 1, 2×10=20, C⟩, ⟨2, 1, 3×10=30, C⟩
Key = 3: (3, (A, 2, 4)), (3, (A, 3, 6)) × (3, (B, 1, 20)) → ⟨2, 1, 4×20=80, C⟩, ⟨3, 1, 6×20=120, C⟩
44. Aggregate - Map of Matrix-Vector Multiply
Group the products a_{i,k} · b_{k,j} with matching values of i and j.
Mapping of the intermediate representation to intermediate keys, Key = (row, col):
⟨row, col, value, matrixID⟩ → ((row, col), (matrixID, value))
⟨1, 1, 20, C⟩ → ((1, 1), (C, 20))
⟨2, 1, 30, C⟩ → ((2, 1), (C, 30))
⟨2, 1, 80, C⟩ → ((2, 1), (C, 80))
⟨3, 1, 120, C⟩ → ((3, 1), (C, 120))
Group by key (row, col):
Key = (1, 1): ((1, 1), (C, 20))
Key = (2, 1): ((2, 1), (C, 30)), ((2, 1), (C, 80))
Key = (3, 1): ((3, 1), (C, 120))
45. Aggregate - Reduce of Matrix-Vector Multiply
Sum the products to get the final entries:
Key = (1, 1): 20 → ⟨1, 1, 20, C⟩
Key = (2, 1): 30 + 80 = 110 → ⟨2, 1, 110, C⟩
Key = (3, 1): 120 → ⟨3, 1, 120, C⟩
Result: C = A × B = (20, 110, 120)ᵀ
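A toy, in-memory Python sketch of the whole two-phase computation, using the same ⟨row, col, value, matrixID⟩ representation (a sketch of the strategy, not a distributed implementation):

from collections import defaultdict

entries = [  # (row, col, value, matrix_id)
    (1, 1, 1, "A"), (1, 2, 2, "A"), (2, 2, 3, "A"),
    (2, 3, 4, "A"), (3, 1, 5, "A"), (3, 3, 6, "A"),
    (2, 1, 10, "B"), (3, 1, 20, "B"),
]

# Phase 1 map/group: key A-entries by column and B-entries by row
phase1 = defaultdict(lambda: {"A": [], "B": []})
for row, col, value, mid in entries:
    k = col if mid == "A" else row
    phase1[k][mid].append((row, col, value))

# Phase 1 reduce + Phase 2: multiply within each key group, then sum per (i, j)
products = defaultdict(int)
for group in phase1.values():
    for (i, _, a) in group["A"]:
        for (_, j, b) in group["B"]:
            products[(i, j)] += a * b

print(sorted(products.items()))  # [((1,1),20), ((2,1),110), ((3,1),120)]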
46. Cost Measures for Algorithms
• In MapReduce we quantify the cost of an algorithm using
1. Communication cost = total I/O of all processes
2. Computation cost: analogous, but counts only the running time of the
processes
Note that here the big-O notation is not the most
useful (adding more machines is always an option)
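As a quick worked example under these measures (a rough accounting, assuming we count only the input sizes of the tasks): for the natural-join job described earlier, the Map tasks read about |R| + |S| tuples and emit about |R| + |S| key-value pairs to the Reduce tasks, so the communication cost is on the order of |R| + |S|.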
47. Example: Hosting size
• Suppose we have a large web corpus
• Look at the metadata file
• Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
• That is, the sum of the page sizes for all URLs from that particular host
• Map
• For each record, output (hostname(URL), size)
• Reduce
• Sum up the sizes for each host
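A toy, in-memory sketch of this job (the metadata records and helper names below are illustrative assumptions):

from collections import defaultdict
from urllib.parse import urlparse

metadata = [  # (URL, size in bytes, date)
    ("http://example.com/a", 2000, "2014-01-01"),
    ("http://example.com/b", 3000, "2014-01-02"),
    ("http://other.org/x", 500, "2014-01-02"),
]

def host_map(url, size, date):
    yield urlparse(url).hostname, size      # emit (hostname(URL), size)

totals = defaultdict(int)
for record in metadata:
    for host, size in host_map(*record):
        totals[host] += size                # reduce: sum sizes per host

print(dict(totals))  # {'example.com': 5000, 'other.org': 500}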
48. Example: Language Model
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents
• Very easy with MapReduce:
• Map:
• Extract (5-word sequence, count) from document
• Reduce:
• Combine the counts
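A toy, in-memory sketch of the 5-gram counting job (the sample document and helper names are assumptions):

from collections import Counter

documents = ["the crew of the space shuttle returned to earth as ambassadors"]

def five_gram_map(text):
    words = text.split()
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5]), 1     # emit (5-word sequence, 1)

counts = Counter()
for doc in documents:
    for gram, one in five_gram_map(doc):
        counts[gram] += one                # reduce: combine the counts

print(counts.most_common(2))               # the two most frequent 5-grams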