Design patterns in MapReduce
Akhilesh Joshi
akhileshjoshi123@gmail.com
▪Local Aggregation
▪Pairs and Stripes
▪Order inversion
▪Graph algorithms
▪ In Hadoop, intermediate results are written to local disk before being sent over the
network. Since network and disk latencies are relatively expensive compared to
other operations, reductions in the amount of intermediate data translate into
increases in algorithmic efficiency.
▪ In MapReduce, local aggregation of intermediate results is one of the keys to
efficient algorithms.
▪ Hence we use a COMBINER to perform this local aggregation and reduce the
intermediate key-value pairs passed from the Mapper to the Reducer.
▪ Combiners provide a general mechanism within the MapReduce framework to
reduce the amount of intermediate data generated by the mappers.
▪ They can be understood as mini-reducers that process the output of mappers.
▪ Combiners aggregate term counts across the documents processed by each map
task
▪ CONCLUSION
This results in a reduction in the number of intermediate key-value pairs that
need to be shuffled across the network ==> from the order of the total number of terms
in the collection down to the order of the number of unique terms in the collection.
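As an illustration, here is a minimal Java sketch of word count with the reducer reused as a combiner (class and method names are illustrative, not from the slides):

// Sketch: classic word count where the sum reducer is reused as a combiner.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountWithCombiner {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        context.write(word, ONE);          // one pair per term occurrence
      }
    }
  }

  // Sums counts; usable as both combiner and reducer because
  // addition is associative and commutative.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      sum.set(total);
      context.write(word, sum);
    }
  }

  // In the driver: job.setCombinerClass(IntSumReducer.class);
}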
An associative array (i.e., Map in Java) is introduced inside the mapper to tally up
term counts within a single document: instead of emitting a key-value pair for each
term in the document, this version emits a key-value pair for each unique term in the
document.
NOTE: the reducer is not changed!
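A possible Java sketch of this per-document aggregation (an assumption-based illustration; the sum reducer stays exactly as in plain word count):

// Sketch: tally term counts inside a single call to map(), so the mapper
// emits one pair per *unique* term in the document rather than one per occurrence.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDocumentAggregationMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable docid, Text doc, Context context)
      throws IOException, InterruptedException {
    // Associative array scoped to one document
    Map<String, Integer> counts = new HashMap<>();
    StringTokenizer tok = new StringTokenizer(doc.toString());
    while (tok.hasMoreTokens()) {
      counts.merge(tok.nextToken(), 1, Integer::sum);
    }
    // One emission per unique term in this document
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}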
In this case, we initialize an associative array for holding term counts. Since it is
possible to preserve state across multiple calls of the Map method (for each input
key-value pair), we can continue to accumulate partial term counts in the associative
array across multiple documents, and emit key-value pairs only when the mapper
has processed all documents. That is, emission of intermediate data is deferred until
the Close method in the pseudo-code.
IN-MAPPER COMBINER IMPLEMENTATION
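A minimal Java sketch of the in-mapper combiner, assuming Hadoop's cleanup() plays the role of the Close method in the pseudo-code:

// Sketch: the associative array lives across all map() calls for the task's
// input split, and emission is deferred to cleanup().
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombinerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();            // state preserved across documents
  }

  @Override
  protected void map(LongWritable docid, Text doc, Context context) {
    StringTokenizer tok = new StringTokenizer(doc.toString());
    while (tok.hasMoreTokens()) {
      counts.merge(tok.nextToken(), 1, Integer::sum);   // no emission here
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Deferred emission: one pair per unique term in the whole split
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}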
▪ The advantage of this design pattern is that we have full control over the local
aggregation
▪ In-mapper combiners should be preferred over actual combiners, since actual
combiners incur the overhead of creating and destroying intermediate objects
▪ An actual combiner does reduce the amount of intermediate data, but the mapper
still emits every key-value pair first and the combiner only aggregates them
afterwards; the in-mapper combiner avoids generating them at all
▪ Yes, there are disadvantages! The in-mapper combiner tweaks the mapper to preserve
state across documents
▪ This breaks the functional model: the outcome of the algorithm may depend on the
order in which key-value pairs arrive. We call this an ORDER-DEPENDENT BUG!
▪ Such a problem is difficult to detect when we are dealing with large datasets
Another disadvantage?
▪ Sufficient memory is needed to hold all the partial counts until every key-value
pair has been processed (in our word count example, the vocabulary may grow
larger than the in-memory associative array can hold!)
▪ SOLUTION: flush the in-memory array periodically by maintaining a counter, as
sketched below. The size of the blocks to be flushed is empirical and hard to determine.
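A hypothetical variant of the map() method from the in-mapper combiner sketch above, showing such periodic flushing (FLUSH_THRESHOLD is an assumed, tunable value):

// Drop-in replacement for map() in InMapperCombinerMapper, plus a helper.
private static final int FLUSH_THRESHOLD = 100_000;   // assumed value, tune per job

@Override
protected void map(LongWritable docid, Text doc, Context context)
    throws IOException, InterruptedException {
  StringTokenizer tok = new StringTokenizer(doc.toString());
  while (tok.hasMoreTokens()) {
    counts.merge(tok.nextToken(), 1, Integer::sum);
  }
  if (counts.size() >= FLUSH_THRESHOLD) {
    flush(context);                       // partial aggregation is still correct
  }
}

private void flush(Context context) throws IOException, InterruptedException {
  for (Map.Entry<String, Integer> e : counts.entrySet()) {
    context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
  }
  counts.clear();                         // bound the memory footprint
}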
▪ Problem Statement: compute the mean of the values for each key (say the key is an
alphanumeric employee id and the value is a salary)
▪ Addition and Multiplication are associative
▪ Division and Subtraction are not associative
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer
3. Using in-memory combiner to increase efficiency of approach 2
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer : DOES NOT WORK
3. Using in-memory combiner to increase efficiency of approach 2
WHY?
Because average is not associative: each combiner will compute an average from its
own mapper's output and send it to the reducer, and the reducer will then combine
these averages into another average. This leads to a wrong result, since
AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2), AVERAGE(3,4,5)): the left side is 3,
while the right side is AVERAGE(1.5, 4) = 2.75.
▪ NOTES :
▪ NO COMBINER USED
▪ AVERAGE IS CALCULATED IN REDUCER
▪ MAPPER USED : IDENTITY MAPPER
▪ This algorithm works but has some problems
Problems:
1. requires shuffling all key-value pairs from mappers to reducers
across the network
2. reducer cannot be used as a combiner
INCORRECT
Notes:
1. Combiners used
2. Wrong, since the output of the combiner must match the output of the mapper:
here the output of the combiner is a (sum, count) pair, whereas the output of
the mapper was just a list of integers
3. This breaks a basic MapReduce rule
CORRECT
Notes:
Correct implementation of the combiner, since the output of the mapper matches
the output of the combiner.
What if I don't use a combiner?
The reducer will still be able to calculate the mean correctly at the end; the
combiner just acts as an intermediary to reduce the reducer's workload.
Also, the output of the reducer need not be the same as that of the combiner or
mapper.
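A minimal Java sketch of this correct design, packing (sum, count) into a Text value so the combiner's output type matches the mapper's (the tab-separated "employeeId<TAB>salary" input format is an assumption for illustration; production code would use a custom pair Writable):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanWithCombiner {

  public static class SalaryMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      // value is (sum, count) = (salary, 1)
      context.write(new Text(parts[0]), new Text(parts[1] + ",1"));
    }
  }

  // Combiner: sums partial (sum, count) pairs; emits the same value type as the mapper.
  public static class SumCountCombiner
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] sc = v.toString().split(",");
        sum += Double.parseDouble(sc[0]);
        count += Long.parseLong(sc[1]);
      }
      context.write(id, new Text(sum + "," + count));
    }
  }

  // Reducer: only here is the division performed.
  public static class MeanReducer
      extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text id, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] sc = v.toString().split(",");
        sum += Double.parseDouble(sc[0]);
        count += Long.parseLong(sc[1]);
      }
      context.write(id, new DoubleWritable(sum / count));
    }
  }
}

In the driver, one would wire these with job.setCombinerClass(SumCountCombiner.class) and job.setReducerClass(MeanReducer.class).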
MORE EFFICIENT THAN ALL OTHER VERSIONS
Notes :
▪ Inside the mapper, the partial sums and counts
associated with each string are held in memory
across input key-value pairs
▪ Intermediate key-value pairs are emitted only after
the entire input split has been processed
▪ The in-mapper combiner uses resources efficiently to reach the desired result
▪ The workload on the Reducer is somewhat reduced
▪ WE ARE EMITTING SUM AND COUNT TO REACH THE AVERAGE, i.e. associative
operations for a non-associative (average) result.
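A sketch of the corresponding in-mapper combiner version, holding per-key partial sums and counts in memory and emitting them in cleanup() (same assumed input format and MeanReducer as above):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperMeanMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private Map<String, Double> sums;
  private Map<String, Long> counts;

  @Override
  protected void setup(Context context) {
    sums = new HashMap<>();              // partial sums per employee id
    counts = new HashMap<>();            // partial counts per employee id
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    String[] parts = line.toString().split("\t");
    sums.merge(parts[0], Double.parseDouble(parts[1]), Double::sum);
    counts.merge(parts[0], 1L, Long::sum);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit one (sum, count) pair per key after the entire split is processed
    for (String id : sums.keySet()) {
      context.write(new Text(id), new Text(sums.get(id) + "," + counts.get(id)));
    }
  }
}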
▪ The concept of stripes is to aggregate data prior to the Reducers by using a
Combiner. There are several benefits to this, discussed below. With pairing, when a
Mapper completes, its intermediate data sits idle until all Mappers are complete.
With striping, the intermediate data is passed to the Combiner, which can start
processing through the data like a Reducer. So, instead of Mappers sitting idle,
they can execute the Combiner until the slowest Mapper finishes.
▪ Link: http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-stripes-explained/
“PAIRS”
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪ The mapper:
▪ Processes each input document
▪ Emits key-value pairs with:
▪ Each co-occurring word pair as the key
▪ The integer one (the count) as the value
▪ This is done with two nested loops:
▪ The outer loop iterates over all words
▪ The inner loop iterates over all neighbors
▪ The reducer:
▪ Receives pairs relative to co-occurring words
▪ This requires modifying the partitioner
▪ Computes an absolute count of the joint event
▪ Emits the pair and the count as the final key-value output
▪ Basically reducers emit the cells of the matrix
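A minimal Java sketch of the pairs mapper (the pair is packed into a Text key for brevity; a custom WritableComparable is the usual production choice, and the window size is an assumed parameter):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CooccurrencePairsMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private static final int WINDOW = 2;   // assumed neighborhood size

  @Override
  protected void map(LongWritable docid, Text doc, Context context)
      throws IOException, InterruptedException {
    String[] words = doc.toString().split("\\s+");
    for (int i = 0; i < words.length; i++) {          // outer loop: all words
      int lo = Math.max(0, i - WINDOW);
      int hi = Math.min(words.length - 1, i + WINDOW);
      for (int j = lo; j <= hi; j++) {                // inner loop: neighbors
        if (j == i) continue;
        context.write(new Text(words[i] + "," + words[j]), ONE);
      }
    }
  }
}

// The reducer is a plain sum over counts, emitting one matrix cell per pair.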
“STRIPES”
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪ The mapper:
▪ Same two nested loops structure as before
▪ Co-occurrence information is first stored in an associative array
▪ Emit key-value pairs with words as keys and the corresponding arrays as values
▪ The reducer:
▪ Receives all associative arrays related to the same word
▪ Performs an element-wise sum of all associative arrays with the same key
▪ Emits key-value output in the form of word, associative array
▪ Basically, reducers emit rows of the co-occurrence matrix
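A minimal Java sketch of the stripes approach, using Hadoop's MapWritable as the associative array (window size again assumed):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceStripes {

  public static class StripesMapper
      extends Mapper<LongWritable, Text, Text, MapWritable> {
    private static final int WINDOW = 2;   // assumed neighborhood size

    @Override
    protected void map(LongWritable docid, Text doc, Context context)
        throws IOException, InterruptedException {
      String[] words = doc.toString().split("\\s+");
      for (int i = 0; i < words.length; i++) {
        MapWritable stripe = new MapWritable();   // neighbors of words[i]
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(words.length - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (j == i) continue;
          Text neighbor = new Text(words[j]);
          IntWritable c = (IntWritable) stripe.get(neighbor);
          stripe.put(neighbor, new IntWritable(c == null ? 1 : c.get() + 1));
        }
        context.write(new Text(words[i]), stripe);
      }
    }
  }

  // Element-wise sum of all stripes for the same word; also usable as a combiner.
  public static class StripesReducer
      extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable sum = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable cur = (IntWritable) sum.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          sum.put(e.getKey(), new IntWritable(cur == null ? add : cur.get() + add));
        }
      }
      context.write(word, sum);   // one co-occurrence matrix row per word
    }
  }
}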
“PAIRS”
▪ Generates a large number of (intermediate) key-value pairs
▪ The benefit from combiners is limited, as it is less likely for a mapper to process
multiple occurrences of a word
▪ Does not suffer from memory paging problems
“STRIPES”
▪ More compact
▪ Generates fewer and shorter intermediate keys
▪ Can make better use of combiners
▪ The framework has less sorting to do
▪ The values are more complex and have serialization/deserialization overhead
▪ Greatly benefits from combiners, as the key space is the vocabulary
▪ Suffers from memory paging problems, if not properly engineered
“STRIPES”
▪ Idea: group together pairs into an associative array
▪ Each mapper takes a sentence:
▪ Generate all co-occurring term pairs
▪ For each term a, emit a → { b: count_b, c: count_c, d: count_d, … }
▪ Reducers perform element-wise sum of associative arrays
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2
⇒ a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

  a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
▪ Combiners can be used with both pairs and stripes, but the implementation of
combiners in stripes gives better results because of the associative array
▪ The stripes approach might encounter memory problems, since it tries to fit the
associative array into memory
▪ The pairs approach doesn't face such problems w.r.t. keeping the in-memory space
▪ THE STRIPES APPROACH PERFORMED BETTER THAN PAIRS, BUT EACH HAS ITS
OWN SIGNIFICANCE.
▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary
into buckets and applying stripes to individual buckets; this in turn reduces the
memory allocation required for the stripes approach.
ORDER INVERSION
▪ A drawback of the co-occurrence matrix is that some word pairs appear together
very frequently simply because one of the words is very common
▪ Solution:
▪ convert absolute counts into relative frequencies f(wj | wi). That is, what proportion of
the time does wj appear in the context of wi?
▪ f(wj | wi) = N(wi, wj) / Σw' N(wi, w'), where N(·, ·) is the number of times a
co-occurring word pair is observed
▪ The denominator is called the marginal (the sum of the counts of the conditioning
variable co-occurring with anything else)
▪ In the reducer, the counts of all words that co-occur with the conditioning variable
(wi) are available in the associative array
▪ Hence, the sum of all those counts gives the marginal
▪ Then we divide the joint counts by the marginal and we're done
▪ The reducer receives the pair (wi , wj) and the count
▪ From this information alone it is not possible to compute f(wj|wi)
▪ Fortunately, as for the mapper, also the reducer can preserve state across multiple
keys
▪ We can buffer in memory all the words that co-occur with wi and their counts
▪ This is basically building the associative array in the stripes method
We must define the sort order of the pair
▪ In this way, the keys are first sorted by the left word, and then by the right word (in the
pair)
▪ Hence, we can detect if all pairs associated with the word we are conditioning on (wi)
have been seen
▪ At this point, we can use the in-memory buffer, compute the relative frequencies and emit
We must ensure that all pairs with the same left word are sent to the same reducer. This cannot be done automatically,
hence we use a custom partitioner to achieve this task . . .
▪ Emit a special key-value pair to capture the marginal
▪ Control the sort order of the intermediate key, so that the special key-value pair is
processed first
▪ Define a custom partitioner for routing intermediate key-value pairs
▪ Preserve state across multiple keys in the reducer
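A minimal Java sketch of these four ingredients, using Text keys of the form "left,right" (an illustrative encoding; real implementations usually define a custom pair writable). Because "*" sorts before letters in Text's byte order, the special "(wi,*)" key reaches the reducer before any "(wi,word)" key:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderInversion {

  // Mapper: like the pairs mapper above, but for every co-occurring pair
  // (wi, wj) it also emits the special marginal key (wi, *).
  public static class PairsWithMarginalMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final int WINDOW = 2;   // assumed neighborhood size

    @Override
    protected void map(LongWritable docid, Text doc, Context context)
        throws IOException, InterruptedException {
      String[] words = doc.toString().split("\\s+");
      for (int i = 0; i < words.length; i++) {
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(words.length - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (j == i) continue;
          context.write(new Text(words[i] + "," + words[j]), ONE);
          context.write(new Text(words[i] + ",*"), ONE);   // marginal contribution
        }
      }
    }
  }

  // Route all pairs with the same left word to the same reducer.
  public static class LeftWordPartitioner
      extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String left = key.toString().split(",")[0];
      return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  public static class RelativeFrequencyReducer
      extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private double marginal;               // state preserved across keys

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (IntWritable c : counts) sum += c.get();

      if (key.toString().endsWith(",*")) {
        marginal = sum;                    // arrives first due to the sort order
      } else {
        context.write(key, new DoubleWritable(sum / marginal));
      }
    }
  }

  // In the driver: job.setPartitionerClass(LeftWordPartitioner.class);
}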
▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it
processes the data that generated them.
▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs,
e.g. what are the relative frequencies of words occurring within a small window of the word
"dog"? The mapper counts word pairs in the corpus, so its output looks like:
((dog, cat), 125)
((dog, foot), 246)
▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog,*),
5348)
▪ Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing
the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*),
5348) first, followed by all the other counts, and can trivially store the total and then output relative
frequencies.
The benefit of the pattern is that it avoids an extra MapReduce
iteration without creating any additional scalability bottleneck.
SECONDARY SORT
▪ Input to reducers is sorted by key
▪ Values are arbitrarily ordered
▪ We may want to order reducer values either ascending or descending.
▪ Solution:
▪ Buffer reducer values in memory and sort
▪ Disadvantage: if the data is too large, it may not fit in memory; it also creates
unnecessary objects on the memory heap
▪ Use the secondary sort design pattern in MapReduce
▪ Uses the framework's shuffle and sort machinery
▪ Reducer values will be sorted
▪ Secondary key sorting is done by creating a composite key
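A minimal Java sketch of the composite-key machinery, encoding the composite key as "naturalKey#value" Text (an illustrative simplification; plain Text sorting orders numeric values lexicographically, so numbers would need zero-padding or a custom WritableComparable):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySort {

  private static String naturalKey(Text composite) {
    return composite.toString().split("#")[0];
  }

  // All composite keys with the same natural key go to one reducer.
  public static class NaturalKeyPartitioner
      extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      return (naturalKey(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group reducer input by natural key only, so one reduce() call sees all
  // values for it, already ordered by the composite key's sort.
  public static class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() {
      super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return naturalKey((Text) a).compareTo(naturalKey((Text) b));
    }
  }

  // Driver wiring (illustrative):
  //   the mapper emits new Text(naturalKey + "#" + value) as the key;
  //   job.setPartitionerClass(NaturalKeyPartitioner.class);
  //   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
}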
GRAPH ALGORITHMS
▪Parallel BFS
▪Page Rank