CSE509: Introduction to Web Science and Technology
Lecture 4: Dealing with Large-Scale Web Data: Large-Scale File Systems and MapReduce
Muhammad Atif Qureshi
Web Science Research Group
Institute of Business Administration (IBA)

Last Time…
  • Search Engine Architecture
  • Overview of Web Crawling
  • Web Link Structure
  • Ranking Problem
  • SEO and Web Spam
  • Web Spam Research
July 30, 2011

Today
  • Web Data Explosion
  • Part I
      • MapReduce Basics
      • MapReduce Example and Details
      • MapReduce Case-Study: Web Crawler based on MapReduce Architecture
  • Part II
      • Large-Scale File Systems
      • Google File System Case-Study

Introduction
  • Web data sets can be very large
      • Tens to hundreds of terabytes
      • Cannot mine on a single server (why?)
  • “Big data” is a fact on the World Wide Web
  • Larger data implies effective algorithms
  • Web-scale processing: Data-intensive processing
  • Also applies to startups and niche players

How Much Data?
  • Google processes 20 PB a day (2008)
  • Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
  • eBay has 6.5 PB of user data + 50 TB/day (5/2009)
  • CERN’s LHC will generate 15 PB a year (??)

Cluster Architecture
[Figure: racks of nodes, each node with its own CPU, Mem, and Disk; each rack has a Switch, and the rack switches connect to a backbone Switch]
  • Each rack contains 16-64 nodes
  • 1 Gbps between any pair of nodes in a rack
  • 2-10 Gbps backbone between racks

Concerns
  • If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully
  • If one node fails, all its files would be unavailable until the node is replaced
      • Can also lead to permanent loss of files
Solutions: MapReduce and Google File System

PART I: MapReduce

Major Ideas
  • Scale “out”, not “up” (Distributed vs. SMP)
      • Limits of SMP and large shared-memory machines
  • Move processing to the data
      • Clusters have limited bandwidth
  • Process data sequentially, avoid random access
      • Seeks are expensive, disk throughput is reasonable
  • Seamless scalability
      • From the traditional “mythical man-month” approach to the newer notion of the “tradable machine-hour”
      • Twenty-one chickens together cannot make an egg hatch in a day

Traditional Parallelization: Divide and Conquer
[Figure: the “Work” is partitioned into units w1, w2, w3; each unit goes to a “worker”; the workers produce partial results r1, r2, r3, which are combined into the “Result”]

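The partition / worker / combine pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the job (summing numbers), the function names, and the use of a thread pool are hypothetical choices, not something the lecture prescribes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split a list into at most n_parts contiguous work units (w1, w2, ...)."""
    size = max(1, -(-len(work) // n_parts))  # ceiling division, at least 1
    return [work[i:i + size] for i in range(0, len(work), size)]


def worker(unit):
    """Hypothetical per-unit job: sum the numbers in one work unit."""
    return sum(unit)


def combine(partials):
    """Merge the partial results (r1, r2, ...) into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    units = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, units))
    return combine(partials)


total = divide_and_conquer(list(range(1, 101)))
print(total)
```

Even in this toy version the programmer has to decide how to split the work, how many workers to start, and how to merge results; nothing here handles a worker that fails.
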
Parallelization Challenges
  • How do we assign work units to workers?
  • What if we have more work units than workers?
  • What if workers need to share partial results?
  • How do we aggregate partial results?
  • How do we know all the workers have finished?
  • What if workers die?

Common Theme
  • Parallelization problems arise from:
      • Communication between workers (e.g., to exchange state)
      • Access to shared resources (e.g., data)
  • Thus, we need a synchronization mechanism

Parallelization is Hard
  • Traditionally, concurrency is difficult to reason about (uni to small-scale architecture)
  • Concurrency is even more difficult to reason about
      • At the scale of datacenters (even across datacenters)
      • In the presence of failures
      • In terms of multiple interacting services
  • Not to mention debugging…
  • The reality:
      • Write your own dedicated library, then program with it
      • Burden on the programmer to explicitly manage everything

Solution: MapReduce
  • Programming model for expressing distributed computations at a massive scale
  • Hides system-level details from the developers
      • No more race conditions, lock contention, etc.
  • Separating the what from how
      • Developer specifies the computation that needs to be performed
      • Execution framework (“runtime”) handles actual execution

What is MapReduce Used For?
  • At Google:
      • Index building for Google Search
      • Article clustering for Google News
      • Statistical machine translation
  • At Yahoo!:
      • Index building for Yahoo! Search
      • Spam detection for Yahoo! Mail
  • At Facebook:
      • Data mining
      • Ad optimization
      • Spam detection

Typical MapReduce Execution
  • Iterate over a large number of records
  • Extract something of interest from each (Map)
  • Shuffle and sort intermediate results
  • Aggregate intermediate results (Reduce)
  • Generate final output
Key idea: provide a functional abstraction for these two operations
(Dean and Ghemawat, OSDI 2004)

MapReduce Basics
  • Programmers specify two functions:
      map (k, v) -> <k’, v’>*
      reduce (k’, [v’]) -> <k’, v’>*
  • All values with the same key are sent to the same reducer, so reduce receives a key together with the list of its values
  • The execution framework handles everything else…

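To make the two signatures concrete, here is a minimal single-process sketch of the "everything else" the framework does between map and reduce. It is an assumption-laden toy: `run_mapreduce`, the sample records, and the max-per-city job are hypothetical names invented for illustration, and a real runtime distributes this work across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver: map every record, group values by key, reduce each group."""
    intermediate = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):      # map (k, v) -> <k', v'>*
            intermediate[k2].append(v2)        # same key ends up in the same group
    output = []
    for k2 in sorted(intermediate):            # process keys in sorted order
        output.extend(reducer(k2, intermediate[k2]))  # reduce (k', [v']) -> <k', v'>*
    return output


# Hypothetical job: highest reading per city.
records = [(1, ("karachi", 31)), (2, ("lahore", 35)), (3, ("karachi", 38))]


def mapper(record_id, reading):
    city, temperature = reading
    return [(city, temperature)]


def reducer(city, temperatures):
    return [(city, max(temperatures))]


result = run_mapreduce(records, mapper, reducer)
print(result)
```

The programmer writes only `mapper` and `reducer`; the grouping step in the middle is the part the slide attributes to the execution framework.
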
Warm Up Example: Word Count
  • We have a large file of words, one word to a line
  • Count the number of times each distinct word appears in the file
  • Sample application: analyze web server logs to find popular URLs

Word Count (2)
  • Case 1: Entire file fits in memory
  • Case 2: File too large for mem, but all <word, count> pairs fit in mem
  • Case 3: File on disk, too many distinct words to fit in memory
      sort datafile | uniq -c

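Case 2 can be sketched as a single streaming pass that keeps only the <word, count> pairs in memory. The sketch below assumes one word per line, as in the warm-up example; `count_words` and the in-memory sample standing in for the file are hypothetical.

```python
import io
from collections import Counter


def count_words(lines):
    """Stream lines one at a time; only the <word, count> table stays in memory."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:
            counts[word] += 1
    return counts


# A small in-memory stand-in for the large file (an open file object works the same way).
sample = io.StringIO("see\nbob\nrun\nsee\nspot\nthrow\n")
counts = count_words(sample)
print(counts.most_common(1))
```

When the table of distinct words no longer fits in memory (Case 3), this approach breaks down, which is why the slide falls back to sorting on disk.
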
Word Count (3)
  • To make it slightly harder, suppose we have a large corpus of documents
  • Count the number of times each distinct word occurs in the corpus
      words(docs/*) | sort | uniq -c
    where words takes a file and outputs the words in it, one to a line
  • The above captures the essence of MapReduce
  • Great thing is it is naturally parallelizable

Word Count using MapReduce

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

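The pseudocode translates almost line for line into Python generators. The sketch below is a single-process approximation: `word_count` is a hypothetical driver that stands in for the framework's shuffle-and-sort step, and splitting on whitespace is an assumed definition of "word".

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: document name; value: text of document
    for w in value.split():
        yield (w, 1)


def reduce_fn(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for v in values:
        result += v
    yield (key, result)


def word_count(documents):
    """Hypothetical driver: map all documents, sort by key, reduce each key group."""
    pairs = [kv for name, text in documents.items() for kv in map_fn(name, text)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle-and-sort phase
    output = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, total in reduce_fn(word, (v for _, v in group)):
            output[k] = total
    return output


counts = word_count({"doc1": "see bob run", "doc2": "see spot throw"})
print(counts)
```
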
Word Count Illustration

map(key=url, val=contents):
    For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
    Sum all “1”s in values list
    Emit result “(word, sum)”

Input documents:  see bob run / see spot throw
Map output:       see 1, bob 1, run 1, see 1, spot 1, throw 1
Reduce output:    bob 1, run 1, see 2, spot 1, throw 1

Implementation Overview
  • 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  • Limited bandwidth
  • Storage is on local IDE disks
  • GFS: distributed file system manages data (SOSP'03)
  • Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
  • Implementation at Google is a C++ library linked to user programs

Distributed Execution Overview
[Figure: Input files (split 0 to split 4) → Map phase → Intermediate files (on local disk) → Reduce phase → Output files (output file 0, output file 1)]
  (1) The User Program submits the job to the Master
  (2) The Master schedules map tasks and reduce tasks on workers
  (3) Map workers read their input splits
  (4) Map workers write intermediate files locally
  (5) Reduce workers read the intermediate files remotely
  (6) Reduce workers write the output files
Adapted from (Dean and Ghemawat, OSDI 2004)

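One detail the overview leaves implicit is how an intermediate key finds its reduce worker. Dean and Ghemawat describe a default partitioning function of the form hash(key) mod R, where R is the number of reduce tasks. A minimal sketch follows; the use of `zlib.crc32` as the hash and the names `partition` and `shuffle` are assumptions made for illustration, not Google's implementation.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce task for a key: hash(key) mod R (CRC32 is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each intermediate (key, value) pair to the bucket of its reduce task."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


map_output = [("see", 1), ("bob", 1), ("run", 1), ("see", 1), ("spot", 1), ("throw", 1)]
buckets = shuffle(map_output, num_reducers=2)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, dict(bucket))
```

Because the function is deterministic, every occurrence of a key lands in the same bucket no matter which map worker emitted it, which is what guarantees that one reducer sees all values for that key.
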
MapReduce Implementations
  • Google has a proprietary implementation in C++
      • Bindings in Java, Python
  • Hadoop is an open-source implementation in Java
      • Development led by Yahoo, used in production
      • Now an Apache project
      • Rapidly expanding software ecosystem
  • Lots of custom research implementations
      • For GPUs, cell processors, etc.

Bonus Assignment
  • Write MapReduce version of Assignment no. 2

MapReduce in VisionerBOT

VisionerBOT Distributed Design

PART II: Google File System

Distributed File System
  • Don’t move data to workers… move workers to the data!
      • Store data on the local disks of nodes in the cluster
      • Start up the workers on the node that has the data local
  • Why?
      • Not enough RAM to hold all the data in memory
      • Disk access is slow, but disk throughput is reasonable
  • A distributed file system is the answer
      • GFS (Google File System) for Google’s MapReduce
      • HDFS (Hadoop Distributed File System) for Hadoop

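The idea of starting workers on the node that already holds the data can be sketched as a simple greedy assignment. Everything in this sketch is hypothetical (the function name, the node and chunk names, the greedy policy); it is not the scheduler used by Google's MapReduce or by Hadoop, only an illustration of preferring a local replica.

```python
def assign_tasks(chunk_locations, idle_workers):
    """Greedy sketch: give each chunk to an idle worker that stores a replica if possible.

    chunk_locations: dict mapping chunk id -> set of node names holding a replica
    idle_workers:    list of node names that can run a task
    Returns a dict mapping chunk id -> (worker, is_local).
    """
    free = list(idle_workers)
    assignment = {}
    for chunk, nodes in chunk_locations.items():
        if not free:
            break  # no idle workers left; remaining chunks wait
        local = next((w for w in free if w in nodes), None)
        chosen = local if local is not None else free[0]
        free.remove(chosen)
        assignment[chunk] = (chosen, local is not None)
    return assignment


chunk_locations = {
    "c0": {"n1", "n2", "n3"},
    "c1": {"n2", "n3", "n4"},
    "c2": {"n5", "n6", "n7"},
}
plan = assign_tasks(chunk_locations, ["n4", "n2", "n9"])
print(plan)
```

In the sample run, two of the three tasks read their chunk from local disk, and only the third has to pull its data over the network.
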
GFS: Assumptions
  • Commodity hardware over “exotic” hardware
      • Scale “out”, not “up”
  • High component failure rates
      • Inexpensive commodity components fail all the time
  • “Modest” number of huge files
      • Multi-gigabyte files are common, if not encouraged
  • Files are write-once, mostly appended to
      • Perhaps concurrently
  • Large streaming reads over random access
      • High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

GFS: Design Decisions
  • Files stored as chunks
      • Fixed size (64 MB)
  • Reliability through replication
      • Each chunk replicated across 3+ chunkservers
  • Single master to coordinate access, keep metadata
      • Simple centralized management
  • No data caching
      • Little benefit due to large datasets, streaming reads
  • Simplify the API
      • Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)

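A fixed chunk size means that locating data is plain arithmetic: a byte offset within a file maps to a chunk index by integer division. The sketch below shows that arithmetic for the 64 MB chunk size on the slide; the function names and the returned tuple layout are hypothetical and do not correspond to a real GFS or HDFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size from the slide: 64 MB


def num_chunks(file_size):
    """Number of chunks needed to store file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def chunks_for_range(offset, length):
    """Map a byte range to (chunk_index, start_in_chunk, end_in_chunk) pieces."""
    if length <= 0:
        return []
    end = offset + length
    first = offset // CHUNK_SIZE
    last = (end - 1) // CHUNK_SIZE
    pieces = []
    for i in range(first, last + 1):
        base = i * CHUNK_SIZE
        start_in_chunk = max(offset, base) - base
        end_in_chunk = min(end, base + CHUNK_SIZE) - base
        pieces.append((i, start_in_chunk, end_in_chunk))
    return pieces


print(num_chunks(2**30))                          # chunks in a 1 GiB file
print(chunks_for_range(CHUNK_SIZE - 10, 20))      # a read that straddles two chunks
```

A client would then ask the master which chunkservers hold each chunk index and read the pieces from one of the replicas; that lookup is not modelled here.
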
QUESTIONS?

Editor's Notes

  • #7: In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

    Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data are processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine.

    Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

    MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade latency for throughput.
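The back-of-the-envelope calculation in the note can be reproduced in a few lines. The note does not state its disk parameters, so the figures below (10 ms per seek, 100 MB/s sustained throughput, two seeks per updated record) are assumptions chosen as plausible values for a mechanical disk; different values shift the numbers but not the conclusion.

```python
RECORDS = 10**10                     # from the note: 10^10 records
RECORD_BYTES = 100                   # from the note: 100-byte records
DB_BYTES = RECORDS * RECORD_BYTES    # 10^12 bytes = 1 terabyte

SEEK_SECONDS = 0.010                 # ASSUMPTION: about 10 ms per random seek
THROUGHPUT_BYTES_PER_SEC = 100e6     # ASSUMPTION: about 100 MB/s sequential throughput


def random_update_seconds(fraction=0.01, seeks_per_update=2):
    """Update a fraction of the records in place: one seek to read, one to write back."""
    return RECORDS * fraction * seeks_per_update * SEEK_SECONDS


def sequential_rewrite_seconds():
    """Stream the whole database once to read it and once to write it back."""
    return 2 * DB_BYTES / THROUGHPUT_BYTES_PER_SEC


random_days = random_update_seconds() / 86400
sequential_hours = sequential_rewrite_seconds() / 3600
print(f"random updates of 1%: about {random_days:.0f} days")
print(f"full sequential rewrite: about {sequential_hours:.1f} hours")
```

Under these assumptions the in-place updates take roughly three weeks while the full rewrite takes under six hours, in line with the note's "about a month" versus "under a work day".
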