Distributed Computing … and an introduction to Hadoop - Ankit Minocha
Outline Introduction to Distributed Computing Parallel vs Distributed Computing.
Computer Speedup Moore’s Law: “ The density of transistors on a chip doubles every 18 months, for the same cost”  (1965)
Scope of problems What can you do with 1 computer? What can you do with 100 computers? What can you do with an entire data center?
Rendering multiple frames of high-quality animation
Parallelization Idea Parallelization is “easy” if processing can be cleanly split into n units:
Parallelization Idea (2) In a parallel computation, we would like to have as many threads as we have processors. e.g., a four-processor computer would be able to run four threads at the same time.
Parallelization Idea (3)
Parallelization Idea (4)
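The split-then-aggregate idea can be sketched in Python (the work function and the interleaved chunking scheme here are illustrative choices, not part of the slides):

```python
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    # Stand-in "work unit": sum the numbers in this chunk.
    return sum(chunk)

data = list(range(100))
n = 4  # e.g., one thread per processor on a four-processor machine
chunks = [data[i::n] for i in range(n)]  # cleanly split into n units

with ThreadPoolExecutor(max_workers=n) as pool:
    partials = list(pool.map(process, chunks))  # one thread per unit

total = sum(partials)  # aggregate the results at the end
```

Because the chunks share nothing, no synchronization is needed; the pitfalls on the next slide appear as soon as that stops being true.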
Parallelization Pitfalls But this model is too simple!  How do we assign work units to worker threads? What if we have more work units than threads? How do we aggregate the results at the end? How do we know all the workers have finished? What if the work cannot be divided into completely separate tasks? What is the common theme of all of these problems?
Parallelization Pitfalls (2) Each of these problems represents a point at which multiple threads must communicate with one another, or access a shared resource. Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!
What is Wrong With This? Thread 1: void foo() { x++; y = x; } Thread 2: void bar() { y++; x+=3; } If the initial state is y = 0, x = 6, what happens after these threads finish running?
Multithreaded = Unpredictability When we run a multithreaded program, we don’t know what order threads run in, nor do we know when they will interrupt one another. Thread 1: void foo() { eax = mem[x]; inc eax; mem[x] = eax; ebx = mem[x]; mem[y] = ebx; } Thread 2: void bar() { eax = mem[y]; inc eax; mem[y] = eax; eax = mem[x]; add eax, 3; mem[x] = eax; } Many things that look like “one step” operations actually take several steps under the hood:
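One such interleaving can be simulated sequentially in Python, with local variables standing in for each thread's registers (this schedule is just one of many possibilities):

```python
x, y = 6, 0

# Thread 1 reads x into its "register"...
eax1 = x            # eax = mem[x]
# ...but before it writes back, Thread 2 runs to completion:
eax2 = y + 1        # eax = mem[y]; inc eax
y = eax2            # mem[y] = eax
x = x + 3           # eax = mem[x]; add eax, 3; mem[x] = eax  -> x = 9
# Now Thread 1 resumes with its stale register value:
x = eax1 + 1        # inc eax; mem[x] = eax  -> x = 7: the +3 is lost!
y = x               # ebx = mem[x]; mem[y] = ebx
```

Under this schedule both variables end up 7: Thread 2's update to x is silently overwritten, even though every individual "instruction" executed correctly.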
Multithreaded = Unpredictability This applies to more than just integers: Pulling work units from a queue. Reporting work back to the master unit. Telling another thread that it can begin the “next phase” of processing (e.g., torrent leechers and seeders). All require synchronization!
Synchronization Primitives A synchronization primitive is a special shared variable that guarantees that it can only be accessed atomically. Hardware support guarantees that operations on synchronization primitives only ever take one step.
Semaphores A semaphore is a flag that can be raised or lowered in one step. Semaphores were flags that railroad engineers would use when entering a shared track. Only one side of the semaphore can ever be green! (Can both be red? Yes, but never both green.)
Semaphores set() and reset() can be thought of as lock() and unlock(). Calls to lock() when the semaphore is already locked cause the thread to block. Pitfalls: must “bind” semaphores to particular objects; must remember to unlock correctly.
The “corrected” example Thread 1: void foo() { sem.lock(); x++; y = x; sem.unlock(); } Thread 2: void bar() { sem.lock(); y++; x+=3; sem.unlock(); } Global var “Semaphore sem = new Semaphore();” guards access to x & y
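A Python sketch of the same pattern, using threading.Lock in place of the slide's semaphore:

```python
import threading

x, y = 6, 0
sem = threading.Lock()  # plays the role of "Semaphore sem" on the slide

def foo():
    global x, y
    with sem:       # sem.lock() ... sem.unlock()
        x += 1
        y = x

def bar():
    global x, y
    with sem:
        y += 1
        x += 3

t1 = threading.Thread(target=foo)
t2 = threading.Thread(target=bar)
t1.start(); t2.start()
t1.join(); t2.join()
# The lock serializes the two bodies, so only two outcomes remain:
# foo-then-bar gives (x, y) = (10, 8); bar-then-foo gives (10, 10).
```

The lock eliminates the torn interleavings, but note there are still two possible outcomes; forcing one particular order is what the condition-variable example below the next slide addresses.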
Condition Variables A condition variable notifies threads that a particular condition has been met  Inform another thread that a queue now contains elements to pull from (or that it’s empty – request more elements!) Pitfall: What if nobody’s listening?
The final example Thread 1: void foo() { sem.lock(); x++; y = x; fooDone = true; sem.unlock(); fooFinishedCV.notify(); } Thread 2: void bar() { sem.lock(); while(!fooDone) fooFinishedCV.wait(sem); y++; x+=3; sem.unlock(); } Global vars: Semaphore sem = new Semaphore(); ConditionVar fooFinishedCV = new ConditionVar(); boolean fooDone = false;
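A Python sketch of the slide's pattern, using threading.Condition (which bundles the lock and the condition variable into one object):

```python
import threading

x, y = 6, 0
foo_done = False
cv = threading.Condition()  # lock + condition variable in one object

def foo():
    global x, y, foo_done
    with cv:
        x += 1
        y = x
        foo_done = True
        cv.notify()       # wake anyone waiting on the condition

def bar():
    global x, y
    with cv:
        while not foo_done:   # loop guards against spurious wakeups
            cv.wait()         # releases the lock while sleeping
        y += 1
        x += 3

t2 = threading.Thread(target=bar); t2.start()
t1 = threading.Thread(target=foo); t1.start()
t1.join(); t2.join()
# bar now always observes foo's result, so the outcome is deterministic.
```

Whichever thread the scheduler runs first, bar's body executes only after foo's, so the final state is always x = 10, y = 8.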
Too Much Synchronization? Deadlock Synchronization becomes even more complicated when multiple locks can be used. Can cause entire system to “get stuck”. Thread A: semaphore1.lock(); semaphore2.lock(); /* use data guarded by semaphores */ semaphore1.unlock(); semaphore2.unlock(); Thread B: semaphore2.lock(); semaphore1.lock(); /* use data guarded by semaphores */ semaphore1.unlock(); semaphore2.unlock(); (Image: RPI CSCI.4210 Operating Systems notes)
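The slide shows the hazard; one standard remedy (implied but not shown there) is to impose a single global order on lock acquisition, sketched here in Python:

```python
import threading

lock1 = threading.Lock()
lock2 = threading.Lock()
log = []

def worker(name):
    # Both threads acquire in the SAME global order: lock1 before lock2.
    # The circular wait in the slide (A: 1 then 2, B: 2 then 1) cannot
    # occur, so the system cannot deadlock on these two locks.
    with lock1:
        with lock2:
            log.append(name)

a = threading.Thread(target=worker, args=("A",))
b = threading.Thread(target=worker, args=("B",))
a.start(); b.start()
a.join(); b.join()
```

Deadlock needs a cycle in the waits-for graph; a total order on locks makes such a cycle impossible.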
The Moral: Be Careful! Synchronization is hard Need to consider all possible shared state Must keep locks organized and use them consistently and correctly Knowing there are bugs may be tricky; fixing them can be even worse! Keeping shared state to a minimum reduces total system complexity
Hadoop to the rescue!
Prelude to MapReduce We saw earlier that explicit parallelism/synchronization is hard Synchronization does not even answer questions specific to distributed computing, like how to move data from one machine to another Fortunately, MapReduce handles this for us
Prelude to MapReduce MapReduce is a paradigm designed by Google for making a subset (albeit a large one) of distributed problems easier to code Automates data distribution & result aggregation  Restricts the ways data can interact to eliminate locks (no shared state = no locks!)
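The paradigm's shape can be sketched in plain, single-process Python. This is not Hadoop's actual API, just an illustration of the two phases and of why no locks are needed: map tasks share no mutable state.

```python
from collections import defaultdict

def map_phase(document):
    # map: emit (word, 1) pairs; each call touches only its own input
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # shuffle/reduce: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cat sat", "the cat ran"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
result = reduce_phase(intermediate)
```

In a real cluster the framework runs map_phase on many machines, moves the intermediate pairs across the network, and runs the reduce on grouped keys; the programmer writes only the two functions.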
Thank you so much “Clickable” for your sincere efforts.

Editor's Notes

  • On “What is Wrong With This?”: There are multiple possible final states. y is definitely a problem, since depending on the interleaving it can end up 1, 7, 8, or 10… but x is unpredictable too: it can be 7, 9, or 10!
  • On “Multithreaded = Unpredictability”: inform students that the term we want here is “race condition”.
  • On “The ‘corrected’ example”: ask: are there still any problems? (Yes: we still have two possible outcomes. We want some other mechanism that allows us to serialize access on an event.)
  • On “Condition Variables”: go over wait() / notify() / broadcast(); these must be combined with a semaphore!