Hadoop

Simple. Scalable.
@markgunnels

mark@catamorphiclabs.com
Java. Clojure. Ruby.

    Cloudera Certified
posscon.org

April 15, 16, and 17
Agenda

 Overview
 Massively Large Data Sets and the problems therein
 Distributed File System
 MapReduce
 Pig
Overview
Doug Cutting

   Genius
Favorite Hadoop Story

     New York Times
4 Terabytes of Source Articles.
24 Hours.
5.5 Terabytes of PDFs.
Did it again.
$240.
Infoporn from Yahoo

 73 hours
 490 TB Shuffling
 280 TB Output
 4000 Nodes
 16 PB Disk Space
 32K Cores
 64 TB RAM
Hadoop solves...
Analyzing Massively Large Datasets
Two Problems

You have to distribute.
Data Storage

Disk capacity has grown far faster than read speeds.
Datasets won't fit on one disk.
Must tolerate node failure.
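
A rough back-of-the-envelope sketch of why slow reads force distribution. The drive size and read speed below are assumed, illustrative figures, not numbers from the talk.

```python
# Assumed, illustrative figures (not from the talk).
DATASET_BYTES = 1 * 10**12            # 1 TB of data
READ_SPEED_BYTES_PER_S = 100 * 10**6  # 100 MB/s sustained read per disk


def scan_seconds(total_bytes, disks):
    """Seconds to read total_bytes spread evenly over `disks` drives read in parallel."""
    return total_bytes / (disks * READ_SPEED_BYTES_PER_S)


one_disk = scan_seconds(DATASET_BYTES, 1)
hundred_disks = scan_seconds(DATASET_BYTES, 100)

print(f"1 disk:    {one_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions a full scan drops from hours to minutes once the data is spread over many disks, which is also why node failure becomes a routine event to plan for.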
Data Analysis

Combine data from many machines. Tolerate node failure.
How Hadoop solves these problems.
Send Code to Data. Not Data to Code.
Data Storage

    HDFS
Name Node. Data Nodes.

   Master - Slave Relationship
Shard massive files across multiple machines.
MB, GB, and TB
Tolerant of Node Failure

Each block of a file is replicated on 3 nodes by default.
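
A toy sketch of the sharding and replication idea above, not how HDFS is implemented. The 64 MB block size, the node names, and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                # copies kept of each block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node is lost."""
    return [
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical five-node cluster holding one 200 MB file.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), nodes)
print(len(blocks), "blocks")
print(len(surviving_blocks(placement, "node2")), "blocks readable after node2 fails")
```

Because every block lives on three different machines, losing any single node leaves all blocks readable.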
HDFS behaves like a normal file system.
No true appends yet.
Demonstration.
Data Analysis

  MapReduce
JobTracker. TaskTrackers.

   Master - Slave Relationship.
map
Demonstration
pmap
Demonstration
reduce
Demonstration
(reduce (pmap))
Demonstration.
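
The live demonstrations are not part of this transcript. As a stand-in, here is a minimal Python sketch of the same four ideas: map, a parallel map, reduce, and reduce over a parallel map. The word-count task and the input lines are made up, and a thread pool only approximates Clojure's pmap.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

lines = ["the quick brown fox", "the lazy dog", "the fox"]  # made-up input


def count_words(line):
    """Map step: turn one line into its own word counts."""
    return Counter(line.split())


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# map: apply the function to each element, one after another.
sequential = list(map(count_words, lines))

# "pmap": same results in the same order, but elements are handled concurrently.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(count_words, lines))

# reduce over the parallel map: fold the partial counts into one total.
totals = reduce(merge, parallel, Counter())
print(totals.most_common(2))
```

The map step needs no shared state, so it can run anywhere; only the reduce step has to bring results together. That split is what MapReduce spreads across machines.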
MapReduce

   Java
Nobody likes it.

       :-)
MapReduce

Ruby. Python. Unix Utilities.
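
A minimal sketch of what a Hadoop Streaming word count could look like in Python. Streaming feeds the mapper input lines, sorts the mapper's tab-separated key/value output by key, and feeds the sorted lines to the reducer. The function names and the sample input are hypothetical, and the sort step is simulated locally with sorted().

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for the framework's sort between the steps.
shuffled = sorted(mapper(["the quick fox", "the fox"]))
for line in reducer(shuffled):
    print(line)
```

In a real streaming job the two functions would live in separate scripts that read sys.stdin and print to standard output.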
MapReduce

  Clojure
Hadoop Ecosystem

Pig. ZooKeeper. Hive. Cascading.
Pig
HBase
