Apache Hadoop, MapReduce
           &
     Windows Azure

   Guðmundur Jón Halldórsson
        Five Degrees
          July 2012
Web crawler! "No, this isn't about that"
What is Hadoop?

System for processing
mind-bogglingly large
amounts of data
Hadoop


Map-Reduce = Computation
  HDFS     = Storage
HDFS
Hadoop Distributed File System

Yes, it is a file system, written in Java.
And you can do normal file system operations
like [ls, mkdir, ...].

Works best with large files. HDFS splits each file into
blocks of 128 MB (configurable).
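To make the block arithmetic concrete, here is a minimal sketch of how a file of a given size maps onto fixed-size blocks. The helper name `splitIntoBlocks` and the 300 MB example file are assumptions made for illustration and are not part of any HDFS API; the 128 MB block size is the figure from the slide.

```javascript
// Illustrative only: splitIntoBlocks is a hypothetical helper, not an HDFS call.
var MB = 1024 * 1024;

function splitIntoBlocks(fileSizeBytes, blockSizeBytes) {
    var blocks = [];
    for (var offset = 0; offset < fileSizeBytes; offset += blockSizeBytes) {
        blocks.push({
            index: blocks.length,
            offset: offset,
            // The last block only holds whatever bytes are left over.
            length: Math.min(blockSizeBytes, fileSizeBytes - offset)
        });
    }
    return blocks;
}

// A 300 MB file with 128 MB blocks: two full blocks plus a 44 MB remainder.
var blocks = splitIntoBlocks(300 * MB, 128 * MB);
blocks.forEach(function (b) {
    console.log("block " + b.index + ": " + (b.length / MB) + " MB");
});
```

Many small files each produce at least one block for the NameNode to track, which is one way to read the "works best with large files" advice.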
HDFS
HDFS keeps 3 copies of each block by default.
The NameNode (NN) tracks which DataNodes (DN) hold each block.

DataNodes:
  DN1   DN2   DN3
  DN4   DN5   DN6
  DN7   DN8   DN9

NameNode block map (one row per block, listing its 3 replicas):
  DN1, DN4, DN7
  DN3, DN5, DN8
  DN3, DN4, DN5
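The bookkeeping described on this slide can be sketched as a toy model: a map from block id to the DataNodes holding its replicas. The names `placeBlocks` and `blocksOnNode`, the block ids, and the round-robin placement are all assumptions for illustration; real HDFS placement also takes rack topology into account, which is ignored here.

```javascript
// Toy model of the NameNode's block map. Function names, block ids and the
// round-robin placement are assumptions; real HDFS placement differs.
function placeBlocks(blockIds, dataNodes, replication) {
    if (replication > dataNodes.length) {
        throw new Error("not enough DataNodes for the requested replication");
    }
    var blockMap = {};
    var next = 0;
    blockIds.forEach(function (blockId) {
        var replicas = [];
        // Consecutive nodes guarantee the replicas land on distinct DataNodes.
        for (var r = 0; r < replication; r++) {
            replicas.push(dataNodes[(next + r) % dataNodes.length]);
        }
        blockMap[blockId] = replicas;
        next = (next + replication) % dataNodes.length;
    });
    return blockMap;
}

// Which blocks lose a replica if the given DataNode fails?
function blocksOnNode(blockMap, node) {
    return Object.keys(blockMap).filter(function (blockId) {
        return blockMap[blockId].indexOf(node) !== -1;
    });
}

var dataNodes = ["DN1", "DN2", "DN3", "DN4", "DN5", "DN6", "DN7", "DN8", "DN9"];
var blockMap = placeBlocks(["blk_1", "blk_2", "blk_3"], dataNodes, 3);
console.log(blockMap);
console.log(blocksOnNode(blockMap, "DN5"));
```

Because every block lives on three different nodes, losing one DataNode still leaves two copies of each affected block, and the map tells the NameNode which blocks need a new replica.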
Map-Reduce
• Write a mapper that takes a key and a value and
  emits zero or more new keys and values
• Write a reducer that takes all the values of one key and
  emits zero or more new keys and values
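The two roles above can be sketched as a tiny single-process runner: map every record, group the emitted values by key, then call the reducer once per key. `runMapReduce`, its `emit` callbacks, and the temperature readings are hypothetical names and data chosen for illustration; nothing here is Hadoop's API, and a real cluster spreads these phases over many machines.

```javascript
// Minimal single-process MapReduce runner. runMapReduce and the emit
// callbacks are hypothetical names, not part of any Hadoop API.
function runMapReduce(records, mapper, reducer) {
    // Map phase: each input record may emit zero or more (key, value) pairs.
    var groups = Object.create(null);
    records.forEach(function (record) {
        mapper(record.key, record.value, function emit(k, v) {
            if (!(k in groups)) {
                groups[k] = [];
            }
            groups[k].push(v);
        });
    });

    // The grouping above plays the role of the shuffle: groups now holds
    // all values for each key. Reduce phase: one call per key, sorted by key.
    var output = [];
    Object.keys(groups).sort().forEach(function (k) {
        reducer(k, groups[k], function emit(outKey, outValue) {
            output.push([outKey, outValue]);
        });
    });
    return output;
}

// Example (made-up data): highest temperature per year from "year,temp" lines.
var readings = [
    { key: 0, value: "2011,14" },
    { key: 1, value: "2012,21" },
    { key: 2, value: "2011,19" },
    { key: 3, value: "2012,17" }
];

var maxByYear = runMapReduce(
    readings,
    function (offset, line, emit) {
        var parts = line.split(",");
        emit(parts[0], parseInt(parts[1], 10));
    },
    function (year, temps, emit) {
        emit(year, Math.max.apply(null, temps));
    }
);
console.log(maxByYear);
```

The mapper never sees more than one record and the reducer never sees more than one key, which is what lets a framework run many of each in parallel.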
Map-Reduce JS example
// Mapper: split a line of text into words and emit (word, 1) for each one.
var map = function ( key, value, context ) {
    var words = value.split(/[^a-zA-Z]/);
    for ( var i = 0; i < words.length; i++ ) {
        if ( words[i] !== "" ) {
            context.write( words[i].toLowerCase(), 1 );
        }
    }
};

// Reducer: add up the counts emitted for one word.
var reduce = function ( key, values, context ) {
    var sum = 0;
    while ( values.hasNext() ) {
        sum += parseInt( values.next(), 10 );
    }
    context.write( key, sum );
};
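The functions on this slide expect the host to pass in a `context` object with a `write` method and a `values` iterator with `hasNext` and `next`. To try them without a cluster, one option is to fake those two objects. In the sketch below, `makeContext`, `makeIterator`, and `wordCount` are hypothetical stand-ins written for this example, and the mapper and reducer are repeated only so that the block runs by itself.

```javascript
// Copies of the slide's functions so this block runs by itself.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-ins for the objects the host would normally supply.
function makeContext() {
    var pairs = [];
    return { pairs: pairs, write: function (k, v) { pairs.push([k, v]); } };
}
function makeIterator(items) {
    var i = 0;
    return {
        hasNext: function () { return i < items.length; },
        next: function () { return items[i++]; }
    };
}

function wordCount(lines) {
    // Map every line, collecting the emitted (word, 1) pairs.
    var mapOut = makeContext();
    lines.forEach(function (line, n) { map(n, line, mapOut); });

    // Group the values by word, as the framework would between the phases.
    var grouped = Object.create(null);
    mapOut.pairs.forEach(function (p) {
        (grouped[p[0]] = grouped[p[0]] || []).push(p[1]);
    });

    // Reduce each word's values through the iterator interface.
    var reduceOut = makeContext();
    Object.keys(grouped).sort().forEach(function (word) {
        reduce(word, makeIterator(grouped[word]), reduceOut);
    });
    return reduceOut.pairs;
}

console.log(wordCount(["Hadoop stores data", "Hadoop processes data"]));
```

Keeping the mapper and reducer free of any dependency other than the two objects passed in is what makes this kind of local check possible.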
MapReduce
Data Systems and Their Timeframes
Does Hadoop solve all my DATA
problems, or is there something
         else out there?
•   Pig         High-level MapReduce language
•   Hive        SQL-like high-level MapReduce language
•   HBase       Realtime processing (based on Google
                BigTable)
•   Accumulo    BigTable-style store originally developed at the NSA
•   Avro        Data serialization
•   ZooKeeper   Low-level coordination
•   HCatalog    Storage management and interoperability
                between all systems
•   Oozie       Job scheduler
•   Flume       Log and data aggregation
•   Whirr       Automated cloud clusters on EC2, Rackspace, etc.
•   Sqoop       Relational data importer
•   MRUnit      Unit testing for jobs
•   Mahout      Machine learning libraries
•   Bigtop      Packaging and interoperability testing
•   Crunch      MapReduce pipelines in Java and Scala
•   Giraph      Graph processing on huge distributed graphs
