BigData using Hadoop and Pig

                Sudar Muthu
             Research Engineer
                Yahoo Labs
          http://sudarmuthu.com
      http://twitter.com/sudarmuthu
Who am I?
   Research Engineer at Yahoo Labs
   Mines useful information from huge datasets
   Worked on both structured and unstructured
    data.
   Builds robots as a hobby ;)
What will we see today?
   What is BigData?
   Get our hands dirty with Hadoop
   See some code
   Try out Pig
   Glimpse of HBase and Hive
What is BigData?
“Big data is a collection of data sets so large
and complex that it becomes difficult to
process using on-hand database
management tools.”



        http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1GB today is not the same
as 1GB just 10 years ago
Anything that doesn’t fit
into the RAM of a single
         machine
Types of Big Data
Data in Movement (streams)
   Twitter/Facebook comments
   Stock market data
   Access logs of a busy web server
   Sensors: Vital signs of a newborn
Data at rest (Oceans)
   Collection of what has streamed
   Emails or IM messages
   Social Media
   Unstructured documents: forms, claims
We have all this data and
 need to find a way to
      process it
Traditional way of scaling
               (Scaling up)
   Make the machine more powerful
     Add more RAM
     Add more cores to CPU

   It is going to be very expensive
   Will be limited by disk seek and read time
   Single point of failure
New way of scaling (Scaling out)
   Add more instances of the same machine
   Cost is less compared to scaling up
   Immune to failure of a single node or a set of nodes
   Disk seek and write time is not going to be a
    bottleneck
   Future-proof (to some extent)
Is it a fit for ALL types of
         problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant
 grid operating system for
data storage and processing
What is Hadoop?
   Runs on commodity hardware
   HDFS: Fault-tolerant high-bandwidth clustered
    storage
   MapReduce: Distributed data processing
   Works with structured and unstructured data
   Open source, Apache license
   Master (NameNode) – Slave architecture
Design Principles
   System shall manage and heal itself
   Performance shall scale linearly
   Algorithm should move to data
       Lower latency, lower bandwidth
   Simple core, modular and extensible
Components of Hadoop
   HDFS
   MapReduce
   PIG
   HBase
   Hive
Getting started with
      Hadoop
What am I not going to cover?
   Installation or setting up Hadoop
        Will be running all the code on a single-node instance
   Monitoring of the clusters
   Performance tuning
   User authentication or quota
Before we get into code,
 let’s understand some
        concepts
MapReduce
Framework for distributed
processing of large datasets
MapReduce
Consists of two functions
 Map
        Filter and transform the input into a form that the
         reducer can understand
   Reduce
       Aggregate over the input provided by the Map
        function
Formal definition
Map
<k1, v1> -> list(<k2,v2>)



Reduce
<k2, list(v2)> -> list(<k3, v3>)
Let’s see some examples
Count number of words in files
Map
<file_name, file_contents> => list<word, count>

Reduce
<word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map
<“file1”, “to be or not to be”> =>
{<“to”,1>,
<“be”,1>,
<“or”,1>,
<“not”,1>,
<“to”,1>,
<“be”,1>}
Count number of words in files
Reduce
{<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>,
<“not”,<1>>}

=>

{<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
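The word-count Map and Reduce steps above can be simulated in plain Java without a Hadoop cluster. This is a minimal sketch; the class and method names are illustrative and not part of the Hadoop API:

```java
import java.util.*;

// Plain-Java simulation of the word-count example:
// map() emits <word, 1> pairs, reduce() groups and sums them.
public class WordCountSim {
    // Map: file contents => list of <word, 1> pairs
    public static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : contents.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return out;
    }

    // Shuffle + Reduce: group pairs by word, then sum the counts
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(reduce(map("to be or not to be")));
    }
}
```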
Max temperature in a year
Map
<file_name, file_contents> => <year, temp>

Reduce
<year, list(temp)> => <year, max_temp>
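The max-temperature pair transformations can be sketched the same way in plain Java. The "year temperature" line format and the class name below are assumptions made for illustration, not a real input format:

```java
import java.util.*;

// Plain-Java sketch of the max-temperature example:
// map() parses one record into a <year, temp> pair,
// reduce() folds a year's temperature list down to its maximum.
public class MaxTempSim {
    // Map: a line assumed to look like "1950 22" => <"1950", 22>
    public static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split("\\s+");
        return new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1]));
    }

    // Reduce: <year, list(temp)> => max_temp for that year
    public static int reduce(List<Integer> temps) {
        int max = Integer.MIN_VALUE;
        for (int t : temps) {
            max = Math.max(max, t);
        }
        return max;
    }
}
```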
HDFS
HDFS
   Distributed file system
   Data is distributed over different nodes
   Will be replicated for failover
   Is abstracted out for the algorithms
Hands on Hadoop and pig
HDFS Commands
   hadoop fs -mkdir <dir_name>
   hadoop fs -ls <dir_name>
   hadoop fs -rmr <dir_name>
   hadoop fs -put <local_file> <remote_dir>
   hadoop fs -get <remote_file> <local_dir>
   hadoop fs -cat <remote_file>
   hadoop fs -help
Let’s write some code
Count Words Demo
   Create a mapper class
       Override map() method
   Create a reducer class
       Override reduce() method
   Create a main method
   Create JAR
   Run it on Hadoop
Map Method
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);

  while (itr.hasMoreTokens()) {
    context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}
Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {

  int sum = 0;
  for (IntWritable value : values) {
    sum += value.get();
  }
  context.write(key, new IntWritable(sum));
}
Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
Run it on Hadoop


hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at          1
be          3
can         7
can't       1
code        2
command     1
connect     1
consider    1
continued   1
control     4
could       1
couple      1
courtesy    1
desktop,    1
detailed    1
details     1
…..
…..
Pig
What is Pig?
Pig provides an abstraction for processing large
datasets

Consists of
 Pig Latin – Language to express data flows

 Execution environment
Why do we need Pig?
   MapReduce can get complex if your data needs a
    lot of processing/transformations
   MapReduce provides primitive data structures
   Pig provides rich data structures
   Supports complex operations like joins
Running Pig programs
   In an interactive shell called Grunt
   As a Pig Script
   Embedded into Java programs (like JDBC)
Grunt – Interactive Shell
Grunt shell
   fs commands – like hadoop fs
     fs -ls
     fs -mkdir

   fs -copyToLocal <file>
   fs -copyFromLocal <local_file> <dest>
   exec – execute Pig scripts
   sh – execute shell scripts
Let’s see them in action
Pig Latin
   LOAD – Read files
   DUMP – Dump data in the console
   JOIN – Do a join on data sets
   FILTER – Filter data sets
   ORDER – Sort data
   STORE – Store data back in files
Let’s see some code
Sort words based on count
Filter words present in a list
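A flow like the word-count-and-sort demo above might be sketched in Pig Latin as follows. This is a hedged illustration: the input path, alias names, and field names are all assumptions, not the demo's actual script:

```pig
-- Hypothetical word count, sorted by count; names are illustrative
lines   = LOAD 'input/file1.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
sorted  = ORDER counts BY cnt DESC;
DUMP sorted;
```

Each statement builds a new relation from the previous one, which is what "Language to express data flows" means in practice.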
HBase
What is HBase?
   Distributed, column-oriented database built on
    top of HDFS
   Useful when real-time read/write random-access
    to very large datasets is needed.
   Can handle billions of rows with millions of
    columns
Hive
What is Hive?
   Useful for managing and querying structured
    data
   Provides SQL like syntax
   Metadata is stored in an RDBMS
   Extensible with types, functions, scripts, etc.
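To make "SQL like syntax" concrete, a Hive word-count query might look like the sketch below; the table and column names are hypothetical:

```sql
-- Hypothetical HiveQL; table and column names are illustrative
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word
ORDER BY cnt DESC;
```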
Hadoop
   Affordable Storage/Compute
   Structured or Unstructured
   Resilient Auto Scalability

Relational Databases
   Interactive response times
   ACID
   Structured data
   Cost/Scale prohibitive
Thank You
