PIG: A Big Data Processor
Tushar B. Kute,
http://tusharkute.com
What is Pig?
• Apache Pig is an abstraction over MapReduce. It is a
tool/platform used to analyze large data sets,
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using
Apache Pig.
• To write data analysis programs, Pig provides a high-
level language known as Pig Latin.
• This language provides various operators using which
programmers can develop their own functions for
reading, writing, and processing data.
Apache Pig
• To analyze data using Apache Pig, programmers
need to write scripts using Pig Latin language.
• All these scripts are internally converted to Map
and Reduce tasks.
• Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as
input and converts those scripts into
MapReduce jobs.
Why do we need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks
easily without having to type complex codes in Java.
• Apache Pig uses a multi-query approach, thereby reducing the
length of code. For example, an operation that would require you
to type 200 lines of code (LoC) in Java can often be done in
as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces development time by almost 16 times.
• Pig Latin is a SQL-like language, so it is easy to learn Apache Pig
when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc. In addition, it also
provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
• Rich set of operators: It provides many operators to perform
operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL and it is easy to write
a Pig script if you are good at SQL.
• Optimization opportunities: Tasks in Apache Pig optimize their
execution automatically, so programmers need to focus only on
the semantics of the language.
• Extensibility: Using the existing operators, users can develop their
own functions to read, process, and write data.
• UDFs: Pig provides the facility to create User-Defined Functions in
other programming languages such as Java, and to invoke or embed them
in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Pig vs. MapReduce
Pig vs. SQL
Pig vs. Hive
Applications of Apache Pig
• To process huge data sources such as web logs.
• To perform data processing for search
platforms.
• To process time sensitive data loads.
Apache Pig – History
• In 2006, Apache Pig was developed as a
research project at Yahoo, primarily to make it easy to
create and execute MapReduce jobs on large datasets.
• In 2007, Apache Pig was open sourced via
Apache incubator.
• In 2008, the first release of Apache Pig came
out. In 2010, Apache Pig graduated as an
Apache top-level project.
Pig Architecture
Apache Pig – Components
• Parser: Initially, Pig scripts are handled by the Parser. It
checks the syntax of the script and performs type checking and other
miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph) that represents the Pig Latin
statements and logical operators.
• Optimizer: The logical plan (DAG) is passed to the logical
optimizer, which carries out logical optimizations such as
projection pushdown.
• Compiler: The compiler compiles the optimized logical plan
into a series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted
to Hadoop in sorted order and executed on Hadoop,
producing the desired results.
Apache Pig – Data Model
Apache Pig – Elements
• Atom
– Any single value in Pig Latin, irrespective of its
data type, is known as an Atom.
– It is stored as a string and can be used as a string
or a number. int, long, float, double, chararray,
and bytearray are the atomic values of Pig.
– A piece of data or a simple atomic value is known
as a field.
– Example: ‘raja’ or ‘30’
Apache Pig – Elements
• Tuple
– A record formed by an ordered set of
fields is known as a tuple; the fields can be of any
type. A tuple is similar to a row in a table of an
RDBMS.
– Example: (Raja, 30)
Apache Pig – Elements
• Bag
– A bag is an unordered set of tuples. In other words, a
collection of (non-unique) tuples is known as a bag. Each
tuple can have any number of fields (flexible schema). A
bag is represented by ‘{}’. It is similar to a table in an RDBMS,
but unlike a table in an RDBMS, it is not necessary that every
tuple contain the same number of fields or that the fields
in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is
known as an inner bag.
– Example: {Raja, 30, {9848022338, raja@gmail.com}}
Apache Pig – Elements
• Relation
– A relation is a bag of tuples. The relations in Pig
Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should
be unique. The value can be of any type. A map is
represented by ‘[]’.
– Example: [name#Raja, age#30]
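The elements above map naturally onto familiar structures. As a rough plain-Python analogue (the variable names and sample values are illustrative, not part of Pig):

```python
# A rough Python analogue of Pig's data model (names are illustrative only).

atom = 'raja'                            # atom: any single scalar value
record = ('Raja', 30)                    # tuple: an ordered set of fields
bag = [('Raja', 30), ('Mohammad', 45)]   # bag: collection of tuples; tuples may
                                         # differ in arity and field types
data_map = {'name': 'Raja', 'age': 30}   # map: key-value pairs with string keys

# A bag nested inside a tuple models Pig's "inner bag".
inner = ('Raja', 30, [('9848022338', 'raja@gmail.com')])

print(record[0], data_map['age'])        # → Raja 30
```

Note the key difference from an RDBMS row: a bag tolerates tuples of different shapes, which is why Pig can load loosely structured data.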
Installation of PIG
Download
• Download the tar.gz file of Apache Pig from
here:
http://mirror.fibergrid.in/apache/pig/pig-0.15.0/
pig-0.15.0.tar.gz
Extract and copy
• Extract this file using the right-click -> 'Extract here'
option or with the tar -xzvf command.
• Rename the created folder 'pig-0.15.0' to 'pig'.
• Now, move this folder to /usr/lib using the following
command:
$ sudo mv pig/ /usr/lib
Edit the bashrc file
• Open the bashrc file:
sudo gedit ~/.bashrc
• Go to the end of the file and add the following lines.
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
• Type the following command to put it into effect:
source ~/.bashrc
Start Pig
• Start Pig in local mode:
pig -x local
• Start Pig in MapReduce mode (requires the Hadoop
daemons to be running):
pig -x mapreduce
Grunt shell
Data Processing with PIG
Example: movies_data.csv
1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
Load data
• $ pig -x local
• grunt> movies = LOAD
'movies_data.csv' USING
PigStorage(',') as
(id,name,year,rating,duration);
• grunt> dump movies;
This displays the contents of the relation.
Filter data
• grunt> movies_greater_than_35 =
FILTER movies BY (float)rating > 3.5;
• grunt> dump movies_greater_than_35;
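To see what the FILTER step is doing, here is a plain-Python sketch of the same operation over a few rows of the sample data (illustrative only, not Pig itself; ratings are kept as strings to mirror the untyped load, which is why the cast is needed):

```python
# Plain-Python analogue of: FILTER movies BY (float)rating > 3.5
movies = [
    (1, 'Dhadakebaz', 1986, '3.2', 7560),
    (2, 'Dhumdhadaka', 1985, '3.8', 6300),
    (3, 'Ashi hi banva banvi', 1988, '4.1', 7802),
]

# rating is field index 3; cast to float before comparing, as in the Pig script
movies_greater_than_35 = [m for m in movies if float(m[3]) > 3.5]
for m in movies_greater_than_35:
    print(m)
```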
Store the results data
• grunt> store movies_greater_than_35
into 'my_movies';
• It stores the result in a local file-system directory
named 'my_movies'.
Display the result
• Now display the result from the local file system.
cat my_movies/part-m-00000
Load command
• The load command specified only the column
names. We can modify the statement as follows
to include the data types of the columns:
• grunt> movies = LOAD
'movies_data.csv' USING
PigStorage(',') as (id:int,
name:chararray, year:int,
rating:double, duration:int);
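What the typed LOAD accomplishes can be sketched in plain Python: each column is parsed into its declared type while the CSV is read (the inline sample stands in for movies_data.csv):

```python
import csv
import io

# Python analogue of LOAD ... AS (id:int, name:chararray, year:int,
# rating:double, duration:int): parse each column into its declared type.
raw = "1,Dhadakebaz,1986,3.2,7560\n2,Dhumdhadaka,1985,3.8,6300\n"

movies = [
    (int(i), name, int(year), float(rating), int(duration))
    for i, name, year, rating, duration in csv.reader(io.StringIO(raw))
]
print(movies[0])   # → (1, 'Dhadakebaz', 1986, 3.2, 7560)
```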
Check the filters
• List the movies that were released between 1990 and
1995
grunt> movies_between_90_95 = FILTER
movies by year > 1990 and year < 1995;
• List the movies that start with the alphabet D
grunt> movies_starting_with_D = FILTER
movies by name matches 'D.*';
• List the movies that have a duration greater than 2 hours
grunt> movies_duration_2_hrs = FILTER
movies by duration > 7200;
Output
• Movies between 1990 and 1995
• Movies starting with 'D'
• Movies longer than 2 hours
Describe
• The schema of a relation/alias can be
viewed using the DESCRIBE command:
grunt> DESCRIBE movies;
movies: {id: int, name: chararray,
year: int, rating: double, duration:
int}
Foreach
• FOREACH gives a simple way to apply
transformations based on columns. Let’s understand
this with an example.
• List the movie names and their durations in minutes
grunt> movie_duration = FOREACH movies
GENERATE name, (double)(duration/60);
• The above statement generates a new alias that holds
the list of movies and their durations in minutes. (Note:
since duration is an int, duration/60 performs integer
division; use (double)duration/60 to keep fractional minutes.)
• You can check the results using the DUMP command.
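As a plain-Python sketch of the FOREACH projection (illustrative only; the sample tuples are name/duration pairs taken from the data above):

```python
# Python analogue of: FOREACH movies GENERATE name, (double)(duration/60)
movies = [('Dhadakebaz', 7560), ('Zapatlela', 6022)]

# In Pig, duration/60 on two ints truncates before the (double) cast;
# Python 3's / always yields a float, so the fraction is kept here.
movie_duration = [(name, duration / 60) for name, duration in movies]
print(movie_duration[0])   # → ('Dhadakebaz', 126.0)
```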
Output
Group
• The GROUP keyword is used to group fields in a
relation.
• List the years and the number of movies released
each year.
grunt> grouped_by_year = group movies
by year;
grunt> count_by_year = FOREACH
grouped_by_year GENERATE group,
COUNT(movies);
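GROUP produces, for each key, a bag of all the tuples that share that key; COUNT then measures each bag. A plain-Python sketch of the same two steps (sample tuples are illustrative):

```python
from collections import defaultdict

# Python analogue of GROUP movies BY year followed by COUNT(movies).
movies = [
    ('Navra Maza Navsacha', 2004), ('De danadan', 1987),
    ('Gammat Jammat', 1987), ('Pachhadlela', 2004),
]

grouped_by_year = defaultdict(list)      # year -> bag of tuples
for m in movies:
    grouped_by_year[m[1]].append(m)

count_by_year = {year: len(bag) for year, bag in grouped_by_year.items()}
print(sorted(count_by_year.items()))     # → [(1987, 2), (2004, 2)]
```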
Output
Order by
• Let us question the data to illustrate the ORDER BY
operation.
• List all the movies in the ascending order of year.
grunt> asc_movies_by_year = ORDER
movies BY year ASC;
grunt> DUMP asc_movies_by_year;
• List all the movies in the descending order of year.
grunt> desc_movies_by_year = ORDER movies
by year DESC;
grunt> DUMP desc_movies_by_year;
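The same ordering, sketched in plain Python (sample tuples are illustrative):

```python
# Python analogue of ORDER movies BY year ASC / DESC.
movies = [('Zapatlela', 1993), ('Dhumdhadaka', 1985), ('Pachhadlela', 2004)]

asc_movies_by_year = sorted(movies, key=lambda m: m[1])
desc_movies_by_year = sorted(movies, key=lambda m: m[1], reverse=True)
print(asc_movies_by_year[0][0])    # → Dhumdhadaka
print(desc_movies_by_year[0][0])   # → Pachhadlela
```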
Output: Ascending by year (from 1985 to 2004)
Limit
• Use the LIMIT keyword to get only a limited number
of results from a relation.
grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_5_movies;
Pig: Modes of Execution
• Pig programs can be run in three ways, all of which
work in both local and MapReduce mode:
– Script Mode
– Grunt Mode
– Embedded Mode
Script mode
• Script Mode or Batch Mode: In script mode, Pig runs
the commands specified in a script file. The following
example shows how to run a Pig program from a
script file:
$ vim scriptfile.pig
   A = LOAD 'script_file';
   DUMP A;
$ pig -x local scriptfile.pig
Grunt mode
• Grunt Mode or Interactive Mode: Grunt mode is also
called interactive mode. Grunt is Pig's interactive shell.
It is started when no file is specified for Pig to run.
$ pig -x local
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;
• You can also run pig scripts from grunt using run and exec
commands.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded mode
• You can embed Pig programs in Java, Python, and
Ruby, and run them from those languages.
Example: Wordcount program
• Q) How do we find the number of occurrences of the
words in a file using a Pig script?
• You can find the famous word-count example written
as a MapReduce program on the Apache website. Here we
will write a simple Pig script for the word-count
problem.
• The Pig script on the next slide finds the number of
times each word is repeated in a file:
Example text file: shivneri.txt
Example: Wordcount program
lines = LOAD 'shivneri.txt' AS
(line:chararray);
words = FOREACH lines GENERATE
FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group,
COUNT(words);
DUMP w_count;
Save the script above as forts.pig.
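The logic of the word-count script, sketched in plain Python: TOKENIZE plus FLATTEN become str.split inside a flattening comprehension, and GROUP plus COUNT become a Counter. The inline sample lines stand in for the contents of shivneri.txt (which is not shown here):

```python
from collections import Counter

# Python analogue of the word-count Pig script.
lines = ['shivneri fort is near junnar', 'shivneri fort']

words = [word for line in lines for word in line.split()]   # FLATTEN(TOKENIZE(line))
w_count = Counter(words)                                    # GROUP words BY word; COUNT
print(w_count['shivneri'], w_count['fort'])                 # → 2 2
```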
Output snapshot
$ pig -x local forts.pig
References
• “Programming Pig” by Alan Gates, O'Reilly
Publishers.
• “Pig Design Patterns” by Pradeep Pasupuleti,
PACKT Publishing
• Tutorials Point
• http://github.com/rohitdens
• http://pig.apache.org
tushar@tusharkute.com
Thank you
This presentation was created using LibreOffice Impress 4.2.8.2 and can be used freely under the GNU General Public License.
Blogs
http://digitallocha.blogspot.in
http://kyamputar.blogspot.in
Web Resources
http://tusharkute.com
