Prepared by,
Vetri.V
What is Pig?
 A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and Map/Reduce clusters.
 Pig is a scripting language.
 No compiler
 Rapid prototyping.
 Command line prompt (grunt shell)
 Pig is a domain specific language
 No control flow (no if/then/else)
 Specific to data flows
 Not for writing ray tracers
 But well suited to distributing a pre-existing ray tracer.
What isn’t pig?
 A general framework for all distributed computation.
 Pig is map/reduce, just easier.
 A general purpose language.
 No scope
 Minimal variable support
 No control flow
Why we need Pig?
 Writing native Map/Reduce is hard
 Difficult to make abstractions
 Extremely verbose
 400 lines of Java can become <30 lines of Pig.
 Joins are very difficult.
 A big motivator for Pig.
 Basically, everything about Java M/R is painful.
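As a taste of that conciseness, a hypothetical three-line join (file names and schemas are illustrative):
users = LOAD 'users.txt' AS (id:int, name:chararray);
orders = LOAD 'orders.txt' AS (uid:int, amount:double);
both = JOIN users BY id, orders BY uid;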
Pig has two execution types or modes:
 Local mode and Map/Reduce mode.
Local Mode:
 To run the scripts in local mode, no Hadoop or HDFS installation is required.
All files are installed and run from your local host and file system.
Map/reduce Mode:
 To run the scripts in map/reduce mode, you need access to a Hadoop cluster
and HDFS installation.
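The mode is chosen when launching Pig (map/reduce is the default):
$ pig -x local script.pig
$ pig -x mapreduce script.pig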
Local mode
 In local mode, Pig runs in a single JVM and accesses the local file system. This
mode is suitable only for small datasets and when trying out Pig.
Running the Pig Scripts in Local Mode:
 To run the Pig scripts in local mode, do the following:
1. Move to the pigtmp directory.
2. Execute the following command using script1-local.pig (or script2-local.pig):
$ pig -x local script1-local.pig
The output may contain a few Hadoop warnings, which can be ignored.
3. A directory named script1-local-results.txt (or script2-local-results.txt) is
created. This directory contains the results file, part-r-00000.
Running the Pig Scripts in Map/reduce Mode:
 To run the Pig scripts in mapreduce mode, do the following:
1. Move to the pigtmp directory.
2. Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory.
$ hadoop fs -copyFromLocal excite.log.bz2 .
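3. Run the script on the cluster (assuming the script1-hadoop.pig naming used by the standard Pig tutorial):
$ pig script1-hadoop.pig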
PIG INSTALLATIONS:
 Before installing Pig, make sure Ant is installed on the machine.
 Step 1:
Download Pig tarball
 Step 2:
Untar Pig
 Step 3:
Set environment Variable HADOOP_HOME and HADOOP_CONF_DIR
 Step 4:
$cd /usr/local/pig
 Step 5:
$ant
 Step 6:
$cd contrib/piggybank/java
 Step 7:
$ant
This will generate /usr/local/pig/contrib/piggybank/java/piggybank.jar.
Test your Pig Installations:
$ export PIG_HOME=/usr/local/pig
$ export HADOOP_HOME=/opt/hadoop
$ pig
grunt> copyFromLocal /etc/passwd /tmp/passwd
grunt> A = load '/tmp/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> DUMP B;
(root)
(bin)
(daemon)
(adm)
Data Model- Pig:
 This includes Pig's data types, how it handles concepts like missing data, and
how you can describe your data to Pig.
Data types:
 It has two types.
 Scalar types.
 Int (ints are represented in interfaces by java.lang.Integer)
 Long (longs are represented in interfaces by java.lang.Long)
 Float (floats are represented in interfaces by java.lang.Float)
 Double (doubles are represented in interfaces by java.lang.Double)
 Chararray (chararrays are represented in interfaces by java.lang.String)
 Bytearray
 Complex types.
 Complex Types: Tuples
o Every row in a relation is a tuple.
o Allows random access.
o Wraps ArrayList<Object>
o Must fit in memory.
o Can have a schema, but is not enforced.
 Potential for optimization.
 Complex types: Bags
o Pig’s only spillable data structure.
 Full structure does not have to fit in memory
o Two key operations
 Add a tuple.
 Iterate over Tuples
o No random access.
o No order guarantees.
o The object version of a relation.
 Every row is a tuple.
 Complex types: Maps
o Wraps HashMap<String,Object>
 Keys must be string.
o Must fit in memory
o Can be cumbersome to use.
 Value types are poorly understood in scripts.
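A short sketch of map access using the # dereference operator (file name and key are illustrative):
logs = LOAD 'logs.txt' AS (props:map[]);
ips = FOREACH logs GENERATE props#'ip' AS ip;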
Data types and syntax:
Datatype   Syntax                                             Example
int        int                                                as (a:int)
long       long                                               as (a:long)
float      float                                              as (a:float)
double     double                                             as (a:double)
chararray  chararray                                          as (a:chararray)
bytearray  bytearray                                          as (a:bytearray)
map        map[] or map[type], where type is any valid        as (a:map[], b:map[int])
           type; this declares all values in the map to
           be of that type
tuple      tuple() or tuple(list_of_fields), where            as (a:tuple(), b:tuple(x:int, y:int))
           list_of_fields is a comma-separated list of
           field declarations
bag        bag{} or bag{t:(list_of_fields)}, where            as (a:bag{}, b:bag{t:(x:int, y:int)})
           list_of_fields is a comma-separated list of
           field declarations; note that, oddly enough,
           the tuple inside the bag must have a name
           (here t), even though you will never be able
           to access that tuple directly
Understanding a basic pig script:
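A minimal sketch of such a script (file name and schema are illustrative):
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
twenties = FILTER students BY age >= 20;
names = FOREACH twenties GENERATE first, last;
STORE names INTO 'out';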
Examples: Word Count:
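A hedged sketch of the classic word-count flow (input path and aliases are illustrative):
lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';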
The Pig object model:
Relations:
 Fundamental building block
 Analogous to a table, not a variable.
A detour: firing up pig:
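Starting the grunt shell, for example in local mode:
$ pig -x local
grunt>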
Loading data:
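A representative LOAD, assuming a tab-delimited students file:
grunt> students = LOAD 'students.txt' USING PigStorage('\t') AS (first:chararray, last:chararray, age:int, dept:chararray);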
Projection (aka FOREACH):
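Continuing with the illustrative students relation just loaded:
pruned = FOREACH students GENERATE first, age;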
 Foreach means do something on every row in a relation.
 Creates a new relation.
 In this example, pruned is a new relation whose columns will be first and age.
 With schemas, can also use column aliases.
Nested FOREACH:
 An advanced, but extremely powerful use of FOREACH lets a script do more
analysis on the reducer.
 Imagine we wanted the distinct number of ages per department.
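A sketch of that query as a nested FOREACH (aliases are illustrative):
by_dept = GROUP students BY dept;
distinct_ages = FOREACH by_dept {
    ages = DISTINCT students.age;
    GENERATE group AS dept, COUNT(ages) AS num_ages;
};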
Schemas and types:
 Schema-less analysis is useful.
 Many data sources don’t have clear types.
 But schemas are also very useful.
 Ensuring correctness.
 Aiding optimization
 Schema gives an alias and type to the columns.
 Absent columns will be made null.
 Extra columns will be thrown out.
DESCRIBE prints the schema of a relation:
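For example, with the students relation loaded earlier (output format is illustrative):
grunt> DESCRIBE students;
students: {first: chararray,last: chararray,age: int,dept: chararray}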
Schemas vs. types:
 Schemas:
 A description of the types present.
 Used to help maintain correctness.
 Generally not enforced once script is run.
 Types:
 Describes the data present in a column.
 Generally parallel java types.
Type overview:
 Pig has a nested object model.
 Nested types
 Complex objects can contain other objects.
 Pig primitives mirror java primitives.
 String, Int, Long, Float, Double.
 DataByteArray wraps a byte[].
 Working to add native Date Time support, and more.
Filter:
 A predicate is evaluated for each row.
 If false, the row is thrown out.
 Supports complicated predicates
 Boolean logic
 Regular expressions
 See Pig documentation for more.
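Two representative filters over the illustrative students relation:
adults = FILTER students BY age >= 21;
cs_only = FILTER students BY dept MATCHES 'CS.*';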
Grouping:
 Abstractly, grouping creates a relation with unique keys and the associated
rows.
 Example: how many people in our data set are in each department? (sketched below)
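by_dept = GROUP students BY dept;
dept_counts = FOREACH by_dept GENERATE group AS dept, COUNT(students) AS cnt;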
Visualizing the group:
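With hypothetical students data, the grouped relation looks roughly like this:
grunt> DUMP by_dept;
(CS,{(alice,smith,20,CS),(bob,jones,22,CS)})
(EE,{(carol,lee,21,EE)})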
USING GROUP: an example:
 Goal: what % does each age group make of the total?
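One hedged way to express this, using a GROUP ALL total and a scalar projection (covered later under Scalar projection); aliases are illustrative:
by_age = GROUP students BY age;
age_cnt = FOREACH by_age GENERATE group AS age, COUNT(students) AS cnt;
everyone = GROUP students ALL;
total = FOREACH everyone GENERATE COUNT(students) AS n;
pct = FOREACH age_cnt GENERATE age, 100.0 * cnt / total.n;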
Understanding group
Groups: a retrospective:
 Grouping does not change the data.
 It reorganizes it based on the given key.
 Can group on multiple keys.
 The first column is always called group.
 A compound group key will be a tuple (“group”) whose elements are
the keys.
 The second column is a bag.
 Its name is the name of the grouped relation.
 It contains every row associated with the key.
Flattening:
 Flatten is the opposite of group.
 Turns tuples into columns.
 Turns bags into rows.
Flattening Tuples (cont):
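A sketch of a tuple flatten (schema is illustrative):
A = LOAD 'data' AS (t:tuple(x:int, y:int));
B = FOREACH A GENERATE FLATTEN(t);
-- B's schema is (t::x: int, t::y: int): one column has become two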
Flattening Bags:
 Syntax is the same as flattening tuples, but the idea is different.
 Tuples contain columns, so flattening a tuple turns one column into many
columns.
 Bags contain rows, so flattening a bag turns one row into many rows.
The data is the same, just with different nesting: before the flatten, the rows are
divided into different bags.
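A sketch of a bag flatten, using the grouped students from before:
flat = FOREACH by_dept GENERATE group AS dept, FLATTEN(students);
-- one output row per tuple that was inside each department's bag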
Flattening Bags (cont):
 The schema indicates what is going on.
 Group goes from a flat structure to a nested one.
 Flatten goes from a nested structure to a flat one.
 Now that we have seen grouping, flatten becomes a useful companion operation.
Fun with flattens:
 The group example can be done with flattens.
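For instance, grouping and then flattening the bag reproduces the original relation's rows (a sketch; aliases are illustrative):
by_dept = GROUP students BY dept;
again = FOREACH by_dept GENERATE FLATTEN(students);
-- again has the same rows (and schema) as students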
Flattening multiple Bags:
 The result from multiple flatten statements will be crossed.
 To only select a few columns in a Bag, the syntax is bag_alias.(col1, col2)
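A sketch with a hypothetical relation g that has two bag columns, b1 and b2:
crossed = FOREACH g GENERATE FLATTEN(b1), FLATTEN(b2);
-- output rows are the cross product of the two bags
partial = FOREACH g GENERATE b1.(x, y);
-- projects just columns x and y inside the bag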
Joins:
 A big motivator for Pig: easier joins.
 Compares relations using a given key.
 Outputs all combinations of rows with equal keys.
 See appendix for more variations.
 Example: how many credits does each student need?
 The joined schema is the concatenation of the joined relations’ schemas.
 The relation name is prefixed to aliases (relation::alias) in case of ambiguity.
 In this case, there are two “dept” aliases, as in the sketch below.
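A sketch of the credits question (the depts file is illustrative; students is as before):
depts = LOAD 'depts.txt' AS (dept:chararray, credits:int);
j = JOIN students BY dept, depts BY dept;
-- the ambiguous columns become students::dept and depts::dept
needed = FOREACH j GENERATE first, last, credits;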
Order by:
 Order by globally sorts a relation on a key (or set of keys).
 Global sort not guaranteed to be preserved through other transformations.
 A store after a global sort will result in one or more globally sorted part files.
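For example:
sorted = ORDER students BY age DESC;
STORE sorted INTO 'students_by_age';
-- each part file under students_by_age is globally sorted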
Extending pig: UDFs
 UDFs, coupled with Pig’s object model, allow for extensive transformation and
analysis.
What is a UDF?
 A user defined function (UDF) is a Java class that extends EvalFunc<T>,
and can be used in a Pig script.
 Additional support for functions in Python, Ruby, and Groovy.
 Much of the core Pig functionality is actually implemented in UDFs.
 COUNT in the previous example.
 Useful for learning how to implement your own.
 src/org/apache/pig/builtin has many examples.
Types of UDFs:
 EvalFunc<T>
 Simple, one to one functions.
 Accumulator<T>
 Many to one.
 Left associative, NOT commutative.
 Algebraic<T>
 Many to one
 Associative, commutative.
 Makes use of combiners.
 All UDFs must return Pig types.
 Even at intermediate stages.
EvalFunc <T>:
 Simplest kind of UDF.
 Only need to implement an exec function.
 Not ideal for “many to one” functions that vastly reduce the amount of data
(such as SUM or COUNT).
 In these cases, Algebraics are superior.
 src/org/apache/pig/builtin/TOKENIZE.java is a nontrivial example.
A basic UDF:
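A minimal EvalFunc sketch in Java (class and jar names are illustrative):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {
    // exec is called once per input tuple.
    public String exec(Tuple input) throws IOException {
        // Return null on empty or null input rather than failing the task.
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return input.get(0).toString().toUpperCase();
    }
}
Used from a script after registering the jar it is packaged in:
REGISTER myudfs.jar;
upper_names = FOREACH students GENERATE ToUpper(first);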
Accumulator <T>:
 Used when the input is a large bag, but order matters.
 Allows you to work on the bag incrementally, which can be much more
memory efficient.
 The difference from Algebraic UDFs is generally that you need to work on the
data in a given order.
 Used for session analysis, when you need to analyze events in the
order they actually occurred.
 src/org/apache/pig/builtin/COUNT.java is an example.
 It also implements Algebraic (most Algebraic functions are also
Accumulators).
Algebraic:
 Commutative, algebraic functions.
 You can apply the function to any subset of the data (even partial
results) in any order.
 The most efficient.
 Takes advantage of hadoop combiners.
 Also the most complicated.
 src/org/apache/pig/builtin/COUNT.java is an example.
What is a Map/Reduce barrier?
 A Map/Reduce barrier is a part of a script that forces a reduce stage.
 Some scripts can be done with just mappers.
students = LOAD 'students.txt' AS (first:chararray, last:chararray, age:int, dept:chararray);
students_filtered = FILTER students BY age >= 20;
students_proj = FOREACH students_filtered GENERATE last, dept;
 But most will need the full map/reduce cycle.
 GROUP is the difference: a “map/reduce barrier” which requires a
reduce step, as in the sketch below.
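Adding a GROUP to the script above forces the reduce stage:
dept_groups = GROUP students_proj BY dept;
dept_counts = FOREACH dept_groups GENERATE group AS dept, COUNT(students_proj) AS cnt;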
Map/Reduce implications of operators:
 What will cause a map/reduce job?
 GROUP and COGROUP
 JOIN
 Excluding replicated join.
 CROSS
 To be avoided unless you are absolutely certain you need it.
 Potential for huge explosion in data
 ORDER
 DISTINCT
 What will cause multiple map reduce jobs?
 Multiple uses of the above operations.
 Forking code paths.
 First step is identifying the M/R barriers.
Job 1:
Job 2:
Job 3:
A DAG example:
Projections:
 Projection reduces the amount of data being processed.
 Especially important between the map and reduce stages, when data goes
over the network.
Scalar projection:
 All interactions and transformations in Pig are done on relations.
 Sometimes, we want access to an aggregate.
 Scalar projection allows us to use intermediate aggregate results in a
script.
SUM, COUNT, COUNT_STAR:
 In general, SUM, COUNT, and other aggregates implicitly work on the first
column.
 COUNT skips rows whose first field is null.
 COUNT_STAR counts all rows, including nulls.
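A sketch of the difference (data is an illustrative relation):
g = GROUP data ALL;
c = FOREACH g GENERATE COUNT(data);       -- skips rows whose first field is null
cs = FOREACH g GENERATE COUNT_STAR(data); -- counts every row, nulls included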
Sorting:
 Sorting is a global operation, but can be distributed.
 Must approximate distribution of the sort key.
 Imagine evenly distributed data between 1 and 100. With 10 reducers, can
send 1-10 to computer 1, 11-20 to computer 2, and so on.
 In this way, the computation is distributed but the sort is global.
 Pig inserts a sampling job before an ORDER BY to estimate the key distribution.
Spilling:
 Spilling means that, at any time, a data structure can be asked to write itself to
disk.
 In Pig, there is a memory usage threshold.
 This is why you can only add to bags, or iterate over them.
 Adding could force a spill to disk.
 Iterating can mean having to go to disk for the contents.
JOIN OPTIMIZATIONS:
 Pig has three join optimizations. Using them can potentially make jobs run
MUCH faster.
 Replicated join
 a = JOIN rel1 BY x, rel2 BY y USING 'replicated';
 Skewed join
 a = JOIN rel1 BY x, rel2 BY y USING 'skewed';
 Merge join
 a = JOIN rel1 BY x, rel2 BY y USING 'merge';
 Replicated join:
 Can be used when:
 Every relation besides the left-most relation can fit in
memory.
 Will invoke a map-side join.
 Will load all other relations into memory in the mapper and
do the join in place.
 Where applicable, massive resource savings.
 Skewed join:
 Useful when one of the relations being joined has a key which
dominates.
 Web logs, for example, often have a logged-out user ID which
can be a large % of the keys.
 The algorithm first samples the key distribution, and then replicates
the most popular keys.
 Some overhead, but worth it in cases of bad skew.
 Only works if there is skew in one relation
 If both relations have skew, the join degenerates to a cross,
which is unavoidable.
 Merge join:
 This is useful when you have relations that are already ordered.
 Cutting-edge versions let you put an ORDER BY before the merge join.
 Will index the blocks that correspond to the relations, then will do a
traditional merge algorithm.
 Huge savings when applicable.
What happens when you run a script?
 First, pig parses your script using ANTLR.
 The parser creates an intermediate representation (AST).
 The AST is converted to a logical plan.
 The logical plan is optimized, and then converted to a physical plan.
 The physical plan is optimized, and then converted to a series of Map/Reduce
jobs.
JOIN (inner)
 Performs an inner equijoin of two or more relations based on common field
values.
Example
 Suppose we have relations A and B.
 A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
 B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
 X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
JOIN (outer)
 Performs an outer join of two or more relations based on common field
values.
 It supports left outer, right outer, and full outer joins, using the keywords
LEFT OUTER, RIGHT OUTER, and FULL OUTER.
Example:
 Left outer join
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
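The other variants follow the same shape:
D = JOIN A BY $0 RIGHT OUTER, B BY $0;
E = JOIN A BY $0 FULL OUTER, B BY $0;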
LIMIT
 Limits the number of output tuples.
 Usage
 X = LIMIT A 3; (where X is the output relation and 3 is the maximum number of tuples)
--- Thank you ---
