BIG DATA ANALYSIS TRAINING
Vibrant Technologies & Computers.
Terminology
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper
Some MapReduce Terminology
• Job – A “full program”: an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data
  ◦ a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
  ◦ If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
  ◦ Task ID from TaskInProgress is not a unique identifier; don’t use it that way
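The retry behaviour described above can be pictured with a small sketch. This is not Hadoop's scheduler: the names `run_with_retries`, `flaky_task`, and `always_crashes`, and the limit of 4 attempts, are assumptions chosen for illustration.

```python
def run_with_retries(task, task_input, max_attempts=4):
    """Run `task` on `task_input`, starting a new attempt after each crash.

    Returns (result, attempts_used). The result is None when every attempt
    crashed and the input was abandoned. max_attempts=4 is an assumed limit.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except Exception:
            continue  # this attempt crashed; try again
    return None, max_attempts


calls = {"count": 0}


def flaky_task(text):
    """Hypothetical task that crashes on its first two attempts."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated crash")
    return text.upper()


def always_crashes(text):
    """Hypothetical task whose input can never be processed."""
    raise ValueError("bad record")


ok = run_with_retries(flaky_task, "slice-0")
abandoned = run_with_retries(always_crashes, "slice-1")
```

Each attempt here is sequential; speculative execution, where attempts overlap in time, is not modelled.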
MapReduce: High Level
In our case: circe.rc.usf.edu
Nodes, Trackers, Tasks
• Master node runs a JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks a separate Java process for task instances
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where’s the data distribution?
Data Distribution
• Implicit in design of MapReduce!
  ◦ All mappers are equivalent; so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
  ◦ Data transfer is handled implicitly by HDFS
What Happens In Hadoop?
Depth First
Job Launch Process: Client
• Client program creates a JobConf
  ◦ Identify classes implementing Mapper and Reducer interfaces
    ▪ JobConf.setMapperClass(), setReducerClass()
  ◦ Specify inputs, outputs
    ▪ FileInputFormat.setInputPaths(),
    ▪ FileOutputFormat.setOutputPath()
  ◦ Optionally, other options too:
    ▪ JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
  ◦ runJob() blocks, submitJob() does not
• JobClient:
  ◦ Determines proper division of input into InputSplits
  ◦ Sends job data to master JobTracker server
Job Launch Process: JobTracker
• JobTracker:
  ◦ Inserts jar and JobConf (serialized to XML) in shared location
  ◦ Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
  ◦ main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
  ◦ Sets up the child TaskInProgress attempt
  ◦ Reads XML configuration
  ◦ Connects back to necessary MapReduce components via RPC
  ◦ Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
  ◦ Task knows ahead of time which InputSplits it should be mapping
  ◦ Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
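The "calls Mapper once for each record" step can be pictured as a plain loop. The sketch below is a simplified Python stand-in, not Hadoop's MapRunner; `run_mapper`, `word_mapper`, and the sample records are hypothetical.

```python
def run_mapper(records, mapper):
    """Call `mapper` once per (key, value) record and gather what it emits."""
    collected = []
    for key, value in records:
        # `collected.append` plays the role of OutputCollector.collect()
        mapper(key, value, collected.append)
    return collected


def word_mapper(offset, line, emit):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        emit((word, 1))


# Records as a line-oriented reader might deliver them: (offset, line text)
split_records = [(0, "to be or"), (9, "not to be")]
pairs = run_mapper(split_records, word_mapper)
```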
Creating the Mapper
• You provide the instance of Mapper
  ◦ Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
  ◦ Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
• void map(K1 key,
           V1 value,
           OutputCollector<K2, V2> output,
           Reporter reporter)
• K types implement WritableComparable
• V types implement Writable
What is Writable?
• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
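To make the "box" idea concrete, here is a rough Python analogue of an integer box that can serialize itself and be ordered. The class `IntBox` and its method names are hypothetical; the 4-byte big-endian layout mirrors how a Java `int` is normally written to a data stream, and is used here only as an illustration.

```python
import struct
from functools import total_ordering


@total_ordering
class IntBox:
    """Hypothetical stand-in for a Writable integer that can also be compared."""

    def __init__(self, value=0):
        self.value = value

    def write(self):
        """Serialize to 4 bytes, big-endian, signed."""
        return struct.pack(">i", self.value)

    def read_fields(self, data):
        """Overwrite this box from 4 serialized bytes (the box is reusable)."""
        self.value = struct.unpack(">i", data)[0]

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value


wire = IntBox(258).write()
restored = IntBox()
restored.read_fields(IntBox(-7).write())
ordered = [box.value for box in sorted([IntBox(3), IntBox(1), IntBox(2)])]
```

Values only need the write/read half; keys also need the ordering half, since they are sorted before reduction.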
Getting Data To The Mapper
Reading Data
• Data sets are specified by InputFormats
  ◦ Defines input data (e.g., a directory)
  ◦ Identifies partitions of the data that form an InputSplit
  ◦ Factory for RecordReader objects to extract (k, v) records from the input source
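The two jobs of an InputFormat, cutting the input into splits and turning each split into records, can be sketched on an in-memory string. This is a simplified illustration, not Hadoop's implementation: `get_splits` and `read_records` are hypothetical names, offsets are character offsets, and the rule "a line belongs to the split in which it begins" is the assumption that keeps every line from being read twice or lost.

```python
def get_splits(total_length, split_size):
    """Cut [0, total_length) into (start, length) chunks of at most split_size."""
    return [(start, min(split_size, total_length - start))
            for start in range(0, total_length, split_size)]


def read_records(data, start, length):
    """Yield (offset, line) for every line that begins inside the split."""
    end = start + length
    if start == 0:
        pos = 0
    else:
        # Skip ahead to the first line that begins at or after `start`;
        # any partial line before it belongs to the previous split.
        newline = data.find("\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find("\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end]
        pos = line_end + 1


data = "alpha\nbeta\ngamma\n"
splits = get_splits(len(data), 7)
records = [rec for start, length in splits
           for rec in read_records(data, start, length)]
```

Note that a reader may read past the end of its own split to finish a line it started, which is why the chunk boundaries do not need to fall on line breaks.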
FileInputFormat and Friends
• TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat – Maps ‘\n’-terminated text lines of “k SEP v”
• SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata
• SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
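The difference between the two text formats is easiest to see on the same input. In this hedged sketch, `text_records` and `key_value_records` are hypothetical helpers; the tab separator and the "no separator means empty value" rule are assumptions about the default behaviour.

```python
def text_records(text):
    """TextInputFormat-style: key is the line's starting offset, value is the line."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)


def key_value_records(text, sep="\t"):
    """KeyValueTextInputFormat-style: split each line at the first separator."""
    for line in text.splitlines():
        key, _, value = line.partition(sep)
        yield key, value  # value is "" when the line has no separator


sample = "a\tx\nbb\n"
as_text = list(text_records(sample))
as_key_value = list(key_value_records(sample))
```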
Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
  ◦ e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
Record Readers
• Each InputFormat provides its own RecordReader implementation
  ◦ Provides (unused?) capability for multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueLineRecordReader – Used by KeyValueTextInputFormat
Input Split Size
• FileInputFormat will divide large files into chunks
  ◦ Chunk size is influenced by mapred.min.split.size, which sets a lower bound
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
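How the minimum split size interacts with the block size and the requested number of map tasks can be sketched as arithmetic. The formula below reflects my understanding of the old-API FileInputFormat and should be read as an assumption; the function names are hypothetical, and the small slack factor Hadoop applies to the final chunk is deliberately ignored.

```python
import math


def compute_split_size(goal_size, min_size, block_size):
    """Assumed rule: never below min_size, otherwise the smaller of goal and block."""
    return max(min_size, min(goal_size, block_size))


def estimate_num_splits(file_size, requested_maps, min_size, block_size):
    """Rough split count for one file; requested_maps is only a hint."""
    goal_size = file_size // max(requested_maps, 1)
    split_size = compute_split_size(goal_size, min_size, block_size)
    return math.ceil(file_size / split_size)


MB = 1024 * 1024
# 1000 MB file, 64 MB blocks, 4 maps requested, minimum split size of 1 byte
default_case = estimate_num_splits(1000 * MB, 4, 1, 64 * MB)
# Same file with the minimum split size raised to 128 MB
raised_minimum = estimate_num_splits(1000 * MB, 4, 128 * MB, 64 * MB)
# Requesting many maps shrinks the goal size below the block size
many_maps = estimate_num_splits(1000 * MB, 400, 1, 64 * MB)
```

Under these assumptions, asking for 4 maps still yields 16 splits because the block size caps the chunk size, while raising the minimum to 128 MB halves the count.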
Sending Data To Reducers
• Map function receives OutputCollector object
  ◦ OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
• By default, mapper output type assumed to be same as reducer output type
WritableComparator
• Compares WritableComparable data
  ◦ Will call WritableComparable.compareTo()
  ◦ Can provide fast path for serialized data
• JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
  ◦ incrCounter(Enum key, long amount)
  ◦ setStatus(String msg)
• Allows self-identification of input
  ◦ InputSplit getInputSplit()
Partition And Shuffle
Partitioner
• int getPartition(key, val, numPartitions)
  ◦ Outputs the partition number for a given key
  ◦ One partition == values sent to one Reduce task
• HashPartitioner used by default
  ◦ Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
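A hash partitioner can be sketched in a few lines. The version below masks the hash to a non-negative value and takes it modulo the number of partitions, which is how I understand HashPartitioner to work. As an assumption for illustration, `java_string_hash` reproduces Java's String hashing rule for ASCII text; Hadoop's Text class computes its own hash, so real partition numbers may differ.

```python
def java_string_hash(text):
    """Java-style String.hashCode() for ASCII text, as a signed 32-bit int."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key, num_partitions):
    """Map a key to a partition in [0, num_partitions)."""
    # Masking with the largest positive 32-bit int drops the sign bit,
    # so negative hash codes cannot produce a negative partition number.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_partitions


words = ["the", "quick", "brown", "fox", "the"]
assignments = [get_partition(word, 4) for word in words]
```

Because the partition depends only on the key, both occurrences of "the" land in the same partition, and therefore at the same Reduce task.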
Reduction
• reduce( K2 key,
          Iterator<V2> values,
          OutputCollector<K3, V3> output,
          Reporter reporter )
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
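What a single reduce task does with its partition, sort by key, group the values, call reduce once per key, can be simulated in memory. This is an illustrative sketch only; `run_reduce` and `sum_reducer` are hypothetical names and the real framework merges sorted files rather than sorting a list.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(pairs, reducer):
    """Sort (key, value) pairs by key, then call `reducer` once per distinct key."""
    output = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (value for _, value in group)  # an iterator, as in reduce()
        reducer(key, values, output.append)
    return output


def sum_reducer(key, values, emit):
    """Emit the key together with the sum of its values."""
    emit((key, sum(values)))


partition = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
reduced = run_reduce(partition, sum_reducer)
```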
Finally: Writing The Output
OutputFormat
• Analogous to InputFormat
• TextOutputFormat – Writes “key val\n” strings to output file (key and value are tab-separated by default)
• SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs
• NullOutputFormat – Discards output
  ◦ Only useful if defining own output methods within reduce()
Example Program - Wordcount
• map()
  ◦ Receives a chunk of text
  ◦ Outputs a set of word/count pairs
• reduce()
  ◦ Receives a key and all its associated values
  ◦ Outputs the key and the sum of the values

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
Wordcount – main( )
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the (key, value) pairs the job outputs
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);  // blocks until the job finishes
    }
Wordcount – map( )
    public static class Map extends MapReduceBase … {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, …) … {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Emit (word, 1) for every token in the line
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
Wordcount – reduce( )
    public static class Reduce extends MapReduceBase … {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, …) … {
            // Add up all the counts collected for this word
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
Hadoop Streaming
• Allows you to create and run map/reduce jobs with any executable
• Similar to unix pipes, e.g.:
  ◦ format is: Input | Mapper | Reducer
  ◦ echo "this sentence has five lines" | cat | wc
Hadoop Streaming
• Mapper and Reducer receive data from stdin and output to stdout
• Hadoop takes care of the transmission of data between the map/reduce tasks
  ◦ It is still the programmer’s responsibility to set the correct key/value
  ◦ Default format: “key \t value\n”
• Let’s look at a Python example of a MapReduce word count program…
Streaming_Mapper.py
import sys

# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()   # string
    words = line.split()  # list of strings
    # write data on stdout as "word<TAB>1"
    for word in words:
        print('%s\t%i' % (word, 1))
Hadoop Streaming
• What are we outputting?
  ◦ Example output: “the 1”
  ◦ By default, “the” is the key, and “1” is the value
• Hadoop Streaming handles delivering this key/value pair to a Reducer
  ◦ Able to send pairs with the same key to the same Reducer or to an intermediary Combiner
Streaming_Reducer.py
import sys

wordcount = {}  # empty dictionary: word -> running total
# read in one line of input at a time from stdin
for line in sys.stdin:
    line = line.strip()  # string
    key, value = line.split('\t', 1)  # "word<TAB>count"
    # value arrives as text, so convert it before adding
    wordcount[key] = wordcount.get(key, 0) + int(value)
# write data on stdout
for word, count in sorted(wordcount.items()):
    print('%s\t%i' % (word, count))
Hadoop Streaming Gotcha
• Streaming Reducer receives single lines (which are key/value pairs) from stdin
  ◦ Regular Reducer receives a collection of all the values for a particular key
  ◦ It is still the case that all the values for a particular key will go to a single Reducer
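One way to cope with this gotcha is to rebuild the per-key grouping inside the streaming reducer. The sketch below assumes the lines arrive sorted by key, so equal keys are adjacent, and that each line is "key<TAB>integer"; `parse` and `reduce_sorted_lines` are hypothetical names. Unlike a dictionary-based reducer, it holds only one key's running total at a time.

```python
from itertools import groupby


def parse(lines):
    """Turn 'key<TAB>value' lines into (key, int(value)) pairs."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reduce_sorted_lines(lines):
    """Sum the values of each run of adjacent equal keys."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(value for _, value in group))


sorted_input = ["be\t1\n", "be\t1\n", "not\t1\n", "to\t1\n", "to\t1\n"]
summed = list(reduce_sorted_lines(sorted_input))
```

In a real streaming job the same function could be fed `sys.stdin` and its results printed line by line.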
Using Hadoop Distributed File System (HDFS)
• Can access HDFS through various shell commands (see Further Resources slide for link to documentation)
  ◦ hadoop fs -put <localsrc> … <dst>
  ◦ hadoop fs -get <src> <localdst>
  ◦ hadoop fs -ls
  ◦ hadoop fs -rm file
Configuring Number of Tasks
• Normal method
  ◦ jobConf.setNumMapTasks(400)
  ◦ jobConf.setNumReduceTasks(4)
• Hadoop Streaming method
  ◦ -jobconf mapred.map.tasks=400
  ◦ -jobconf mapred.reduce.tasks=4
• Note: the number of map tasks is only a hint to the framework. The actual number depends on the number of InputSplits generated
Running a Hadoop Job
• Place input file into HDFS:
  ◦ hadoop fs -put ./input-file input-file
• Run either normal or streaming version:
  ◦ hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file
  ◦ hadoop jar hadoop-streaming.jar \
      -input input-file \
      -output output-file \
      -file Streaming_Mapper.py \
      -mapper "python Streaming_Mapper.py" \
      -file Streaming_Reducer.py \
      -reducer "python Streaming_Reducer.py"
Submitting to RC’s GridEngine
• Add appropriate modules
  ◦ module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
• Use the submit script posted in the Further Resources slide
  ◦ Script calls internal functions hadoop_start and hadoop_end
• Adjust the lines for transferring the input file to HDFS and starting the hadoop job using the commands on the previous slide
• Adjust the expected runtime (generally good practice to overshoot your estimate)
  ◦ #$ -l h_rt=02:00:00
• NOTICE: “All jobs are required to have a hard run-time specification. Jobs that do not have this specification will have a default run-time of 10 minutes and will be stopped at that point.”
Output Parsing
• Output of the reduce tasks must be retrieved:
  ◦ hadoop fs -get output-file hadoop-output
• This creates a directory of output files, 1 per reduce task
  ◦ Output files numbered part-00000, part-00001, etc.
• Sample output of Wordcount
  ◦ head -n5 part-00000
“’tis 1
“come 2
“coming 1
“edwin 1
“found 1
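Once the output directory has been copied locally, the per-reducer part files usually need to be combined. The following is a minimal sketch under the assumptions that each line is "word<TAB>count" and that the files match `part-*`; `read_counts` is a hypothetical helper, not part of Hadoop.

```python
import glob
import os


def read_counts(output_dir):
    """Merge every part-* file in `output_dir` into one {word: count} dict."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                word, _, count = line.rstrip("\n").partition("\t")
                # Each word appears in exactly one part file, since all
                # values for a key go to a single reduce task.
                counts[word] = int(count)
    return counts
```

For example, `read_counts("hadoop-output")` would load the directory retrieved by the `hadoop fs -get` command above.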
Extra Output
• The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script)
  ◦ #$ -o output.$job_id
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = svc-3024-8-10.rc.usf.edu/10.250.4.205
…
11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
…
11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
…
11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
…
11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001
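The attempt names in this log can be taken apart to see which job, task, and attempt each line refers to. The field layout used below (job prefix, job number, m or r, task number, attempt number) is inferred from the log lines shown here and should be treated as an assumption; `parse_attempt_id` is a hypothetical helper.

```python
import re

# attempt_<prefix>_<job number>_<m|r>_<task number>_<attempt number>
ATTEMPT_ID = re.compile(r"attempt_([^_]+)_(\d+)_([mr])_(\d+)_(\d+)")


def parse_attempt_id(attempt_id):
    """Split an attempt name into its job, task type, task and attempt numbers."""
    match = ATTEMPT_ID.fullmatch(attempt_id)
    if match is None:
        raise ValueError("unrecognised attempt id: %r" % attempt_id)
    prefix, job, kind, task, attempt = match.groups()
    return {
        "job": "job_%s_%s" % (prefix, job),
        "type": "map" if kind == "m" else "reduce",
        "task": int(task),
        "attempt": int(attempt),
    }


map_attempt = parse_attempt_id("attempt_local_0001_m_000000_0")
reduce_attempt = parse_attempt_id("attempt_local_0001_r_000000_0")
```

The trailing attempt number is what distinguishes repeated or speculative attempts at the same task.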
Thank You
Big-data-analysis-training-in-mumbai

  • 1. BIG DATA ANALYIS TRAINING Vibrant Technologies & Computers.
  • 2. Terminology Google calls it: Hadoop equivalent: MapReduce Hadoop GFS HDFS Bigtable HBase Chubby Zookeeper Vibrant Technologies & Computers.
  • 3. Some MapReduce Terminology  Job – A “full program” - an execution of a Mapper and Reducer across a data set  Task – An execution of a Mapper or a Reducer on a slice of data  a.k.a. Task-In-Progress (TIP)  Task Attempt – A particular instance of an attempt to execute a task on a machine Vibrant Technologies & Computers.
  • 4. Task Attempts  A particular task will be attempted at least once, possibly more times if it crashes  If the same input causes crashes over and over, that input will eventually be abandoned  Multiple attempts at one task may occur in parallel with speculative execution turned on  Task ID from TaskInProgress is not a unique identifier; don’t use it that way Vibrant Technologies & Computers.
  • 5. MapReduce: High Level In our case: circe.rc.usf.edu Vibrant Technologies & Computers.
  • 6. Nodes, Trackers, Tasks  Master node runs JobTracker instance, which accepts Job requests from clients  TaskTracker instances run on slave nodes  TaskTracker forks separate Java process for task instances Vibrant Technologies & Computers.
  • 7. Job Distribution  MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options  Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code  … Where’s the data distribution? Vibrant Technologies & Computers.
  • 8. Data Distribution  Implicit in design of MapReduce!  All mappers are equivalent; so map whatever data is local to a particular node in HDFS  If lots of data does happen to pile up on the same node, nearby nodes will map instead  Data transfer is handled implicitly by HDFS Vibrant Technologies & Computers.
  • 9. What Happens In Hadoop? Depth First Vibrant Technologies & Computers.
  • 10. Job Launch Process: Client  Client program creates a JobConf  Identify classes implementing Mapper and Reducer interfaces  JobConf.setMapperClass(), setReducerClass()  Specify inputs, outputs  FileInputFormat.setInputPath(),  FileOutputFormat.setOutputPath()  Optionally, other options too:  JobConf.setNumReduceTasks(), JobConf.setOutputFormat()… Vibrant Technologies & Computers.
  • 11. Job Launch Process: JobClient  Pass JobConf to JobClient.runJob() or submitJob()  runJob() blocks, submitJob() does not  JobClient:  Determines proper division of input into InputSplits  Sends job data to master JobTracker server Vibrant Technologies & Computers.
  • 12. Job Launch Process: JobTracker  JobTracker:  Inserts jar and JobConf (serialized to XML) in shared location  Posts a JobInProgress to its run queue Vibrant Technologies & Computers.
  • 13. Job Launch Process: TaskTracker  TaskTrackers running on slave nodes periodically query JobTracker for work  Retrieve job-specific jar and config  Launch task in separate instance of Java  main() is provided by Hadoop Vibrant Technologies & Computers.
  • 14. Job Launch Process: Task  TaskTracker.Child.main():  Sets up the child TaskInProgress attempt  Reads XML configuration  Connects back to necessary MapReduce components via RPC  Uses TaskRunner to launch user process Vibrant Technologies & Computers.
  • 15. Job Launch Process: TaskRunner  TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper  Task knows ahead of time which InputSplits it should be mapping  Calls Mapper once for each record retrieved from the InputSplit  Running the Reducer is much the same Vibrant Technologies & Computers.
  • 16. Creating the Mapper  You provide the instance of Mapper  Should extend MapReduceBase  One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress  Exists in separate process from all other instances of Mapper – no data sharing! Vibrant Technologies & Computers.
  • 17. Mapper  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)  K types implement WritableComparable  V types implement Writable Vibrant Technologies & Computers.
  • 18. What is Writable?  Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.  All values are instances of Writable  All keys are instances of WritableComparable Vibrant Technologies & Computers.
  • 19. Getting Data To The Mapper Vibrant Technologies & Computers.
  • 20. Reading Data  Data sets are specified by InputFormats  Defines input data (e.g., a directory)  Identifies partitions of the data that form an InputSplit  Factory for RecordReader objects to extract (k, v) records from the input source Vibrant Technologies & Computers.
  • 21. FileInputFormat and Friends  TextInputFormat – Treats each ‘n’- terminated line of a file as a value  KeyValueTextInputFormat – Maps ‘n’- terminated text lines of “k SEP v”  SequenceFileInputFormat – Binary file of (k, v) pairs with some add’l metadata  SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString()) Vibrant Technologies & Computers.
  • 22. Filtering File Inputs  FileInputFormat will read all files out of a specified directory and send them to the mapper  Delegates filtering this file list to a method subclasses may override  e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list Vibrant Technologies & Computers.
  • 23. Record Readers  Each InputFormat provides its own RecordReader implementation  Provides (unused?) capability multiplexing  LineRecordReader – Reads a line from a text file  KeyValueRecordReader – Used by KeyValueTextInputFormat Vibrant Technologies & Computers.
  • 24. Input Split Size  FileInputFormat will divide large files into chunks  Exact size controlled by mapred.min.split.size  RecordReaders receive file, offset, and length of chunk  Custom InputFormat implementations may override split size – e.g., “NeverChunkFile” Vibrant Technologies & Computers.
  • 25. Sending Data To Reducers  Map function receives OutputCollector object  OutputCollector.collect() takes (k, v) elements  Any (WritableComparable, Writable) can be used  By default, mapper output type assumed to be same as reducer output type Vibrant Technologies & Computers.
  • 26. WritableComparator  Compares WritableComparable data  Will call WritableComparable.compare()  Can provide fast path for serialized data  JobConf.setOutputValueGroupingComparator() Vibrant Technologies & Computers.
  • 27. Sending Data To The Client  Reporter object sent to Mapper allows simple asynchronous feedback  incrCounter(Enum key, long amount)  setStatus(String msg)  Allows self-identification of input  InputSplit getInputSplit() Vibrant Technologies & Computers.
  • 28. Partition And Shuffle Vibrant Technologies & Computers.
  • 29. Partitioner  int getPartition(key, val, numPartitions)  Outputs the partition number for a given key  One partition == values sent to one Reduce task  HashPartitioner used by default  Uses key.hashCode() to return partition num  JobConf sets Partitioner implementation Vibrant Technologies & Computers.
  • 30. Reduction  reduce( K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter )  Keys & values sent to one partition all go to the same reduce task  Calls are sorted by key – “earlier” keys are reduced and output before “later” keys Vibrant Technologies & Computers.
  • 31. Finally: Writing The Output Vibrant Technologies & Computers.
  • 32. OutputFormat  Analogous to InputFormat  TextOutputFormat – Writes “key valn” strings to output file  SequenceFileOutputFormat – Uses a binary format to pack (k, v) pairs  NullOutputFormat – Discards output  Only useful if defining own output methods within reduce() Vibrant Technologies & Computers.
  • 33. Example Program - Wordcount  map()  Receives a chunk of text  Outputs a set of word/count pairs  reduce()  Receives a key and all its associated values  Outputs the key and the sum of the values package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.*; public class WordCount { Vibrant Technologies & Computers.
  • 34. Wordcount – main( ) public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } Vibrant Technologies & Computers.
  • 35. Wordcount – map( )
      public static class Map extends MapReduceBase … {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output, …) … {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  output.collect(word, one);
              }
          }
      }
  • 36. Wordcount – reduce( )
      public static class Reduce extends MapReduceBase … {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output, …) … {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
          }
      }
      }
  • 37. Hadoop Streaming
      • Allows you to create and run map/reduce jobs with any executable
      • Similar to unix pipes, e.g.:
          • format is: Input | Mapper | Reducer
          • echo "this sentence has five lines" | cat | wc
  • 38. Hadoop Streaming
      • Mapper and Reducer receive data from stdin and output to stdout
      • Hadoop takes care of the transmission of data between the map/reduce tasks
          • It is still the programmer’s responsibility to set the correct key/value
          • Default format: “key \t value\n”
      • Let’s look at a Python example of a MapReduce word count program…
  • 39. Streaming_Mapper.py
      import sys

      # read in one line of input at a time from stdin
      for line in sys.stdin:
          line = line.strip()    # string
          words = line.split()   # list of strings
          # write data on stdout as "word<TAB>1"
          for word in words:
              print('%s\t%i' % (word, 1))
  • 40. Hadoop Streaming
      • What are we outputting?
          • Example output: “the 1”
          • By default, “the” is the key, and “1” is the value
      • Hadoop Streaming handles delivering this key/value pair to a Reducer
          • Pairs with the same key are sent to the same Reducer, or first to an intermediary Combiner
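To make the Combiner's role concrete, here is a minimal sketch of local pre-aggregation over one mapper's output. It assumes the tab-separated “word count” lines described above; `combine` is a hypothetical function name, not a Hadoop API.

```python
from collections import Counter


def combine(mapper_lines):
    """Sum the counts for each word emitted by a single mapper."""
    totals = Counter()
    for line in mapper_lines:
        word, count = line.rstrip("\n").split("\t", 1)
        totals[word] += int(count)
    # Emit one "word<TAB>total" line per distinct word, in sorted order.
    return ["%s\t%i" % (word, total) for word, total in sorted(totals.items())]


mapper_output = ["the\t1", "cat\t1", "the\t1"]
combined = combine(mapper_output)
```

Three mapper lines shrink to two here, so less data has to travel to the Reducer; the Reducer still produces the same totals because addition can be applied in stages.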
  • 41. Streaming_Reducer.py
      import sys

      wordcount = {}    # empty dictionary
      # read in one line of input at a time from stdin
      for line in sys.stdin:
          line = line.strip()    # string
          key, value = line.split()
          # value arrives as text, so convert it to an int before adding
          wordcount[key] = wordcount.get(key, 0) + int(value)
      # write data on stdout
      for word, count in sorted(wordcount.items()):
          print('%s\t%i' % (word, count))
  • 42. Hadoop Streaming Gotcha
      • Streaming Reducer receives single lines (which are key/value pairs) from stdin
          • Regular Reducer receives a collection of all the values for a particular key
          • It is still the case that all the values for a particular key will go to a single Reducer
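One way to work with this line-at-a-time input is to rebuild the per-key grouping inside the Reducer. The sketch below assumes the lines reach the Reducer sorted by key and use the tab-separated format from the earlier slides; `parse` and `reduce_sorted` are hypothetical names.

```python
from itertools import groupby


def parse(lines):
    """Turn "key<TAB>value" lines into (key, int value) pairs, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split("\t", 1)
        yield key, int(value)


def reduce_sorted(lines):
    """Yield (key, total) for each run of consecutive identical keys."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)


sample = ["cat\t1\n", "the\t1\n", "the\t1\n"]
totals = list(reduce_sorted(sample))
```

Passing `sys.stdin` instead of `sample` and printing each pair would turn this into a streaming Reducer. Unlike the dictionary approach, it holds only one key's running total in memory at a time, but it gives wrong totals if the input is not sorted by key.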
  • 43. Using Hadoop Distributed File System (HDFS)
      • Can access HDFS through various shell commands (see Further Resources slide for link to documentation)
          • hadoop fs -put <localsrc> … <dst>
          • hadoop fs -get <src> <localdst>
          • hadoop fs -ls
          • hadoop fs -rm file
  • 44. Configuring Number of Tasks
      • Normal method
          • jobConf.setNumMapTasks(400)
          • jobConf.setNumReduceTasks(4)
      • Hadoop Streaming method
          • -jobconf mapred.map.tasks=400
          • -jobconf mapred.reduce.tasks=4
      • Note: # of map tasks is only a hint to the framework. Actual number depends on the number of InputSplits generated
  • 45. Running a Hadoop Job
      • Place input file into HDFS:
          • hadoop fs -put ./input-file input-file
      • Run either normal or streaming version:
          • hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file
          • hadoop jar hadoop-streaming.jar -input input-file -output output-file -file Streaming_Mapper.py -mapper "python Streaming_Mapper.py" -file Streaming_Reducer.py -reducer "python Streaming_Reducer.py"
  • 46. Submitting to RC’s GridEngine
      • Add appropriate modules
          • module add apps/jdk/1.6.0_22.x86_64 apps/hadoop/0.20.2
      • Use the submit script posted in the Further Resources slide
          • Script calls internal functions hadoop_start and hadoop_end
      • Adjust the lines for transferring the input file to HDFS and starting the hadoop job using the commands on the previous slide
      • Adjust the expected runtime (generally good practice to overshoot your estimate)
          • #$ -l h_rt=02:00:00
      • NOTICE: “All jobs are required to have a hard run-time specification. Jobs that do not have this specification will have a default run-time of 10 minutes and will be stopped at that point.”
  • 47. Output Parsing
      • Output of the reduce tasks must be retrieved:
          • hadoop fs -get output-file hadoop-output
      • This creates a directory of output files, 1 per reduce task
          • Output files numbered part-00000, part-00001, etc.
      • Sample output of Wordcount
          • head -n5 part-00000
            “’tis 1
            “come 2
            “coming 1
            “edwin 1
            “found 1
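Once the output directory has been retrieved, the per-reducer files can be merged for further analysis. This is a minimal sketch, assuming each part file holds tab-separated “word count” lines as in the Wordcount example; `read_counts` is a hypothetical helper name.

```python
import glob
import os


def read_counts(output_dir):
    """Collect word counts from every part-* file in output_dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # ignore blank lines
                word, count = line.split("\t", 1)
                counts[word] = counts.get(word, 0) + int(count)
    return counts
```

For example, `read_counts("hadoop-output")` would return a single dictionary covering all reduce tasks. Since each key goes to exactly one Reducer, a word should appear in only one part file; the addition is just a safeguard.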
  • 48. Extra Output
      • The stdout/stderr streams of Hadoop itself will be stored in an output file (whichever one is named in the startup script)
          • #$ -o output.$job_id

      STARTUP_MSG: Starting NameNode
      STARTUP_MSG: host = svc-3024-8-10.rc.usf.edu/10.250.4.205
      …
      11/03/02 18:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
      11/03/02 18:28:47 INFO mapred.JobClient: Running job: job_local_0001
      …
      11/03/02 18:28:48 INFO mapred.MapTask: numReduceTasks: 1
      …
      11/03/02 18:28:48 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
      11/03/02 18:28:48 INFO mapred.Merger: Merging 1 sorted segments
      11/03/02 18:28:48 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43927 bytes
      11/03/02 18:28:48 INFO mapred.JobClient: map 100% reduce 0%
      …
      11/03/02 18:28:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
      11/03/02 18:28:49 INFO mapred.JobClient: Job complete: job_local_0001