MapReduce Programming at Comet and HW2 Log Analysis
UCSB CS240A 2016. Tao Yang
Data Analysis from Web Server Logs
Startup code and data: /home/tyang/cs240sample/log
apache1.splunk.com
apache2.splunk.com
apache3.splunk.com
Example lines from the log file
10.32.1.43 - - [06/Feb/2013:00:07:00] "GET
/flower_store/product.screen?product_id=FL-DLH-02
HTTP/1.1" 200 10901
"http://mystore.splunk.com/flower_store/category.screen
?category_id=GIFTS&JSESSIONID=SD7SL1FF9ADFF2
" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10)
Gecko/20070223 CentOS/1.5.0.10-0.1.el4.centos
Firefox/1.5.0.10" 4361 3217
66.249.64.13 - -
[18/Sep/2004:11:07:48 +1000]
"GET / HTTP/1.0" 200 6433 "-"
"Googlebot/2.1"
Log Format
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000]
"GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
More Formal Definition of Apache Log
%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i"
%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the server finished processing the request
%r = Request line from the client in double quotes
%s = Status code that the server sends back to the client
%b = Size of the object returned to the client
Referer: where the request originated
User-agent: what type of agent made the request
http://www.the-art-of-web.com/system/logs/
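For HW2 you may want more than just the requested URL from each entry. Below is a minimal sketch of how the fields above can be pulled out of one combined-log line with a Java regex; the class name LogLineParser and the exact pattern are illustrative assumptions, not part of the startup code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // One capture group per field: host, identity, user, time, request, status, bytes.
    // The pattern is an illustrative assumption; adjust it to the actual log variant.
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    public static void main(String[] args) {
        String line = "66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "
                    + "\"GET / HTTP/1.0\" 200 6433 \"-\" \"Googlebot/2.1\"";
        Matcher m = LOG_PATTERN.matcher(line);
        if (m.find()) {
            System.out.println("host    = " + m.group(1));  // 66.249.64.13
            System.out.println("time    = " + m.group(4));  // 18/Sep/2004:11:07:48 +1000
            System.out.println("request = " + m.group(5));  // GET / HTTP/1.0
            System.out.println("status  = " + m.group(6));  // 200
            System.out.println("bytes   = " + m.group(7));  // 6433
        }
    }
}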
Common Response Codes
• 200 - OK
• 206 - Partial Content
• 301 - Moved Permanently
• 302 - Found
• 304 - Not Modified
• 401 - Unauthorized (password required)
• 403 - Forbidden
• 404 - Not Found.
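The URL-counting job shown on the next slides can be retargeted at any field of the log. As a hedged illustration (not part of the startup code), a mapper that tallies response codes instead of URLs only needs a different pattern and key; the class name StatusCodeMap and its regex are assumptions:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative variant of Map.java: emit (status code, 1) instead of (URL, 1).
public class StatusCodeMap extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text status = new Text();
    // Assumes the status code is the first 3-digit token after the quoted request line.
    private final Pattern p = Pattern.compile("\" (\\d{3}) ");

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher matcher = p.matcher(value.toString());
        if (matcher.find()) {
            status.set(matcher.group(1));   // e.g. "200", "404"
            context.write(status, one);     // summed by the same Reducer shown later
        }
    }
}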
LogAnalyzer.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalyzer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    if (args.length != 2) {
      System.err.println("Usage: loganalyzer <in> <out>");
      System.exit(2);
    }
    // Configure the job: mapper, reducer, and output (key, value) types.
    Job job = new Job(conf, "analyze log");
    job.setJarByClass(LogAnalyzer.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths come from the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Map.java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text url = new Text();
  // Capture the URL that follows GET or POST in the request line.
  private Pattern p = Pattern.compile("(?:GET|POST)\\s([^\\s]+)");

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] entries = value.toString().split("\\r?\\n");
    for (int i = 0, len = entries.length; i < len; i += 1) {
      Matcher matcher = p.matcher(entries[i]);
      if (matcher.find()) {
        // Emit (URL, 1) for each matching log entry.
        url.set(matcher.group(1));
        context.write(url, one);
      }
    }
  }
}
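A quick way to see what this pattern extracts is to run it against one of the example log lines from the earlier slide. This standalone check (the class name MapRegexCheck is arbitrary) is not part of the assignment code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MapRegexCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?:GET|POST)\\s([^\\s]+)");
        String line = "66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "
                    + "\"GET / HTTP/1.0\" 200 6433 \"-\" \"Googlebot/2.1\"";
        Matcher m = p.matcher(line);
        if (m.find()) {
            // Prints the requested URL path, here "/".
            System.out.println(m.group(1));
        }
    }
}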
Reduce.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable total = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the counts for this URL and emit (URL, total).
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    total.set(sum);
    context.write(key, total);
  }
}
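Because this reducer only sums counts, it could also be registered as a combiner so that (URL, 1) pairs are pre-aggregated on each map node before the shuffle. This is an optional optimization that is not in the driver shown earlier; if you want it, add one line in LogAnalyzer.main():

// Optional: pre-aggregate map output locally before it is shuffled to the reducers.
job.setCombinerClass(Reduce.class);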
Comet Cluster
• Comet cluster has 1944 nodes; each node has 24 cores, built on two 12-core Intel Xeon E5-2680v3 2.5 GHz processors
• 128 GB memory and 320 GB SSD for local scratch space
• Attached storage: shared 7 petabytes of 200 GB/second performance storage and 6 petabytes of 100 GB/second durable storage
 Lustre Storage Area is a Parallel File System (PFS) called Data Oasis.
– Users can access it from /oasis/scratch/comet/$USER/temp_project
(Diagram: login node with local home storage and the attached /oasis file system.)
Hadoop installation at Comet
• Installed in /opt/hadoop/1.2.1
o Configure Hadoop on-demand with myHadoop:
 /opt/hadoop/contrib/myHadoop/bin/myhadoop-configure.sh
(Diagram: Hadoop connects the local storage of the allocated nodes; accessed from the login node.)
The Hadoop file system is built dynamically on the allocated nodes and is deleted when the allocation is terminated.
Compile the sample Java code at Comet
Java word count example is available at Comet under
/home/tyang/cs240sample/mapreduce/.
• cp -r /home/tyang/cs240sample/mapreduce .
• Allocate a dedicated machine for compiling
 /share/apps/compute/interactive/qsubi.bash -p compute --nodes=1 --ntasks-per-node=1 -t 00:
• Change work directory to mapreduce and type make
 Java code is compiled under target subdirectory
How to Run a WordCount MapReduce Job
 Use “compute” partition for allocation
 Use Java word count example at Comet under
/home/tyang/cs240sample/mapreduce/.
 sbatch submit-hadoop-comet.sh
– Data input is in test.txt
– Data output is in WC-out
 Job trace is wordcount.1569018.comet-17-14.out
(Diagram: jobs are submitted from the login node comet.sdsc.xsede.org to the "compute" queue of the Comet cluster.)
Sample script (submit-hadoop-comet.sh)
#!/bin/bash
#SBATCH --job-name="wordcount"
#SBATCH --output="wordcount.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH -t 00:15:00
export HADOOP_CONF_DIR=/home/$USER/cometcluster
export WORKDIR=`pwd`
module load hadoop/1.2.1
# Use myHadoop to build a Hadoop file system on the allocated nodes
myhadoop-configure.sh
# Start all Hadoop daemons
start-all.sh
Sample script (continued)
# Make an input directory in the Hadoop file system
hadoop dfs -mkdir input
# Copy data from the local Linux file system to the Hadoop file system
hadoop dfs -copyFromLocal $WORKDIR/test.txt input/
# Run the Hadoop wordcount job
hadoop jar $WORKDIR/wordcount.jar wordcount input/ output/
# Create a local directory WC-out to host the output data
# (rm does not report an error even if the directory does not exist)
rm -rf WC-out >/dev/null || true
mkdir -p WC-out
# Copy out the output data
hadoop dfs -copyToLocal output/part* WC-out
# Stop all daemons and clean up
stop-all.sh
myhadoop-cleanup.sh
Sample output trace
wordcount.1569018.comet-17-14.out
starting namenode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-namenode-comet-17-14.out
comet-17-14.ibnet: starting datanode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-datanode-comet-17-14.sdsc.edu.out
comet-17-15.ibnet: starting datanode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-datanode-comet-17-15.sdsc.edu.out
comet-17-14.ibnet: starting secondarynamenode, logging to /scratch/tyang/1569018/logs/hadoop-tyang-secondarynamenode-comet-17-14.sdsc.edu.out
starting jobtracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-jobtracker-comet-17-14.out
comet-17-14.ibnet: starting tasktracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-tasktracker-comet-17-14.sdsc.edu.out
comet-17-15.ibnet: starting tasktracker, logging to /scratch/tyang/1569018/logs/hadoop-tyang-tasktracker-comet-17-15.sdsc.edu.out
Sample output trace
wordcount.1569018.comet-17-14.out
16/01/31 17:43:44 INFO input.FileInputFormat: Total input paths to process : 1
16/01/31 17:43:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/01/31 17:43:44 WARN snappy.LoadSnappy: Snappy native library not loaded
16/01/31 17:43:44 INFO mapred.JobClient: Running job: job_201601311743_0001
16/01/31 17:43:45 INFO mapred.JobClient: map 0% reduce 0%
16/01/31 17:43:49 INFO mapred.JobClient: map 100% reduce 0%
16/01/31 17:43:56 INFO mapred.JobClient: map 100% reduce 33%
16/01/31 17:43:57 INFO mapred.JobClient: map 100% reduce 100%
16/01/31 17:43:57 INFO mapred.JobClient: Job complete: job_201601311743_0001
comet-17-14.ibnet: stopping tasktracker
comet-17-15.ibnet: stopping tasktracker
stopping namenode
comet-17-14.ibnet: stopping datanode
comet-17-15.ibnet: stopping datanode
comet-17-14.ibnet: stopping secondarynamenode
Copying Hadoop logs back to /home/tyang/cometcluster/logs...
`/scratch/tyang/1569018/logs' -> `/home/tyang/cometcluster/logs'
Sample input and output
$ cat test.txt
how are you today 3 4 mapreduce program
1 2 3 test send
how are you mapreduce
1 send test USA california new
$ cat WC-out/part-r-00000
1 2
2 1
3 2
4 1
USA 1
are 2
california 1
how 2
mapreduce 2
new 1
program 1
send 2
test 2
today 1
you 2
Shell Commands for Hadoop File System
• mkdir, ls, cat, cp
 hadoop dfs -mkdir /user/deepak/dir1
 hadoop dfs -ls /user/deepak
 hadoop dfs -cat /user/deepak/file.txt
 hadoop dfs -cp /user/deepak/dir1/abc.txt /user/deepak/dir2
• Copy data from the local file system to HDFS
 hadoop dfs -copyFromLocal <src:localFileSystem> <dest:Hdfs>
 Ex: hadoop dfs -copyFromLocal /home/hduser/def.txt /user/deepak/dir1
• Copy data from HDFS to the local file system
 hadoop dfs -copyToLocal <src:Hdfs> <dest:localFileSystem>
http://www.bigdataplanet.info/2013/10/All-Hadoop-Shell-Commands-you-need-Hadoop-Tutorial-Part-5.html
Notes
• The Java process listing command "jps" shows the following daemons:
NameNode (master), SecondaryNameNode, DataNode (hadoop), JobTracker, TaskTracker
• To check the status of your job
squeue -u username
• To cancel a submitted job
scancel job-id
• You have to request *all* 24 cores on the nodes. Hadoop is Java-based and any memory limits start causing problems. Also, in the compute partition you are charged for the whole node anyway.
Notes
• Your script should delete the output directory if you want to rerun the job and copy data out to that directory again; otherwise the Hadoop copy back fails because the file already exists. The current script forcibly removes "WC-out".
• If you are running several MapReduce jobs simultaneously, please make sure you choose different locations for the configuration files. Basically, change the line
export HADOOP_CONF_DIR=/home/$USER/cometcluster
to point to a different directory for each run. Otherwise the configurations from different jobs will overwrite each other in the same directory and cause problems.