Map/Reduce
Overview of solutions
Alexey Zlobin
alexey.zlobin@gmail.com
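Sample job: driver
public static void main(String[] a) throws Exception {
    Configuration conf = new Configuration();
    // The slide never created the job object; "wordcount" is an assumed job name.
    Job job = Job.getInstance(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Register the mapper (M) and reducer (R) classes from the next two slides.
    job.setMapperClass(M.class);
    job.setReducerClass(R.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // a[0] is the input path, a[1] is the output directory.
    FileInputFormat.addInputPath(job, new Path(a[0]));
    FileOutputFormat.setOutputPath(job, new Path(a[1]));
    job.waitForCompletion(true);
}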
Sample job: mapper
class M extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token of the input line.
    @Override
    public void map(LongWritable k, Text v, Context ctx)
            throws IOException, InterruptedException {
        String line = v.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            ctx.write(word, one);
        }
    }
}
Sample job: reducer
class R extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts that the mappers emitted for one word.
    @Override
    public void reduce(Text k, Iterable<IntWritable> v, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : v)
            sum += val.get();
        ctx.write(k, new IntWritable(sum));
    }
}
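The driver, mapper and reducer above leave one step to the framework: the shuffle that groups mapper output by key before the reducer sees it. As a rough illustration only, the plain-Python sketch below imitates the three phases in a single process. The names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are invented for this sketch and are not Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Mirror of M.map: emit (word, 1) for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key, keys sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Mirror of R.reduce: sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


if __name__ == "__main__":
    for word, count in run_job(["to be or not", "to be"]):
        print(word, count)
```

For the two input lines used here, `run_job` returns `[("be", 2), ("not", 1), ("or", 1), ("to", 2)]`. A real cluster runs many mappers and reducers in parallel, which this sketch does not attempt to model.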
Pig snippet
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, qry);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(qry);
clean2 = FOREACH clean1
    GENERATE user, time, org.apache.pig.tutorial.ToLower(qry) AS query;
Hive snippet
CREATE TABLE invites
(foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA LOCAL
INPATH './examples/files/kv2.txt' OVERWRITE
INTO TABLE invites PARTITION (ds='2008-08-15');
SELECT a.foo
FROM invites a
WHERE a.ds='2008-08-15';
INSERT OVERWRITE DIRECTORY '/tmp/reg_5'
SELECT a.foo, a.bar FROM invites a;
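The `PARTITIONED BY (ds STRING)` clause above means that Hive keeps each `ds` value in its own storage location, so a query filtered on `ds` only has to read the matching partition. The toy Python model below is a sketch of that idea, not of Hive's implementation. The class `PartitionedTable`, the sample rows, and the second partition `2008-08-08` are all invented for illustration and are not the contents of `kv2.txt`.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table whose rows are stored per partition value, keyed by ds."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def load(self, ds, rows, overwrite=False):
        """Add (foo, bar) rows to one partition; overwrite=True replaces it."""
        if overwrite:
            self.partitions[ds] = []
        self.partitions[ds].extend(rows)

    def select_foo(self, ds):
        """Return foo for one partition without scanning the others."""
        return [foo for foo, _bar in self.partitions.get(ds, [])]


invites = PartitionedTable()
# Hypothetical sample rows standing in for the loaded file.
invites.load("2008-08-15", [(1, "val_1"), (2, "val_2")], overwrite=True)
invites.load("2008-08-08", [(3, "val_3")], overwrite=True)
foo_values = invites.select_foo("2008-08-15")
```

Here `foo_values` holds only the `foo` column of the `2008-08-15` partition; the rows stored under the other `ds` value are never visited.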
Spark: example
val counts = lines.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
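The chain above reads as three steps: split every line into words, pair each word with 1, then merge the values of equal keys with `+`. A rough single-machine analogue in plain Python might look as follows. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this sketch, not Spark APIs, and the input `lines` is made-up sample data.

```python
from itertools import chain
from operator import add


def flat_map(func, items):
    """Apply func to each item and concatenate the resulting sequences."""
    return chain.from_iterable(func(item) for item in items)


def reduce_by_key(func, pairs):
    """Merge the values of each key with a two-argument function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["a b a", "b a"]  # made-up input
words = flat_map(lambda line: line.split(" "), lines)
pairs = ((word, 1) for word in words)
counts = reduce_by_key(add, pairs)
```

In Spark the merge function should be associative and commutative, because partial results from different partitions can be combined in any order; the sequential loop in this sketch hides that requirement.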
Shark example
CREATE TABLE src(key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt'
INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM SRC;
SELECT COUNT(1) FROM src_cached;
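Disco example
from disco.util import kvgroup

def fun_map(line, params):
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # kvgroup only groups adjacent equal keys, so the pairs are sorted first.
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)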
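The reduce function above sorts its input before grouping because `kvgroup` groups only consecutive pairs whose keys compare equal. To try the logic without a Disco installation, one could replace it with a small stand-in built on `itertools.groupby`. The name `kvgroup_sketch` is invented here, and the sketch assumes the same grouping behaviour as the library function.

```python
from itertools import groupby
from operator import itemgetter


def kvgroup_sketch(pairs):
    """Yield (key, values) for each run of consecutive pairs sharing a key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def fun_map(line, params):
    for word in line.split():
        yield word, 1


def fun_reduce(pairs, params):
    # Sorting puts equal words next to each other so each forms a single group.
    for word, counts in kvgroup_sketch(sorted(pairs)):
        yield word, sum(counts)


pairs = list(fun_map("b a b", None))
# Without sorting, the two "b" pairs are not adjacent and stay separate.
unsorted_groups = [(k, sum(vs)) for k, vs in kvgroup_sketch(pairs)]
result = list(fun_reduce(pairs, None))
```

With this input, `unsorted_groups` contains two separate entries for `"b"`, while `result` is `[("a", 1), ("b", 2)]`, which is why the sort step cannot be dropped.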
Disco driver
from disco.core import Job, result_iterator

job = Job().run(
    input=["http://discoproject.org/media/text/chekhov.txt"],
    map=fun_map,
    reduce=fun_reduce)
for word, count in result_iterator(job.wait(show=True)):
    print(word, count)
References I
● “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean, Sanjay Ghemawat
● “A Comparison of Join Algorithms for Log Processing in MapReduce”, S. Blanas, J. Patel, V. Ercegovac, J. Rao, E. Shekita, Y. Tian
● “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
● “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters”, Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
● “Shark: Fast Data Analysis Using Coarse-grained Distributed Memory”, Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica
References II
● Disco Technical Overview, http://disco.readthedocs.org/en/latest/overview.html
● Disco Distributed Filesystem, http://disco.readthedocs.org/en/latest/howto/ddfs.html
● An efficient, immutable, persistent mapping object, http://discodb.readthedocs.org/en/latest/

20140427 parallel programming_zlobin_lecture11
