Functional Programming and Big Data 
http://glennengstrand.info/analytics/fp 
What role will Functional 
Programming play in processing 
Big Data streams? 
Glenn Engstrand 
September 2014
Clojure News Feed 
http://glennengstrand.info/software/architecture/oss/clojure 
union 
intersection 
difference 
map 
reduce
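As a rough illustration of the five operations listed above, here is a minimal Scala sketch. The friend lists and the `FeedSets` object are hypothetical and are not taken from the news feed project; they only show what each operation returns on small in-memory sets.

```scala
// Hypothetical data: two made-up friend lists, not from the news feed project.
object FeedSets {
  val aliceFriends: Set[String] = Set("bob", "carol", "dave")
  val bobFriends: Set[String] = Set("alice", "carol", "erin")

  // union: everyone who is a friend of either participant
  val either: Set[String] = aliceFriends.union(bobFriends)

  // intersection: friends the two participants have in common
  val mutual: Set[String] = aliceFriends.intersect(bobFriends)

  // difference: friends of alice who are not friends of bob
  val onlyAlice: Set[String] = aliceFriends.diff(bobFriends)

  // map then reduce: total number of characters across alice's friend names
  // (toSeq first, so that equal lengths are not collapsed by Set semantics)
  val totalNameLength: Int = aliceFriends.toSeq.map(_.length).reduce(_ + _)
}
```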
OSCON 2014 
Big Data Pipeline and Analytics Platform Using NetflixOSS and 
Other Open Source Libraries 
http://www.oscon.com/oscon2014/public/schedule/detail/34159 
Data Workflows for Machine Learning 
http://www.oscon.com/oscon2014/public/schedule/detail/34913
netflix 
PigPen is map-reduce for Clojure, or distributed Clojure. It 
compiles to Apache Pig, but you don't need to know much 
about Pig to use it. 
https://github.com/Netflix/PigPen
query-like syntax 
(defn my-query
  [data]
  (->> data
       (pig/map my-map)
       (pig/filter (fn [x] (= (:action x) "post")))
       (pig/group-by :ts {:fold (fold/count)})
       (pig/store-tsv "/path/to/newsFeedPigOutput")))
clumsy process 
cd /path/to/git/clojure-news-feed/client/pigpenperf 
lein run 
# remove the :main from project.clj 
lein uberjar 
cp target/pigpenperf-0.1.0-SNAPSHOT-standalone.jar \
   ~/oss/hadoop/pig-0.12.1/pigpen.jar 
cd /path/to/oss/hadoop/pig-0.12.1 
bin/pig -x local -f /path/to/pigpenperf.pig
Cascalog and Cascading 
Cascalog: fully-featured data processing and 
querying library for Clojure or Java. 
http://cascalog.org/ 
Cascading is the proven application 
development platform for building data 
applications on Hadoop. 
http://www.cascading.org/
declarative and implicit 
(defn per-minute-post-action-counts
  "count of post operations grouped by time stamp"
  [input-directory output-directory]
  (let [data-point (metrics input-directory)
        output (hfs-delimited output-directory)]
    (c/?<- output
           [?ts ?cnt]
           (data-point ?year ?month ?day ?hour ?minute ?entity ?action ?count)
           (format-time-stamp ?year ?month ?day ?hour ?minute :> ?ts)
           (= ?action "post")
           (o/count :> ?cnt))))
idiomatic 
(defn parse-data-line
  "parses the kafka output into the corresponding fields"
  [line]
  ;; escape the pipe: a bare | is regex alternation and matches the empty string
  (s/split line #"\|"))

(defn metrics [dir]
  (let [source (c/hfs-textline dir)]
    (c/<- [?year ?month ?day ?hour ?minute ?entity ?action ?count]
          (source ?line)
          (parse-data-line ?line :> ?year ?month ?day ?hour ?minute ?entity ?action ?count)
          (:distinct false))))
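The Cascalog query leaves the grouping implicit. As a point of comparison, the following is a minimal plain-Scala sketch of the same idea with every step spelled out. It assumes pipe-delimited lines with the eight fields named by the Cascalog variables; the `Metric` case class, the `MetricParser` object, and the time stamp layout are hypothetical, since `format-time-stamp` is not shown in the slides.

```scala
import scala.util.Try

// Hypothetical record type; field names mirror the Cascalog variables.
final case class Metric(year: Int, month: Int, day: Int, hour: Int, minute: Int,
                        entity: String, action: String, count: Int)

object MetricParser {
  // Split on a literal '|' by using the Char overload of split.
  // A String argument would be treated as a regex, where a bare | is alternation.
  def parse(line: String): Option[Metric] =
    line.split('|') match {
      case Array(y, mo, d, h, mi, entity, action, count) =>
        Try(Metric(y.toInt, mo.toInt, d.toInt, h.toInt, mi.toInt,
                   entity, action, count.toInt)).toOption
      case _ => None // wrong number of fields
    }

  // Assumed layout: yyyy-MM-dd HH:mm (the original format-time-stamp is not shown).
  def timeStamp(m: Metric): String =
    f"${m.year}%04d-${m.month}%02d-${m.day}%02d ${m.hour}%02d:${m.minute}%02d"

  // Explicit version of the implicit grouping: keep "post" rows,
  // group them by minute, and count the rows in each group.
  def postCountsPerMinute(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(parse)
      .filter(_.action == "post")
      .groupBy(timeStamp)
      .map { case (ts, ms) => ts -> ms.size }
}
```

Malformed lines are dropped rather than failing the whole job, which is a design choice of this sketch and not something the slides specify.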
Scala compared to... 
Clojure: 
strongly typed 
more versatile 
less idiomatic 
no homoiconicity 
more mainstream 
Java 7: 
lambda expressions 
for comprehensions 
streams 
higher order functions 
http://www.scala-lang.org/ 
spark shell 
val t = sc.textFile("/path/to/newsFeedRawMetrics/perfpostgres.csv")
t.filter(line => line.contains("post"))
  .map(line => (line.split(",").slice(0, 5).mkString(","), 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("/tmp/postCount")
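To see what the pipeline computes without a Spark installation, here is a minimal sketch of the same steps on an ordinary Scala collection. The `PostCount` object is hypothetical, and `groupBy` followed by a per-key `reduce` stands in for Spark's `reduceByKey`; it is an approximation of the semantics, not of how Spark executes the job.

```scala
object PostCount {
  // Same filter and map steps as the Spark pipeline, on an in-memory Seq
  // instead of an RDD.
  def countPosts(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(line => line.contains("post"))
      // key = first five comma-separated fields, value = 1
      .map(line => (line.split(",").slice(0, 5).mkString(","), 1))
      // stand-in for reduceByKey(_ + _): group by key, then sum the ones
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(_ + _) }
}
```

Calling `PostCount.countPosts` with a handful of sample CSV lines returns one entry per distinct five-field prefix, with the number of matching lines as the value.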
map reduce 
fast 
compact 
interactive 
not as distributive 
limited reduce side 
good for counters 
not good for percentiles
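One plausible reading of the last two points is that counters can be merged from partial results while percentiles cannot. The sketch below illustrates that reading with two hypothetical partitions; the `Mergeability` object, the sample values, and the choice of the nearest-rank percentile definition are all assumptions made for the example.

```scala
object Mergeability {
  // Nearest-rank percentile over a complete, non-empty data set.
  def percentile(values: Seq[Int], p: Double): Int = {
    val sorted = values.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Hypothetical latency samples held by two different workers.
  val partitionA: Seq[Int] = Seq(1, 2, 3, 4, 5)
  val partitionB: Seq[Int] = Seq(100, 200)

  // Counters: each worker reports a partial count and the partials add up.
  val mergedCount: Int = partitionA.size + partitionB.size
  val globalCount: Int = (partitionA ++ partitionB).size

  // Percentiles: combining per-partition medians does not give the
  // median of the combined data.
  val medianOfMedians: Int =
    percentile(Seq(percentile(partitionA, 50), percentile(partitionB, 50)), 50)
  val globalMedian: Int = percentile(partitionA ++ partitionB, 50)
}
```

Here the merged count equals the global count, while the median of the two partition medians differs from the median of all seven values, so an exact percentile needs the full set of values (or a purpose-built approximation) on the reduce side.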
margin for error 
unfair basis for comparison 
local spark does not use hadoop 
single node mode
custom functions 
built-in functions are not as 
expressive as Hive 
can custom functions be as 
expressive as YARN? 
future blog 
Cascalog equivalent to News Feed 
Performance map reduce job.
spark streaming 
more popular than spark map reduce 
more real-time and reactive 
future blog 
compare with cascalog for reproducing news 
feed performance map reduce functionality 
Is it really distributed?


Three Functional Programming Technologies for Big Data
