Milan – July 13 2016
Introduction to
Distributed Computing Engines for
Data Processing
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti
What is a Distributed Computing System
It’s the solution to the problem where your
RAM is too small and your data are too big
and/or too CPU-intensive to be processed on
a single machine.
What is a Distributed Computing System
Solution: a huge, monolithic mainframe.
What is a Distributed Computing System
Solution: do your job on a cluster.
Distributed vs Parallel
Parallel: execute identical tasks (with different data
or parameters).
Parallel Distributed: do this on multiple machines.
Distributed: split a big task into smaller tasks and
execute them on multiple machines
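The split-then-combine idea above can be sketched in a few lines. This is only an illustration under stated assumptions: threads on one machine stand in for the machines of a cluster, and the names `square_sum`, `split` and `run_split_job` are hypothetical, not taken from any engine.

```python
from concurrent.futures import ThreadPoolExecutor


def square_sum(chunk):
    """The identical task that every worker runs on its own chunk of data."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split one big input into at most `parts` chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_split_job(data, workers=4):
    """Split the big task, run the small tasks concurrently, combine the partial results."""
    chunks = split(data, workers)
    # Assumption: a thread pool plays the role of the cluster's worker machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(square_sum, chunks))
    return sum(partials)
```

On a real cluster the chunks would live on different machines and the partial results would travel over the network, which is where most of the complexity comes from.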
What is a Distributed Computing System
Goal: programmers should be able to write their
programs easily and efficiently without caring
about distribution.
Issues: a cluster is complex and conceptually
very far from a local environment.
Hadoop
What: the first OSS distributed computing engine.
When: 2006 (work began).
Where: Yahoo!, building on ideas from Google.
Why: Google had a lot of data (for that time) to process. They built
a solution and described it in a series of papers. Those ideas were
then implemented as OSS.
How: HDFS (distributed file system), MapReduce (computational
abstraction), YARN (resource and cluster manager).
MapReduce - Example
Credits to @sergejusb
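The original slide showed a diagram here. As a rough stand-in, the sketch below runs the three MapReduce phases (map, shuffle, reduce) on a single machine. Word count is an assumption about a typical example, not necessarily the one in the credited diagram, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence, locally."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real engine each phase runs on many machines at once: mappers work on separate blocks of the input, the shuffle moves pairs across the network, and each reducer handles a subset of the keys.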
Hadoop Today
Common in many enterprise environments.
Still good enough for many batch processing use cases.
HDFS and YARN are widely used by other processing engines.
E.g.:
● Log analysis
● Clickstream analysis
● Text processing
Spark
What: a more generic distributed processing engine for batch and
streaming alike.
When: 2014 (1.0 Release).
Where: Berkeley + Databricks.
Why: They aimed for faster and more general processing with
better abstractions on top.
How: in-memory computing, DAG, RDD, polyglot functional API,
libraries out-of-the-box.
Resilient Distributed Dataset
RDDs hide the underlying distribution of data with a functional API
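To make the idea concrete, here is a toy sketch of a collection that keeps its data in partitions while exposing a functional API. `ToyRDD` is hypothetical and is not Spark's implementation. Its method names only resemble the real ones, and it leaves out the "resilient" part (recomputing lost partitions).

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitioned data behind a collection-like API."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Spread the elements of a local collection over several partitions."""
        data = list(data)
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return ToyRDD([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        """Keep only the elements that satisfy pred, partition by partition."""
        return ToyRDD([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        """Bring all partitions back into a single local list."""
        return [x for p in self.partitions for x in p]
```

The caller writes `ToyRDD.parallelize(range(6), 3).map(...).filter(...).collect()` and never touches a partition directly, which is the point of the abstraction.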
Directed Acyclic Graph
The graph is defined by the user. The runtime translates it to
operations on distributed data.
[Figure: example DAG whose nodes are create, filter, join, map and collect operations]
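A minimal sketch of the same idea: the user's calls only build a graph, and nothing is computed until `collect` asks the runtime to evaluate it. Everything here is hypothetical and runs locally. In particular, `join` is simplified to "keep the values present in both inputs", which is an assumption made for brevity.

```python
class Node:
    """One vertex of a hypothetical job graph: an operation plus its input nodes."""

    def __init__(self, op, parents=(), fn=None, data=None):
        self.op = op
        self.parents = tuple(parents)
        self.fn = fn
        self.data = data

    def filter(self, pred):
        return Node("filter", [self], fn=pred)

    def map(self, fn):
        return Node("map", [self], fn=fn)

    def join(self, other):
        return Node("join", [self, other])

    def collect(self):
        """The only call that triggers execution of the graph."""
        return evaluate(self)


def create(data):
    """Source node of the graph."""
    return Node("create", data=list(data))


def evaluate(node):
    """Toy runtime: walk the graph from the sources and run each operation."""
    if node.op == "create":
        return node.data
    inputs = [evaluate(parent) for parent in node.parents]
    if node.op == "filter":
        return [x for x in inputs[0] if node.fn(x)]
    if node.op == "map":
        return [node.fn(x) for x in inputs[0]]
    if node.op == "join":
        right = set(inputs[1])
        return [x for x in inputs[0] if x in right]
    raise ValueError(f"unknown operation: {node.op}")
```

Because the whole graph is known before anything runs, a real runtime can reorder or fuse operations and decide where each one executes.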
Spark Today
The hot topic everyone talks about.
Just entered the phase of maturity, with an already huge and
fast-growing ecosystem of libraries, integrations and tools.
Widely used as the go-to solution for Big Data (and not-so-Big) use
cases.
E.g.:
● Recommender systems
● Fraud detection
● Attack detection
● Near-real-time, decision-heavy solutions
Flink
What: a streaming-first (with batch on top), low-latency distributed
processing engine.
When: 2016 (1.0 Release).
Where: German Research Foundation + data Artisans.
Why: a faster and more flexible computational model that could
guarantee low latency, high throughput and fault tolerance all at the
same time.
How: streaming-first approach, checkpointing, lazy symbolic
computation, powerful optimizations.
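Checkpointing is the least self-explanatory item above, so here is a deliberately simplified sketch for a single stateful operator. It is not Flink's distributed snapshot algorithm. It assumes a replayable input (a list indexed by position), and all names (`CountingOperator`, `run`, `store`) are hypothetical.

```python
import copy


class CountingOperator:
    """Hypothetical stateful streaming operator that counts events per key."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # position in the stream of the next event to read

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1
        self.offset += 1

    def snapshot(self):
        """Capture state and input position together, so they stay consistent."""
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    def restore(self, checkpoint):
        self.counts = copy.deepcopy(checkpoint["counts"])
        self.offset = checkpoint["offset"]


def run(stream, operator, store, checkpoint_every=3, fail_at=None):
    """Process the stream from the operator's offset, checkpointing periodically.

    `store` is a plain list standing in for durable checkpoint storage.
    `fail_at` simulates a worker crash just before that position is read.
    """
    for position in range(operator.offset, len(stream)):
        if position == fail_at:
            raise RuntimeError("simulated worker failure")
        operator.process(stream[position])
        if operator.offset % checkpoint_every == 0:
            store.append(operator.snapshot())
```

After a failure, a fresh operator restores the latest checkpoint and replays the stream from the saved offset. Events processed after that checkpoint are counted again from a clean state, so each event contributes to the final counts exactly once.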
Flink Today
Perceived as an alternative to Spark. Gaining traction for specific
use cases (real-time streaming) but performs well on most generic
use cases.
Solid runtime and optimization; API and ecosystem still young.
Many big companies have already adopted it for fast-data applications.
E.g.:
● Real-time precise analytics (counting)
● Real-time model evaluation
● Online Learning solutions
Alternative solutions
● Apache Storm/Heron
● Apache Samza
● Apache Gearpump
● Apache Apex
Coming up next
"The Barclays Data Science Hackathon: Building
Retail Recommender Systems based on Customer
Shopping Behaviour" by Gianmario Spacagna, Senior Data
Scientist @ Pirelli
"Data intensive applications with Apache Flink" by
Simone Robutti, Machine Learning Engineer @ Radicalbit
