Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
Andy Petrella
noootsab@data-fellas.guru
@noootsab
Data Fellas
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
© Data Fellas SPRL 2016
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit
Lineup
So if you’re not sure you want to stay...
Pipeline
Productizing Data Science
Modelling → Coding → Deploying
Modelling:
● Finding Data
● Parsing structures
● Cleaning
● (Reducing)
● Learning
● Predicting
Coding:
● Connect PROD data
● Tuning training parameters
● Create Prediction Service
● Generate Deployable
Deploying:
● Connect to PROD infrastructure
● Integration with existing env
● Allocate (schedule) resources
● Ensure availability
Distributed Data Science
Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka HTTP
Create Client: JSON → HTML form, chart, table, ...
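The five demo steps above can be read as a chain of typed stages. The sketch below is only an illustration in plain Scala, not the code from the demo: it makes no Kafka, Cassandra, Spark ML or Akka HTTP calls, and every name in it (`PipelineSketch`, `Event`, `View`, `Model`, `ingest`, `prepareView`, `train`, `serve`) is a hypothetical stand-in chosen for this example.

```scala
import scala.util.Try

// Hypothetical stand-ins for the demo stages; no real Kafka, Cassandra, Spark or Akka is used.
object PipelineSketch {
  // A raw event as it might arrive from the source (field names are assumptions).
  final case class Event(key: String, value: Double)
  // A prepared view: per-key means, playing the role of the Cassandra table.
  final case class View(meanByKey: Map[String, Double])
  // A deliberately trivial "model": the mean seen for a key, else a global fallback.
  final case class Model(view: View, fallback: Double) {
    def predict(key: String): Double = view.meanByKey.getOrElse(key, fallback)
  }

  // Get data: Source -> queue (a Vector stands in for a Kafka topic).
  def ingest(lines: Seq[String]): Vector[Event] =
    lines.flatMap { line =>
      line.split(",") match {
        case Array(k, v) => Try(v.trim.toDouble).toOption.map(d => Event(k.trim, d))
        case _           => None // drop malformed records
      }
    }.toVector

  // Prepare view: queue -> store.
  def prepareView(events: Seq[Event]): View =
    View(events.groupBy(_.key).map { case (k, es) => k -> es.map(_.value).sum / es.size })

  // Train model: store -> model.
  def train(view: View): Model = {
    val means = view.meanByKey.values
    Model(view, if (means.isEmpty) 0.0 else means.sum / means.size)
  }

  // Create server: model -> JSON reply (a String stands in for an HTTP response).
  def serve(model: Model)(key: String): String =
    s"""{"key":"$key","prediction":${model.predict(key)}}"""

  def main(args: Array[String]): Unit = {
    val model = train(prepareView(ingest(Seq("a,1.0", "a,3.0", "b,4.0", "oops"))))
    println(serve(model)("a"))
  }
}
```

Because each stage only depends on the type produced by the previous one, any stage could in principle be swapped for a real connector without touching the others.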
Bad Pipeline
Targeting Dashboard
Modelling → Coding → Deploying → Dashboard
● Data Scientist focusing on the dashboard/report instead of content
● breaks reusability of data
● time wasted on learning viz instead of increasing accuracy (or velocity)
● monolithic instead of service oriented
Extended Pipeline
Micro Services
Modelling → Coding → Deploying → Creating Services → Integrating Application
Creating Services:
● Abstracts access to prepared views
● Exposes prediction capabilities
● Highly horizontally scalable
● Scaling a micro services cluster → cheaper than scaling a computing cluster
Integrating Application:
● Customer integration
● Can be any technology
● Can even be another pipeline!
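One way to picture "abstracts access to prepared views" and "exposes prediction capabilities" is a small interface that consumers program against. This is a sketch under assumptions, not the talk's implementation: `PredictionService`, `InMemoryService`, `report` and the linear weights are all invented for illustration.

```scala
object ServiceSketch {
  // What consumers see: prepared views and predictions, not the store or model behind them.
  trait PredictionService {
    def view(key: String): Option[Double]
    def predict(features: Seq[Double]): Double
  }

  // One possible backing: an in-memory view and a linear model (weights are made up).
  final class InMemoryService(views: Map[String, Double], weights: Seq[Double])
      extends PredictionService {
    def view(key: String): Option[Double] = views.get(key)
    def predict(features: Seq[Double]): Double =
      features.zip(weights).map { case (x, w) => x * w }.sum
  }

  // A consumer written against the interface only; it could be a web app or another pipeline.
  def report(service: PredictionService, key: String): String =
    service.view(key) match {
      case Some(v) => s"$key -> ${service.predict(Seq(v, 1.0))}"
      case None    => s"$key -> unknown"
    }
}
```

Since `report` never sees the backing store, the service can be replicated or replaced independently of the computing cluster that produced the view.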
Painful points
Data science is Discontiguous
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Collecting (Data Eng.) → Modelling (Scientist) → Coding (Data Eng.) → Deploying (Ops. Eng.) → Creating Services (Web Eng.) → Integrating Application (Customers)
Frictions:
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility
Painful points
Context Lost in Translation
(Diagram: Input Data, Data Lake, Processing, Machine Learning Model, Output Data, Application)
➔ No contextual discovery
➔ No quality info
➔ No lineage (origin of the data)
➔ Link to process and input discarded
➔ Huge gap in architecture: binary and schema aware serving layer
➔ Accuracy depends on concealed quality of inputs
➔ No schema! Hard and long integration, poor satisfaction
Moreover:
No backward links → no agility and no context awareness
Result: Lack of Reproducibility
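To make the missing pieces concrete (schema, quality info, lineage, backward links), here is a minimal sketch of metadata that could travel with each dataset or model output. It is an assumption-laden illustration, not the toolkit's data model: `DataInfo`, its fields, `origins` and `worstUpstreamQuality` are hypothetical names.

```scala
object LineageSketch {
  // Metadata kept next to a dataset or model output (all field names are assumptions).
  final case class DataInfo(
    name: String,
    schema: Map[String, String], // column -> type
    process: Option[String],     // what produced it, if anything
    inputs: List[DataInfo],      // backward links to its sources
    quality: Option[Double]      // e.g. share of valid rows, if measured
  )

  // Walk the backward links to list every original source of a dataset.
  def origins(d: DataInfo): List[String] =
    if (d.inputs.isEmpty) List(d.name)
    else d.inputs.flatMap(i => origins(i)).distinct

  // The weakest measured quality on the dataset or anywhere upstream; None if nothing was measured.
  def worstUpstreamQuality(d: DataInfo): Option[Double] = {
    val all = d.quality.toList ++ d.inputs.flatMap(i => worstUpstreamQuality(i).toList)
    if (all.isEmpty) None else Some(all.min)
  }
}
```

With such links in place, a consumer of a prediction could ask where it came from and how trustworthy its inputs were, which is the context the slide describes as lost.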
Data Fellas…
Agile Data Science Toolkit
Our Approach
Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices + Integrated End-to-End Environment = Huge gain in Time and Reliability
(Diagram: Notebook, Computing Cluster, Access Layer, Knowledge Base, Consumers, Customers)
Exposes database, learning models, stream sources, notebooks, ...
data type, process, lineage, usage
Easy to Release
Easy to (Re)Use
Notebook
Version Control (Git)
Spark Job Project (SBT)
Service Projects (SBT)
Metadata (Doc, Logic, Schema, ...)
Catalog (ElasticSearch)
Deployable (Jar, Docker)
Repository (Nexus, Docker Repo, Pypi, Gem Server)
Client Projects (Node.Js, Java, Scala, Python, Ruby)
Publishable (NPM, Jar, Pip/EasyInstall, Gem)
Roles: scientist, data engineer, ops engineer
Agile Data Science Toolkit
In a nutshell
Data Fellas…
Announcements!!!
O’Reilly
Online seminar
Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs
Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come to Strata
-- London at least
-- We have two talks :-)
