Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
Andy Petrella
noootsab@data-fellas.guru
@noootsab

Data Fellas
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning

© Data Fellas SPRL 2016
Lineup
So if you’re not sure you want to stay...
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit

Pipeline
Productizing Data Science
Modelling → Coding → Deploying
● Modelling
○ Finding Data
○ Parsing structures
○ Cleaning
○ (Reducing)
○ Learning
○ Predicting
● Coding
○ Connect PROD data
○ Tuning training parameters
○ Create Prediction Service
○ Generate Deployable
● Deploying
○ Connect to PROD infrastructure
○ Integration with existing env
○ Allocate (schedule) resources
○ Ensure availability
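
The modelling steps (parsing, cleaning, learning, predicting) can be sketched in a few lines of plain Scala. This is only an illustration under assumptions: the "x,y" line format, the names `Point`, `LinearModel`, `parse`, `clean` and `learn`, and the choice of least-squares regression are all hypothetical and not taken from the talk, which relies on Spark for this work.

```scala
import scala.util.Try

object ModellingSketch {
  // One observation: a feature x and a target y (hypothetical toy schema).
  final case class Point(x: Double, y: Double)

  // Parsing structures: turn a raw "x,y" line into a Point, or None if malformed.
  def parse(line: String): Option[Point] =
    line.split(",").map(_.trim) match {
      case Array(a, b) =>
        for {
          x <- Try(a.toDouble).toOption
          y <- Try(b.toDouble).toOption
        } yield Point(x, y)
      case _ => None
    }

  // Cleaning: drop malformed rows and rows with NaN or infinite values.
  def clean(lines: Seq[String]): Seq[Point] = {
    def finite(d: Double): Boolean = !d.isNaN && !d.isInfinite
    lines.flatMap(parse).filter(p => finite(p.x) && finite(p.y))
  }

  // Predicting: apply the fitted line y = slope * x + intercept.
  final case class LinearModel(slope: Double, intercept: Double) {
    def predict(x: Double): Double = slope * x + intercept
  }

  // Learning: ordinary least squares on the cleaned points.
  def learn(points: Seq[Point]): LinearModel = {
    require(points.size >= 2, "need at least two points")
    val n = points.size.toDouble
    val meanX = points.map(_.x).sum / n
    val meanY = points.map(_.y).sum / n
    val sxx = points.map(p => (p.x - meanX) * (p.x - meanX)).sum
    require(sxx > 0, "x values must not all be equal")
    val sxy = points.map(p => (p.x - meanX) * (p.y - meanY)).sum
    val slope = sxy / sxx
    LinearModel(slope, meanY - slope * meanX)
  }
}
```

A call such as `ModellingSketch.learn(ModellingSketch.clean(rawLines))` chains the steps; the Coding and Deploying columns are about turning that chain into a service connected to production data.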

Distributed Data Science
Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka Http
Create Client: Json → Html Form, Chart, table, ...
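
The hand-offs between the demo stages can be illustrated without any of the real infrastructure. The sketch below is not the demo code: it uses no Spark, Kafka, Cassandra or Akka HTTP calls, only in-memory stand-ins (a mutable queue for the Kafka topic, a `Map` for the Cassandra view, a function for the HTTP server). The `Reading` type, the per-sensor mean "model" and every function name are assumptions made for illustration.

```scala
import scala.collection.mutable

object DemoPipelineSketch {
  // Hypothetical event: a reading keyed by sensor id.
  final case class Reading(sensor: String, value: Double)

  // Get data: Source -> "topic" (a mutable queue stands in for a Kafka topic).
  def ingest(source: Seq[Reading], topic: mutable.Queue[Reading]): Unit =
    source.foreach(r => topic.enqueue(r))

  // Prepare view: "topic" -> "table" (a Map stands in for a Cassandra table).
  // The topic is drained so that each reading is consumed once.
  def prepareView(topic: mutable.Queue[Reading]): Map[String, Seq[Double]] = {
    val drained = topic.toList
    topic.clear()
    drained.groupBy(_.sensor).map { case (k, rs) => k -> rs.map(_.value) }
  }

  // Train model: view -> per-sensor mean, a deliberately trivial "model".
  def train(view: Map[String, Seq[Double]]): Map[String, Double] =
    view.map { case (k, vs) => k -> vs.sum / vs.size }

  // Create server: model -> request handler that returns a JSON string.
  def server(model: Map[String, Double]): String => String =
    sensor =>
      model.get(sensor) match {
        case Some(p) => s"""{"sensor":"$sensor","prediction":$p}"""
        case None    => s"""{"sensor":"$sensor","error":"unknown sensor"}"""
      }
}
```

Each stage only depends on the output of the previous one, which is what lets the real pipeline swap the stand-ins for Kafka, Cassandra and an HTTP endpoint; a client would then render the JSON as a form, chart or table.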

Bad Pipeline
Targeting Dashboard
Modelling → Coding → Deploying → Dashboard
A data scientist focusing on the dashboard/report instead of the content:
● breaks reusability of data
● wastes time learning visualization instead of increasing accuracy (or velocity)
● produces something monolithic instead of service oriented

Extended Pipeline
Micro Services
Modelling → Coding → Deploying → Creating Services → Integrating Application
● Creating Services
○ Abstracts access to prepared views
○ Exposes prediction capabilities
○ Highly horizontally scalable
○ Scaling a micro services cluster → cheaper than scaling a computing cluster
● Integrating Application
○ Customer integration
○ Can use any technology
○ Can even be another pipeline!
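
A minimal sketch of what such a service boundary could look like, in plain Scala. The traits `PreparedView` and `Model`, the class `PredictionService` and the in-memory stand-ins are hypothetical names chosen for illustration; the talk does not show its service code, and a real deployment would put an HTTP layer in front of this.

```scala
object PredictionServiceSketch {
  // Hypothetical prepared view: feature vectors keyed by an id.
  trait PreparedView {
    def features(id: String): Option[Vector[Double]]
  }

  // Hypothetical model interface: scores one feature vector.
  trait Model {
    def score(features: Vector[Double]): Double
  }

  // The service hides both the storage and the model behind a single call.
  final class PredictionService(view: PreparedView, model: Model) {
    def predict(id: String): Either[String, Double] =
      view.features(id).map(model.score).toRight(s"no features for id '$id'")
  }

  // In-memory stand-ins so the sketch runs without a database or a cluster.
  final class InMemoryView(rows: Map[String, Vector[Double]]) extends PreparedView {
    def features(id: String): Option[Vector[Double]] = rows.get(id)
  }

  final class WeightedSum(weights: Vector[Double]) extends Model {
    def score(features: Vector[Double]): Double =
      features.zip(weights).map { case (f, w) => f * w }.sum
  }
}
```

Because `PredictionService` holds no mutable state, many identical instances can run side by side, which is the property behind the horizontal scalability claim; the consumer only sees `predict` and can be written in any technology.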

Painful points
Data science is Discontiguous
Collecting (Data Eng.) → Modelling (Scientist) → Coding (Data Eng.) → Deploying (Ops. Eng.) → Creating Services (Web Eng.) → Integrating Application (Customers)
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Frictions
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility

Painful points
Context Lost in Translation
Diagram elements: Data Lake, Input Data, Processing, Machine Learning, Model, Output Data, Application
➔ No contextual discovery
➔ No quality info
➔ No lineage (origin of the data)
➔ Link to process and input discarded
➔ Huge gap in architecture: binary and schema aware serving layer
➔ Accuracy depends on concealed quality of inputs
➔ No schema! Hard and long integration, poor satisfaction
Moreover:
No backward links → no agility and no context awareness
Result: Lack of Reproducibility
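
One way to picture the missing context is as metadata that travels with every dataset: its schema, the process that produced it, and backward links to its inputs. The sketch below is an assumption about how such a record might be modelled, not the Data Fellas implementation; `Field`, `Process`, `Dataset` and `lineage` are hypothetical names.

```scala
object LineageSketch {
  // Hypothetical metadata attached to every dataset the pipeline produces.
  final case class Field(name: String, dataType: String)
  final case class Process(name: String, version: String)
  final case class Dataset(
    name: String,
    schema: Seq[Field],          // addresses "No schema!"
    producedBy: Option[Process], // link to the process; None for raw sources
    inputs: Seq[Dataset]         // backward links to the input datasets
  )

  // Walk the backward links depth first and list every upstream dataset once.
  def lineage(ds: Dataset): Seq[String] =
    ds.inputs.flatMap(in => in.name +: lineage(in)).distinct
}
```

With records like these, a consumer of an output dataset can ask where it came from and which process version produced it, which is the kind of backward link the slide says is discarded.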

Data Fellas…
Agile Data Science Toolkit

Our Approach
Automatic Semantics Engine + Autogenerated Microservices + Integrated End-to-End Environment = Huge gain in Time and Reliability
Notebook
Computing Cluster
Access Layer
Knowledge Base
Consumers / Customers
Exposes database, learning models, stream sources, notebooks, ...
data type, process, lineage, usage
Easy to Release
Easy to (Re)Use
Notebook
Version Control (Git)
Spark Job Project (SBT)
Service Projects (SBT)
Metadata (Doc, Logic, Schema, ...)
Catalog (ElasticSearch)
Deployable (Jar, Docker)
Repository (Nexus, Docker Repo, Pypi, Gem Server)
Client Projects (Node.Js, Java, Scala, Python, Ruby)
Publishable (NPM, Jar, Pip/EasyInstall, Gem)
Roles: Scientist, Data Engineer, Ops Engineer

In a nutshell

Data Fellas…
Announcements!!!

O’Reilly
Online seminar

Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs

Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come to Strata
-- London at least
-- We have two talks :-)
