Dr. David Talby
CTO, Pacific AI
BUILD YOUR OWN OPEN SOURCE
DATA SCIENCE PLATFORM
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
AT THE BEGINNING, THERE WAS SEARCH
Integrate Data
ETL
Streaming
Quality
Enrichment
Dataflows
Data Analyst Data Scientist
SCOPE
Discover & Visualize
SQL
Search
Visualization
Dashboards
Real-Time Alerts
Train Models
ML, DL, DM, NLP, …
Explore & Visualize
Train & Optimize
Collaboration
Workflows
Productize Models
Deploy APIs
Publish APIs
CI & CD for Models
Measurement
Feedback
App Developer Data Engineer
Infrastructure
Deployment
Orchestration
Security
Monitoring
Single Sign-On
Backup
Scaling
GOALS
Enterprise Grade
Scales from GB to PB
Unified & Modular
Cutting Edge
CONSTRAINTS
No Commercial Software
No Copyleft
No SaaS
Build It
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
APACHE NIFI
NIFI FEATURES
Web-based dataflow user interface
Seamless experience between design, control, feedback, and monitoring
Highly configurable
Loss tolerant vs guaranteed delivery
Low latency vs high throughput
Dynamic prioritization
Flow can be modified at runtime
Back pressure
Data Provenance
Track dataflow from beginning to end
Designed for extension
Build your own processors and more (120+ available out-of-the-box)
Enables rapid development and effective testing
Secure
SSL, SSH, HTTPS, encrypted content, etc.
Multi-tenant authorization and internal authorization/policy management
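The back pressure item above can be illustrated with a small, library-free sketch. This is not NiFi's API: `Connection`, `threshold`, `offer`, and `run_upstream` are hypothetical names chosen for illustration. The idea, under the assumption of a simple object-count threshold, is that a queue between two processing steps refuses new items once it is full, so the upstream step holds its data and retries later instead of overwhelming the downstream step.

```python
from collections import deque


class Connection:
    """Toy queue between two processing steps with an object-count threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # hypothetical back pressure limit
        self.queue = deque()

    def is_full(self) -> bool:
        return len(self.queue) >= self.threshold

    def offer(self, item) -> bool:
        """Accept the item unless back pressure applies; the caller retries later."""
        if self.is_full():
            return False
        self.queue.append(item)
        return True

    def poll(self):
        """Hand the oldest queued item to the downstream step, or None if empty."""
        return self.queue.popleft() if self.queue else None


def run_upstream(source, conn: Connection) -> list:
    """Push items in order until back pressure applies; return what is still unsent."""
    pending = deque(source)
    while pending and conn.offer(pending[0]):
        pending.popleft()
    return list(pending)
```

Nothing is dropped in this sketch: unsent items stay with the caller, which corresponds to the guaranteed-delivery end of the trade-off listed above. A loss-tolerant variant would discard them instead.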
APACHE SPARK
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
SPARK SQL FEATURES
Distributed SQL Engine
Seamless integration with Spark DataFrames
Standards Compliant
ANSI SQL 2003 support
All 99 queries of TPC-DS supported as of Spark 2.0
High performance
Cost-based optimization added to the “Catalyst” optimizer in Spark 2.2
Project Tungsten: “Joining a Billion Rows per Second on a Laptop”
2.5x performance gains between 1.6 and 2.0
Accessible & Extensible
Python, R, Scala, Java, Hive direct APIs + UDF support
KIBANA
TIMELION
KIBANA FEATURES
Full-text and faceted search
Full text query language: Boolean operators, proximity, boosting
Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
Time series analysis: aggregates, windowing, offsetting, trending, comparisons
Geospatial search: Search by shape, bounding box, polygon, by distance or range
Visualizations & Dashboards
All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
Drag & drop creation and editing
Organize visualizations into dashboards
Dashboards can be dynamically filtered by time, queries, filters
Publish, embed and share dashboards
Real-time updates
Performant
Fast interactive queries, faceting and filtering
REST API and clients in all major languages
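The search features listed above are served by Elasticsearch, which Kibana sits on top of. As a rough sketch, the following builds a request body in the Elasticsearch query DSL that combines a full-text query (Boolean operator, proximity, boosting), a date-range filter, a facet-style aggregation, sorting, and pagination. The helper name `build_search_body` and the field names `message`, `@timestamp`, and `status` are assumptions for illustration; a real index would use its own fields.

```python
import json


def build_search_body(text_query: str, since: str, page: int = 0, page_size: int = 10) -> dict:
    """Build a search request body; field names here are hypothetical."""
    return {
        "query": {
            "bool": {
                # Full-text part: query_string accepts Lucene syntax
                # (AND/OR, "phrase"~N proximity, term^N boosting).
                "must": [
                    {"query_string": {"query": text_query, "default_field": "message"}}
                ],
                # Filter part: restrict to a date range.
                "filter": [
                    {"range": {"@timestamp": {"gte": since}}}
                ],
            }
        },
        # Facet counts per value of the (assumed) "status" field.
        "aggs": {"by_status": {"terms": {"field": "status"}}},
        "sort": [{"@timestamp": "desc"}],
        # Pagination.
        "from": page * page_size,
        "size": page_size,
    }


if __name__ == "__main__":
    body = build_search_body('"disk full"~3 AND error^2', "now-1d")
    print(json.dumps(body, indent=2))
```

The printed JSON is what a client would send to an index's search endpoint; the sketch stops short of making the HTTP call so that it runs without a cluster.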
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
JUPYTERLAB
JUPYTERHUB
ANACONDA
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
Infrastructure
OPENSCORING
MLEAP
KONG API GATEWAY
API Gateway on nginx
Scalable
Modular with plugins
Authentication
Basic Auth, OpenID, OAuth, HMAC, LDAP, JWT
Security
ACL, CORS, IP Restriction, Bot Detection, SSL
Traffic Control
Proxy Caching, Rate Limiting, Size Limits, Terminations
Logging & Analytics
Galileo, Datadog, Runscope, TCP, HTTP, File, Syslog, StatsD
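To make the rate limiting item under Traffic Control concrete, here is a minimal sketch of a fixed-window limiter. In Kong this behavior is enabled through plugin configuration rather than written by hand, and this sketch is not Kong's implementation. The class `FixedWindowLimiter` and its parameters are hypothetical, and the current time is passed in explicitly so the example is deterministic.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per consumer in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (consumer, window index) -> requests seen so far.
        # Old windows are never purged here; a real limiter would expire them.
        self.counts = {}

    def allow(self, consumer: str, now: float) -> bool:
        """Return True and count the request if the consumer is under its limit."""
        window = int(now // self.window_seconds)
        key = (consumer, window)
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False
        self.counts[key] = count + 1
        return True
```

A gateway would call something like `allow` once per incoming request and answer with an HTTP 429 status when it returns False. Fixed windows are simple but permit short bursts at window boundaries; sliding-window or token-bucket schemes smooth that out at the cost of more state.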
COLLABORATION, CI & CD
Plan
Projects, Boards, Issues, Milestones, Teams
Create
Merge, Preview, Commit, Branch, Lock, Discuss
Verify
Automated pipelines, graphs, history, scaling
Package
Built-in container registry
Release
Continuous integration & continuous deployment
Configure & Monitor
Infrastructure
Integrate Data
Data Analyst Data Scientist
SCOPE
Discover & Visualize Train Models Productize Models
App Developer Data Engineer
KUBERNETES
Portable Containers
Public, Private, Hybrid, or Multi-Cloud
Deployment
Automation, Co-Location, Storage Mounting, Secrets
Auto-*
-Scaling, -Healing, -Restart, -Placement, -Replication
Rolling Updates
Load Balancing
Service Discovery
Monitoring Resources
Accessing & Ingesting Logs
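Several of the Kubernetes items above (replication, rolling updates) are expressed declaratively in a Deployment manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper `deployment_manifest`, the name `model-api`, the image `registry.example.com/model-api:1.0`, and the port are all placeholder assumptions, not part of the original deck.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int, port: int) -> dict:
    """Return an apps/v1 Deployment with a rolling update strategy."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Auto-replication: the controller keeps this many pods running.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Rolling update: add one new pod at a time, never drop below `replicas`.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = deployment_manifest(
        "model-api", "registry.example.com/model-api:1.0", replicas=3, port=8080
    )
    print(json.dumps(manifest, indent=2))
```

Load balancing and service discovery would come from a separate Service object selecting the same `app` label, which this sketch leaves out.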
PROMETHEUS & GRAFANA
KEYCLOAK
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
The Big Picture
• This is a complex, major enterprise platform
• It’s far from free: Cost is in integration, training & ops
• Why open source?
1. Often, outright better technology
2. Faster innovation
3. More native integrations
4. More books, talks, tutorials, posts & answers
5. Cheaper, both to begin and to scale
Common Questions
Q: Do I need it all on Day One?
A: No. Use what you need, know where it fits later.
Q: What if I already have another tool in place?
A: Keep it. Architecture is about incremental evolution.
Q: What if I don’t have the in-house knowledge?
A: Outsource, but require training & onboarding.
Q: What often gets overlooked?
A: Keeping components continuously up to date.
Summary: If you remember one thing…
Build the simplest platform that serves everyone required to turn science into $$$
Data Analyst Data Scientist App Developer Data Engineer
david@pacific.ai
@davidtalby
in/davidtalby
THANK YOU!


Editor's Notes

  • #4: You need a platform if you’re building a set of solutions in the data science space.
  • #6: We’re going open source to get a better solution, not just to save money.
  • #7: The constraints are not meant to say that other types of solutions aren’t worth the money; they often are. But starting with these constraints gives you a baseline of expectations.