Dr. David Talby
CTO, Pacific AI
BUILD YOUR OWN OPEN SOURCE
DATA SCIENCE PLATFORM
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
AT THE BEGINNING, THERE WAS SEARCH
Integrate Data
ETL
Streaming
Quality
Enrichment
Dataflows
Data Analyst, Data Scientist
SCOPE
Discover & Visualize
SQL
Search
Visualization
Dashboards
Real-Time Alerts
Train Models
ML, DL, DM, NLP, …
Explore & Visualize
Train & Optimize
Collaboration
Workflows
Productize Models
Deploy APIs
Publish APIs
CI & CD for Models
Measurement
Feedback
App Developer, Data Engineer
Infrastructure
Deployment, Orchestration, Security, Monitoring, Single Sign-On, Backup, Scaling
GOALS
Enterprise Grade
Scales from GB to PB
Unified & Modular
Cutting Edge
CONSTRAINTS
No Commercial Software
No Copyleft
No SaaS
Built It
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
Integrate Data
Data Analyst, Data Scientist
SCOPE
Discover & Visualize, Train Models, Productize Models
App Developer, Data Engineer
Infrastructure
APACHE NIFI
NIFI FEATURES
Web-based dataflow user interface
Seamless experience between design, control, feedback, and monitoring
Highly configurable
Loss tolerant vs guaranteed delivery
Low latency vs high throughput
Dynamic prioritization
Flow can be modified at runtime
Back pressure
Data Provenance
Track dataflow from beginning to end
Designed for extension
Build your own processors and more (120+ available out-of-the-box)
Enables rapid development and effective testing
Secure
SSL, SSH, HTTPS, encrypted content, etc.
Multi-tenant authorization and internal authorization/policy management
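
The "back pressure" item above can be illustrated outside NiFi with a bounded queue. This is a minimal sketch in plain Python, not NiFi code: `run_flow` and `max_queued` are hypothetical names, and the queue size stands in for a per-connection back pressure threshold. When the queue is full, the upstream producer blocks until the downstream consumer catches up.

```python
import queue
import threading


def run_flow(items, max_queued=2):
    """Push items through a bounded connection; return (delivered, peak_depth)."""
    # Bounded queue: a full queue applies back pressure to the producer.
    connection = queue.Queue(maxsize=max_queued)
    delivered = []
    peak_depth = 0
    done = object()  # sentinel marking the end of the stream

    def consumer():
        while True:
            item = connection.get()
            if item is done:
                break
            delivered.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        connection.put(item)  # blocks while the queue is full
        peak_depth = max(peak_depth, connection.qsize())
    connection.put(done)
    worker.join()
    return delivered, peak_depth
```

The observed queue depth never exceeds `max_queued`, which is the point of the mechanism: a slow consumer slows the producer down instead of exhausting memory.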
APACHE SPARK
Integrate Data
Data Analyst, Data Scientist
SCOPE
Discover & Visualize, Train Models, Productize Models
App Developer, Data Engineer
Infrastructure
SPARK SQL FEATURES
Distributed SQL Engine
Seamless integration with Spark DataFrames
Standards Compliant
ANSI SQL 2003 support
All 99 queries of TPC-DS supported as of Spark 2.0
High performance
Cost-based optimization added to the “Catalyst” optimizer in Spark 2.2
Project Tungsten: “Joining a Billion Rows per Second on a Laptop”
2.5x performance gains between Spark 1.6 and 2.0
Accessible & Extensible
Direct APIs for Python, R, Scala, Java, and Hive, plus UDF support
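
The DataFrame integration and UDF support listed above can be sketched in PySpark as follows. This is a hedged illustration, not part of the original deck: the `orders` view, its columns, and the `normalize_country` function are hypothetical, and `run_demo` assumes `pyspark` is installed and a local Spark session can start.

```python
def normalize_country(code):
    """Plain Python function that is later registered as a SQL UDF."""
    if code is None:
        return None
    return code.strip().upper()


def run_demo():
    """Query a DataFrame with SQL that calls the UDF (requires pyspark)."""
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder.master("local[1]").appName("udf-sketch").getOrCreate()
    )
    try:
        # Hypothetical sample data with inconsistent country codes.
        rows = [(1, " us"), (2, "US "), (3, "de")]
        df = spark.createDataFrame(rows, ["order_id", "country"])
        df.createOrReplaceTempView("orders")  # expose the DataFrame to SQL
        spark.udf.register("normalize_country", normalize_country, StringType())
        result = spark.sql(
            "SELECT normalize_country(country) AS country, COUNT(*) AS n "
            "FROM orders GROUP BY normalize_country(country) ORDER BY country"
        )
        return [(row["country"], row["n"]) for row in result.collect()]
    finally:
        spark.stop()
```

Calling `run_demo()` runs the query on a single local core. The same view can be queried from the DataFrame API, which is what "seamless integration" refers to.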
KIBANA
TIMELION
KIBANA FEATURES
Full-text and faceted search
Full text query language: Boolean operators, proximity, boosting
Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
Time series analysis: aggregates, windowing, offsetting, trending, comparisons
Geospatial search: Search by shape, bounding box, polygon, by distance or range
Visualizations & Dashboards
All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
Drag & drop creation and editing
Organize visualizations into dashboards
Dashboards can be dynamically filtered by time, queries, filters
Publish, embed and share dashboards
Real-time updates
Performant
Fast interactive queries, faceting and filtering
REST API and clients in all major languages
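
The search features above map onto a request body sent to the search REST API. The sketch below builds such a body in the Elasticsearch query DSL as a plain Python dict. It is an illustration under assumptions: `build_search` is a hypothetical helper, and the field names (`message`, `status`, `@timestamp`) are placeholders for whatever the index actually contains.

```python
import json


def build_search(text, field, since, facet_field, page=0, page_size=10):
    """Return an Elasticsearch-style search body as a plain dict."""
    return {
        "from": page * page_size,  # pagination offset
        "size": page_size,         # pagination limit
        "query": {
            "bool": {
                # Full-text match on one field.
                "must": [{"match": {field: text}}],
                # Date range filter; does not affect relevance scoring.
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        # Facet: count of matching documents per distinct value.
        "aggs": {"facet": {"terms": {"field": facet_field}}},
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


body = build_search("timeout", "message", "now-1d", "status")
print(json.dumps(body, indent=2))
```

The printed JSON is what a client would post to an index's `_search` endpoint. A dashboard issues queries of this shape on the user's behalf as filters and time ranges change.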
Integrate Data
Data Analyst, Data Scientist
SCOPE
Discover & Visualize, Train Models, Productize Models
App Developer, Data Engineer
Infrastructure
JUPYTERLAB
JUPYTERHUB
ANACONDA
Integrate Data
Data Analyst, Data Scientist
SCOPE
Discover & Visualize, Train Models, Productize Models
App Developer, Data Engineer
Infrastructure
OPENSCORING
MLEAP
KONG API GATEWAY
API Gateway on nginx
Scalable
Modular with plugins
Authentication
Basic Auth, OpenID, OAuth, HMAC, LDAP, JWT
Security
ACL, CORS, IP Restriction, Bot Detection, SSL
Traffic Control
Proxy Caching, Rate Limiting, Request Size Limiting, Request Termination
Logging & Analytics
Galileo, Datadog, Runscope, TCP, HTTP, File, Syslog, StatsD
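
To make the rate limiting item under Traffic Control concrete, here is a minimal fixed-window limiter in plain Python. It is a sketch of the general technique, not Kong's implementation: `FixedWindowLimiter` and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per consumer in each `window` seconds."""

    def __init__(self, limit, window=60):
        self.limit = limit
        self.window = window
        self.counts = {}  # (consumer, window index) -> requests seen so far

    def allow(self, consumer, now):
        """Return True if the request at time `now` (seconds) is within the limit."""
        key = (consumer, int(now // self.window))
        seen = self.counts.get(key, 0)
        if seen >= self.limit:
            return False  # a gateway would answer HTTP 429 here
        self.counts[key] = seen + 1
        return True
```

Counters are kept per consumer, so one noisy client does not use up another client's quota. A production limiter would also expire old windows and share counters across gateway nodes, which this sketch omits.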
COLLABORATION, CI & CD
Plan
Projects, Boards, Issues, Milestones, Teams
Create
Merge, Preview, Commit, Branch, Lock, Discuss
Verify
Automated pipelines, graphs, history, scaling
Package
Built-in container registry
Release
Continuous integration & continuous deployment
Configure & Monitor
Infrastructure
Integrate Data
Data Analyst, Data Scientist
SCOPE
Discover & Visualize, Train Models, Productize Models
App Developer, Data Engineer
KUBERNETES
Portable Containers
Public, Private, Hybrid, or Multi-Cloud
Deployment
Automation, Co-Location, Storage Mounting, Secrets
Auto-*
-Scaling, -Healing, -Restart, -Placement, -Replication
Rolling Updates
Load Balancing
Service Discovery
Monitoring Resources
Accessing & Ingesting Logs
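
Replication and rolling updates from the list above are declared in a Deployment manifest. The sketch below generates one as JSON from Python. It is illustrative only: the `deployment` helper, the `scoring-api` name, and the image URL are hypothetical, while the field names follow the Kubernetes `apps/v1` Deployment schema.

```python
import json


def deployment(name, image, replicas=3, port=8080):
    """Return an apps/v1 Deployment manifest as a dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # desired number of identical pods
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Replace pods one at a time without dropping below `replicas`.
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


manifest = deployment("scoring-api", "registry.example.com/scoring-api:1.0")
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed output can be saved to a file and applied with `kubectl apply -f`. Changing the image tag and re-applying triggers the rolling update.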
PROMETHEUS & GRAFANA
KEYCLOAK
LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together
The Big Picture
• This is a complex, major enterprise platform
• It’s far from free: Cost is in integration, training & ops
• Why open source?
1. Often, outright better technology
2. Faster innovation
3. More native integrations
4. More books, talks, tutorials, posts & answers
5. Cheaper, both to begin and to scale
Common Questions
Q: Do I need it all on Day One?
A: No. Use what you need, know where it fits later.
Q: What if I already have another tool in place?
A: Keep it. Architecture is about incremental evolution.
Q: What if I don’t have the in-house knowledge?
A: Outsource, but require training & onboarding.
Q: What often gets overlooked?
A: Keeping components continuously up to date.
Summary: If you remember one thing…
Build the simplest platform that serves
everyone required to turn science into $$$
Data Analyst, Data Scientist, App Developer, Data Engineer
david@pacific.ai
@davidtalby
in/davidtalby
THANK YOU!

Editor's Notes

  • #4: You need a platform if you’re building a set of solutions in the data science space.
  • #6: We’re going open source to get a better solution, not just to save money.
  • #7: The constraints are not meant to say that other types of solutions are not worth the money; they often are. Starting with these constraints gives you a baseline of expectations.