Build your open source data science platform

Dr. David Talby
CTO, Pacific AI
BUILD YOUR OWN OPEN SOURCE DATA SCIENCE PLATFORM

LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together

AT THE BEGINNING, THERE WAS SEARCH

Integrate Data
ETL
Streaming
Quality
Enrichment
Dataflows
Data Analyst, Data Scientist
SCOPE
Discover & Visualize
SQL
Search
Visualization
Dashboards
Real-Time Alerts
Train Models
ML, DL, DM, NLP, …
Explore & Visualize
Train & Optimize
Collaboration
Workflows
Productize Models
Deploy APIs
Publish APIs
CI & CD for Models
Measurement
Feedback
App Developer, Data Engineer
Infrastructure
Deployment, Orchestration, Security, Monitoring, Single Sign-On, Backup, Scaling

GOALS
Enterprise Grade
Scales from GB to PB
Unified & Modular
Cutting Edge

CONSTRAINTS
No Commercial Software
No Copyleft
No SaaS
Build It

LET’S BUILD A PLATFORM
1. Ground Rules
2. Components
3. Putting It All Together

Integrate Data
SCOPE
Discover & Visualize, Train Models, Productize Models
Infrastructure

NIFI FEATURES
Web-based dataflow user interface
Seamless experience between design, control, feedback, and monitoring
Highly configurable
Loss tolerant vs guaranteed delivery
Low latency vs high throughput
Dynamic prioritization
Flow can be modified at runtime
Back pressure
Data Provenance
Track dataflow from beginning to end
Designed for extension
Build your own processors and more (120+ available out-of-the-box)
Enables rapid development and effective testing
Secure
SSL, SSH, HTTPS, encrypted content, etc...
Multi-tenant authorization and internal authorization/policy management
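
The back pressure item above can be illustrated with a bounded queue between two processing steps: once the queue reaches a threshold, the upstream step is told to pause until the downstream step catches up. This is a minimal standard-library sketch of the idea, not NiFi code; the `Connection` class, its `object_threshold` parameter, and the flowfile names are hypothetical and chosen only for illustration.

```python
from collections import deque


class Connection:
    """Hypothetical bounded queue between two processors (not a NiFi API)."""

    def __init__(self, object_threshold):
        # Assumed knob: maximum number of queued items before back pressure.
        self.object_threshold = object_threshold
        self._queue = deque()

    def back_pressure_engaged(self):
        return len(self._queue) >= self.object_threshold

    def offer(self, flowfile):
        """Queue the item, or return False so the upstream step pauses."""
        if self.back_pressure_engaged():
            return False
        self._queue.append(flowfile)
        return True

    def poll(self):
        """Hand the oldest queued item to the downstream step, if any."""
        return self._queue.popleft() if self._queue else None


def run_demo():
    conn = Connection(object_threshold=3)
    # The upstream step tries to push five items into a queue that holds three.
    accepted = [conn.offer(f"flowfile-{i}") for i in range(5)]
    # The downstream step consumes one item, which frees a slot.
    drained = conn.poll()
    accepted_after_drain = conn.offer("flowfile-5")
    return accepted, drained, accepted_after_drain
```

In this sketch the first three offers succeed, the next two are refused, and an offer succeeds again once the downstream step has drained an item.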

SPARK SQL FEATURES
Distributed SQL Engine
Seamless integration with Spark DataFrames
Standards Compliant
ANSI SQL 2003 support
All 99 queries of TPC-DS supported as of Spark 2.0
High performance
Cost-based optimization added to the “Catalyst” optimizer in Spark 2.2
Project Tungsten: “Joining a Billion Rows per Second on a Laptop”
2.5x performance gains between 1.6 and 2.0
Accessible & Extensible
Python, R, Scala, Java, Hive direct APIs + UDF support

KIBANA FEATURES
Full-text and faceted search
Full text query language: Boolean operators, proximity, boosting
Faceted search: Filter by field, value ranges, date ranges, sort, limit, pagination
Time series analysis: aggregates, windowing, offsetting, trending, comparisons
Geospatial search: Search by shape, bounding box, polygon, by distance or range
Visualizations & Dashboards
All the basics: Area, pie, bar, heatmap, table, metric, map, scatter, timeline, tile
Drag & drop creation and editing
Organize visualizations into dashboards
Dashboards can be dynamically filtered by time, queries, filters
Publish, embed and share dashboards
Real-time updates
Performant
Fast interactive queries, faceting and filtering
REST API and clients in all major languages
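
Kibana sends its queries to Elasticsearch, so several of the search features above (Boolean operators, proximity, boosting, date ranges, facets, sorting, pagination) can be sketched as a single Elasticsearch request body. The example below only builds the JSON body and does not contact a cluster. The function name `build_search_body`, the field names `message` and `service`, and the sample query are assumptions for illustration; `@timestamp` is assumed to be the date field.

```python
import json


def build_search_body(text, field, start, end, facet_field, page=0, page_size=10):
    """Build a request body combining full-text search, a date filter and a facet."""
    return {
        "query": {
            "bool": {
                # Full-text part: query_string accepts AND/OR, "phrase"~N, term^boost.
                "must": [{"query_string": {"query": text, "default_field": field}}],
                # Date-range filter (assumes an @timestamp date field).
                "filter": [{"range": {"@timestamp": {"gte": start, "lte": end}}}],
            }
        },
        # Facet: count of matching documents per value of facet_field.
        "aggs": {"by_facet": {"terms": {"field": facet_field, "size": 10}}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        # Pagination.
        "from": page * page_size,
        "size": page_size,
    }


body = build_search_body(
    text='(error OR timeout) AND "model server"~3 AND critical^2',
    field="message",        # hypothetical text field
    start="now-7d/d",
    end="now",
    facet_field="service",  # hypothetical keyword field
    page=2,
)
body_json = json.dumps(body, indent=2)
```

The resulting `body_json` could be posted to an index's `_search` endpoint with any HTTP client; the index name and cluster address depend on the deployment.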

KONG API GATEWAY
API Gateway on nginx
Scalable
Modular with plugins
Authentication
Basic Auth, OpenID, OAuth, HMAC, LDAP, JWT
Security
ACL, CORS, IP Restriction, Bot Detection, SSL
Traffic Control
Proxy Caching, Rate limit, Size limits, terminations
Logging & Analytics
Galileo, Datadog, Runscope
TCP, HTTP, File, Syslog, StatsD
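
The "modular with plugins" idea above can be sketched as a chain of small checks that every request passes through before it reaches the upstream API. The code below is an illustration of that pattern in plain Python, not Kong's plugin API: the `key_auth`, `rate_limit`, and `handle` functions and the dictionary request shape are hypothetical. It assumes a fixed one-minute counting window and takes the clock value as input so the behavior is deterministic.

```python
def key_auth(valid_keys):
    """Plugin factory: reject requests without a known API key (HTTP 401)."""
    def plugin(request):
        if request["api_key"] not in valid_keys:
            return 401
        return None  # None means "continue to the next plugin"
    return plugin


def rate_limit(per_minute):
    """Plugin factory: fixed-window limit per API key (HTTP 429 when exceeded)."""
    counters = {}  # (api_key, minute window) -> request count

    def plugin(request):
        window = (request["api_key"], int(request["now"] // 60))
        counters[window] = counters.get(window, 0) + 1
        if counters[window] > per_minute:
            return 429
        return None
    return plugin


def handle(request, plugins):
    """Run the request through each plugin; the first rejection wins."""
    for plugin in plugins:
        status = plugin(request)
        if status is not None:
            return status
    return 200  # stand-in for proxying to the upstream service


plugins = [key_auth({"secret"}), rate_limit(per_minute=2)]
statuses = [
    handle({"api_key": None, "now": 0.0}, plugins),       # no credentials
    handle({"api_key": "secret", "now": 0.0}, plugins),
    handle({"api_key": "secret", "now": 1.0}, plugins),
    handle({"api_key": "secret", "now": 2.0}, plugins),   # third call in the same minute
    handle({"api_key": "secret", "now": 61.0}, plugins),  # next minute window
]
```

Ordering matters in this sketch: authentication runs first, so unauthenticated calls never consume rate-limit quota.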

COLLABORATION, CI & CD
Plan
Projects, Boards, Issues, Milestones, Teams
Create
Merge, Preview, Commit, Branch, Lock, Discuss
Verify
Automated pipelines, graphs, history, scaling
Package
Built-in container registry
Release
Continuous integration & continuous deployment
Configure & Monitor

Infrastructure
Integrate Data
SCOPE
Discover & Visualize, Train Models, Productize Models

KUBERNETES
Portable Containers
Public, Private, Hybrid, or Multi-Cloud
Deployment
Automation, Co-Location, Storage Mounting, Secrets
Auto-*
-Scaling, -Healing, -Restart, -Placement, -Replication
Rolling Updates
Load Balancing
Service Discovery
Monitoring Resources
Accessing & Ingesting Logs
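
Several of the features above (replication, rolling updates, automatic restart) are declared in a single Deployment manifest. The sketch below builds such a manifest as a Python dictionary and serializes it to JSON. It assumes a cluster that serves the `apps/v1` API; the function name `build_deployment`, the service name `model-api`, the image `registry.example.com/model-api:1.0`, the port, and the `/healthz` probe path are all hypothetical.

```python
import json


def build_deployment(name, image, replicas=3, port=8080):
    """Build a Deployment manifest as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Replication: keep this many pods running.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Rolling updates: add one new pod at a time, never drop below replicas.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # Auto-restart: the container is restarted when this probe fails.
                            "livenessProbe": {
                                "httpGet": {"path": "/healthz", "port": port},
                                "periodSeconds": 10,
                            },
                        }
                    ]
                },
            },
        },
    }


manifest = build_deployment("model-api", "registry.example.com/model-api:1.0")
manifest_json = json.dumps(manifest, indent=2)
```

Writing `manifest_json` to a file and passing it to `kubectl apply -f` is one way to submit it, since `kubectl` accepts JSON as well as YAML.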

The Big Picture
• This is a complex, major enterprise platform
• It’s far from free: Cost is in integration, training & ops
• Why open source?
1. Often, outright better technology
2. Faster innovation
3. More native integrations
4. More books, talks, tutorials, posts & answers
5. Cheaper, both to begin and to scale

Common Questions
Q: Do I need it all on Day One?
A: No. Use what you need, know where it fits later.
Q: What if I already have another tool in place?
A: Keep it. Architecture is about incremental evolution.
Q: What if I don’t have the in-house knowledge?
A: Outsource, but require training & onboarding.
Q: What often gets overlooked?
A: Keeping components continuously up to date.

Summary: If you remember one thing…
Build the simplest platform that serves everyone required to turn science into $$$
Data Analyst, Data Scientist, App Developer, Data Engineer

david@pacific.ai
@davidtalby
in/davidtalby
THANK YOU!
