MLOps with a Feature Store
Filling the Gap in ML Infrastructure
Fabio Buso
Logical Clocks
@siroibaf
Data Science Milan Meetup
June 4th, 2020
Hopsworks, cloud-native & open-source
MLOps: CI/CD for ML models and data.
Feature Store: Definition, storage, and access of features.
Shared Feature Engineering Code: Well versioned feature engineering jobs.
Adhoc Scripts and Jobs: Data and code silos.
Journey to a Feature Store and Beyond
[Diagram, built up over four slides]
Data: Raw Data, Event Data, SQL Data, DATA LAKE
Stages: DATA PIPELINES, FEATURE PIPELINES, Hopsworks FEATURE STORE, TRAIN & VALIDATE, ONLINE MODEL SERVING, BATCH MODEL SCORING, MONITOR
Other labels: BI Platforms
Roles: Data Engineer, Data Scientist, ML Engineer
End to End ML Pipelines
● Logical Clocks – Hopsworks (world’s first open source)
● Uber Michelangelo
● Airbnb – Bighead/Zipline
● Comcast
● Twitter
● GO-JEK Feast (GCE, open-source layer over BigTable/BigQuery)
● Branch
● Conde Nast
● Facebook FB Learner
● Netflix
Reference: www.featurestore.org
Known Feature Stores in Production
numbers (in arrays), arrays (of numbers), one-hot encoding
Databases, Schemas, varchar, charsets, integer, blob, varbinary
A Data Engineer’s Perspective on Feature Engineering
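One-hot encoding, named on this slide, turns a text column of the kind a database stores as varchar into arrays of numbers. A minimal sketch in plain Python follows; the `one_hot` helper and the sample column values are illustrative assumptions, not code from the talk.

```python
def one_hot(values):
    """Encode each categorical value as a list of 0/1 flags, one per category."""
    categories = sorted(set(values))  # sorted so the column order is reproducible
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1         # set the flag for this value's category
        encoded.append(row)
    return categories, encoded


# A varchar column as a database might hold it (made-up values).
sex_column = ["male", "female", "female", "male"]
categories, encoded = one_hot(sex_column)
print(categories)
print(encoded)
```

Sorting the categories keeps the encoding stable between runs, which matters once the same encoding has to be reproduced at inference time.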
Feature Engineering is about Transforming Data
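from pyspark.ml.feature import Normalizer
# Assumes an active SparkSession named `spark` and a vector column "features"
scaledDF = spark.read.parquet("…")
# p=1 selects the L1 norm
l1_norm = Normalizer().setP(1.0).setInputCol("features").setOutputCol("l1_norm")
l1_norm.transform(scaledDF)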
Normalize
Feature Engineering is about Transforming Data
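To make the Normalize step concrete, here is a small sketch of what an L1 normalization does to a single feature vector, written in plain Python rather than Spark. The `l1_normalize` helper is hypothetical, and returning an all-zero vector unchanged is a choice made for this sketch.

```python
def l1_normalize(vector):
    """Scale a vector so that the absolute values of its entries sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        return list(vector)  # nothing to scale; sketch-level choice
    return [x / norm for x in vector]


normalized = l1_normalize([1.0, -1.0, 2.0])
print(normalized)
```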
Model Features Labels
TRAINING
Labels Features Model
INFERENCE
Feature Store
Get
Get
Consistent Features Between Training and Inference
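The point of this slide can be sketched with a toy in-memory store: training and inference both call the same `get`, so they cannot drift apart in how features are computed. `ToyFeatureStore`, the feature names, and the sample values are assumptions made for illustration and are not the Hopsworks API.

```python
class ToyFeatureStore:
    """In-memory stand-in for a feature store: feature values keyed by entity id."""

    def __init__(self):
        self._rows = {}

    def put(self, entity_id, features):
        self._rows[entity_id] = dict(features)

    def get(self, entity_id, names):
        row = self._rows[entity_id]
        return [row[name] for name in names]  # same order for every caller


FEATURES = ["pclass", "is_female"]

store = ToyFeatureStore()
store.put("p1", {"pclass": 3, "is_female": 0})
store.put("p2", {"pclass": 1, "is_female": 1})

# TRAINING: features come from the store, labels from elsewhere.
labels = {"p1": 0, "p2": 1}
training_rows = [(store.get(pid, FEATURES), label) for pid, label in labels.items()]

# INFERENCE: the same call yields the same feature vector for the same entity.
inference_vector = store.get("p2", FEATURES)
```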
The Feature Store as an API
Feature Store: Online Feature Store, Offline Feature Store
Ingest Data From: Event Data; Snowflake, Redshift, SQL; Delta Lake; S3, HDFS
Used By: Online Apps, Batch Apps, Create Train/Test Data
Ingestion cadences, from Low Latency Features to High Latency Features:
● Real-time feature transformations (<2 secs)
● Streaming App pushes click features every 5 secs
● Streaming App pushes CDC data every 30 secs
● Pandas App pushes user profile updates every hour
● Batch App pushes featurized weblogs data every day
Data sources: Real-Time Data, Event Data, SQL, S3, HDFS, SQL DW
Feature Store: Online Feature Store (<10ms), Offline Feature Store (TBs/PBs)
Consumers: Online App; Train, Batch App
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
Feature Groups are ingested at different Cadences
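The online + offline split can be sketched as a dual write: every ingested row is appended to a history that training jobs read, while a key-value view keeps only the freshest row per key for serving. `ToyDualFeatureStore` and the field names below are hypothetical; this is a toy model of the idea, not how Hopsworks implements it.

```python
class ToyDualFeatureStore:
    """Toy model of an online + offline feature store pair."""

    def __init__(self):
        self.offline = []  # append-only history, stands in for the scalable store
        self.online = {}   # latest row per key, stands in for the low-latency store

    def insert(self, key, event_time, features):
        row = {"key": key, "event_time": event_time, **features}
        self.offline.append(row)    # keep every version for training
        latest = self.online.get(key)
        if latest is None or event_time >= latest["event_time"]:
            self.online[key] = row  # serve only the freshest values

    def get_online(self, key):
        return self.online[key]

    def get_history(self, key):
        return [row for row in self.offline if row["key"] == key]


store = ToyDualFeatureStore()
store.insert("user_1", event_time=1, features={"clicks_5s": 2})
store.insert("user_1", event_time=2, features={"clicks_5s": 7})
store.insert("user_1", event_time=0, features={"clicks_5s": 1})  # late arrival

print(store.get_online("user_1"))
print(len(store.get_history("user_1")))
```

Comparing event times before overwriting means a late-arriving row still lands in the history but does not replace fresher values in the online view.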
Feature Store: ClickFeatureGroup, TableFeatureGroup, UserFeatureGroup, LogsFeatureGroup, RTFeatureGroup
Inputs: User Clicks, DB Updates, User Profile Updates, Weblogs, Real-time features
Data sources: Event Data, SQL DW, S3, HDFS, SQL
Ingestion paths: DataFrame API; Kafka Input, Flink, Kafka Output
Consumers: Online App; Train, Batch App
Simplify Ingestion to the Online/Offline Feature Stores by providing a general-purpose DataFrame API.
Feature Groups are ingested at different Cadences
from hops import featurestore as fs
df = ...  # Spark or Pandas DataFrame
# Do feature engineering on 'df'
# Register the DataFrame as a Feature Group
fs.create_featuregroup(df, "titanic_df",
                       description="Titanic passengers",
                       online=True)
Register a Feature Group with the Feature Store
Features: name, Pclass, Sex, Survive, Name, Balance
Feature Groups: Titanic Passenger List, Passenger Bank Account
Train / Test Datasets: Survive, name, PClass, Sex, Balance (combined via a Join key)
File format: .tfrecord, .npy, .csv, .hdf5, .petastorm, etc
Storage: GCS, Amazon S3, HopsFS
Features, Feature Groups, and Train/Test Datasets are all versioned
Feature Store Concepts
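How a train/test dataset is assembled from two feature groups through a join key can be sketched in plain Python. The rows, the `join_feature_groups` helper, and the choice of an inner join are assumptions for illustration; the slides do not say which join the Feature Store performs.

```python
# Two toy feature groups sharing the join key "name" (made-up rows).
passenger_list = [
    {"name": "Alice", "pclass": 1, "sex": "female", "survived": 1},
    {"name": "Bob", "pclass": 3, "sex": "male", "survived": 0},
]
bank_account = [
    {"name": "Alice", "balance": 1200.0},
    {"name": "Carol", "balance": 50.0},
]


def join_feature_groups(left, right, key):
    """Inner-join two lists of feature rows on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:  # keep only entities present in both groups
            joined.append({**row, **match})
    return joined


training_rows = join_feature_groups(passenger_list, bank_account, key="name")
print(training_rows)
```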
from hops import featurestore as fs
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance",
                               "Survived"])
fs.create_training_dataset(sample_data, "titanic_training_dataset",
                           data_format="tfrecords", training_dataset_version=1)
Create Training Datasets using the Feature Store
Online Application
1. Build a Feature Vector using the Online Feature Store (JDBC)
2. Send the Feature Vector to a Model for Prediction (Predict)
US-West-1a: MySQL, NDB1, Model
US-West-1b: MySQL, NDB2, Model
US-West-1c: MySQL, NDB3, Model
Latency labels: 2-20ms, ~5-50ms
Online Feature Store: High Availability & Low-Latency
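The two steps on this slide can be sketched as a primary-key lookup followed by a model call. The sketch uses Python's built-in `sqlite3` as a stand-in for the MySQL NDB online store reached over JDBC; the table name `passenger_features`, its columns, and `toy_model` are all hypothetical.

```python
import sqlite3

# Stand-in for the online feature store (assumption: one row per entity).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passenger_features ("
    "name TEXT PRIMARY KEY, pclass INTEGER, balance REAL)"
)
conn.executemany(
    "INSERT INTO passenger_features VALUES (?, ?, ?)",
    [("Alice", 1, 1200.0), ("Bob", 3, 15.5)],
)


def build_feature_vector(conn, name):
    """Step 1: fetch one entity's features with a primary-key lookup."""
    row = conn.execute(
        "SELECT pclass, balance FROM passenger_features WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no features stored for {name!r}")
    return list(row)


def toy_model(vector):
    """Step 2 placeholder: a hard-coded rule standing in for a served model."""
    pclass, balance = vector
    return 1 if pclass == 1 or balance > 1000 else 0


vector = build_feature_vector(conn, "Bob")
prediction = toy_model(vector)
print(vector, prediction)
```

A primary-key lookup is the access pattern that keeps step 1 cheap, since the application always asks for the features of one known entity.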
Hopsworks
[Architecture diagram]
Pipeline stages: Data Preparation & Ingestion; Experimentation & Model Training; Deploy & Productionalize
Around the platform: DATASOURCES, APPLICATIONS, API, DASHBOARDS
Components: Airflow, Apache Beam, Apache Spark, Apache Flink, Apache Kafka, Kafka + Spark Streaming, HOPSWORKS FEATURE STORE, Pip, Conda, TensorFlow, scikit-learn, PyTorch, Jupyter Notebooks, TensorBoard, HopsFS, Kubernetes
ML Infrastructure: The complete Picture
[Diagram, built up over three slides]
Stages: 1 Feature Engineering, 2 Feature Selection, 3 Training & Validation, 4 Serving, 5 Prediction
Components: Data Warehouse, Data Lake, Kafka, Feature Engineering, Offline Feature Store, Online Feature Store, Feature Selection, Train/Test Data (S3, HDFS, etc), Train, Scoring & Validation, Experiments, Model Repository, Deploy, Model Serving, Monitor, Feature Vector, Online Application, Batch Application
● Multi-Worker Training for TensorFlow (using PySpark): https://databricks.com/session/distributed-deep-learning-with-apache-spark-and-tensorflow
● Maggy: Async HParam Tuning and Parallel Ablation Studies (using PySpark): https://databricks.com/session_eu19/asynchronous-hyperparameter-optimization-with-apache-spark
● Project-Based Multi-Tenancy
● Implicit Provenance for ML Workflows: instrument instead of rewrite (TFX, MLflow), enabled by a CDC API
● Secure sensitive data on a shared cluster: Datasets, Hive DBs, Feature Stores, and Kafka Topics are all private to Projects, but can be shared.
● Conda environment per project (sane Python dependency management in a cluster).
More in Hopsworks
Hopsworks Community: Full Featured, AGPL-v3 License Model
Hopsworks Enterprise:
• Kubernetes Support: Model Serving, other services for robustness (Jupyter, more coming)
• Authentication (LDAP, Kerberos, OAuth2)
• GitHub support
Hopsworks.ai: Managed SaaS platform (currently only on AWS)
Trying out Hopsworks
Stockholm: Box 1263, Isafjordsgatan 22, Kista, Sweden
London: IDEALondon, 69 Wilson St, London, EC2A 2BB, UK
Silicon Valley: 470 Ramona St, Palo Alto, California, USA
WWW.LOGICALCLOCKS.COM
@hopsworks
http://github.com/logicalclocks/hopsworks
Show us some love!
