1
Overcoming
DataOps hurdles for
ML in Production
August 2020
SANDEEP UTTAMCHANDANI
CHIEF DATA OFFICER and VP OF ENGINEERING
sandeep@unraveldata.com
2
Behind the scenes of a ML Model in Production
3
DATA ML Model in
Production
Discover Prep Build Operationalize
DataOps
4
Top 10 DataOps Battlescars
Levels of
Automation
Gather technical metadata
Gather operational metadata
Aggregate tribal
knowledge
1. “I thought the attribute means something else”
Battlescar:
Incorrect assumptions about what an attribute means, whether the dataset is
the source of truth, who owns and commonly uses it, how it is versioned, and
whether it is trustworthy.
Metric:
Time to
Interpret
Building a Self-Service Metadata Catalog
Intuit
7
2. “Where is the dataset I need for my model?”
Battlescar:
Building a customer support forecasting model. Data was siloed across
business units. It took 4+ months of working with data stewards to locate
the data attributes required to build the model.
Building a Self-Service Search Service
Levels of
Automation
Indexing of datasets &
artifacts
Search Relevance ranking
Access control of
search results
Metric:
Time to
Find
9
3. “1000 rows in source database -- why only 50 rows in
data lake?”
Battlescar:
Issues with correctness, completeness, and timeliness when moving data
daily/hourly from transactional datastores to a centralized data lake.
Metric:
Time to
Move
Building a Self-Service Data Movement service
Data Ingestion Configuration
Data Transformation
Change Mgt
Levels of
Automation
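The "1000 rows in source, 50 rows in data lake" symptom can be caught with a completeness check after each movement run. The following is a minimal sketch, not the presenter's implementation: the `MovementCheck` class, the table names, and the 99% threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MovementCheck:
    """Row counts observed for one table after a data movement run (hypothetical)."""
    table: str
    source_rows: int
    lake_rows: int

    @property
    def completeness(self) -> float:
        # Fraction of source rows that arrived in the lake (1.0 when the source is empty).
        return 1.0 if self.source_rows == 0 else self.lake_rows / self.source_rows


def find_incomplete(checks, min_completeness=0.99):
    """Return the checks whose completeness falls below the threshold."""
    return [c for c in checks if c.completeness < min_completeness]


# Illustrative counts only; in practice these would come from the source
# database and the data lake after each ingestion run.
checks = [
    MovementCheck("orders", source_rows=1000, lake_rows=50),
    MovementCheck("customers", source_rows=200, lake_rows=200),
]
for check in find_incomplete(checks):
    print(f"{check.table}: {check.completeness:.0%} of source rows reached the lake")
```

A row-count comparison only covers completeness. Correctness and timeliness would need additional checks, such as checksums or arrival timestamps.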
10
4. “Job completed but dashboard graphs have data missing?”
Battlescar:
Jobs are orchestrated using schedulers (such as Airflow, Oozie). Often,
job dependencies are defined incorrectly, so reporting or model
training jobs are triggered prematurely.
Metric:
Time to
Orchestrate
Building a Self-Service orchestration Service
Levels of
Automation
Defining Job Dependencies
Robust Job Execution
Production
Monitoring
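To illustrate what "defining job dependencies" guards against, here is a small sketch that uses Python's standard `graphlib` module instead of a real scheduler. The job names and the `premature_jobs` helper are hypothetical. A scheduler such as Airflow or Oozie expresses the same idea through its own DAG definitions.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "build_features": {"ingest_orders", "ingest_customers"},
    "train_model": {"build_features"},
    "refresh_dashboard": {"build_features"},
}


def run_order(deps):
    """Return an execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def premature_jobs(order, deps):
    """List jobs that appear in `order` before one of their dependencies."""
    position = {job: i for i, job in enumerate(order)}
    return [job for job, upstream in deps.items()
            if any(position[up] > position[job] for up in upstream)]


order = run_order(dependencies)
print(order)
print(premature_jobs(order, dependencies))  # empty list when the order is valid
```

If a dependency edge is missing from the graph, the sorter has no way to know about it, which is exactly how a dashboard refresh ends up running before its data is ready.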
11
5. “Data processing was supposed to complete at 8 am. It's 4 pm
and my model retraining job is still waiting?”
Battlescar:
Writing efficient Big Data processing applications is non-trivial. With a
plethora of technologies, gaining broad expertise is difficult even for
expert data engineers.
Metric:
Time to
Optimize
Building a Self-Service query optimization Service
Levels of
Automation
Aggregating query, cluster,
resource Stats
Analyzing & correlating
stats
Tuning Jobs
12
6. “Customer changed preference to no marketing emails. Why are
we still including them in email campaigns?”
Battlescar:
Without a consistent primary key to identify the customer across data
silos, issues like this recur. Emerging data rights regulations such as
GDPR and CCPA require complying with customer preferences on what
data is collected, how it is used, and when it is deleted on request.
Metric:
Time to
Comply
Building a Self-Service data rights governance Service
Levels of
Automation
Tracking customer data lifecycle
and preferences
Executing customer’s
data rights requests
Use-case
based access
control
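The opt-out problem above comes down to joining silos on a shared customer key before building a campaign list. The sketch below is an assumption-heavy illustration: the record layouts are invented, and a normalized email address stands in for the consistent primary key, which a real platform would more likely obtain from an identity-resolution service.

```python
# Hypothetical records: two silos identify the same person differently.
crm_contacts = [
    {"crm_id": "C-1", "email": "ana@example.com"},
    {"crm_id": "C-2", "email": "bo@example.com"},
]
preference_center = [
    {"account_id": "A-9", "email": "ANA@example.com ", "marketing_opt_in": False},
]


def customer_key(email):
    """Derive a shared key; normalized email is a stand-in for a real identity service."""
    return email.strip().lower()


def campaign_recipients(contacts, preferences):
    """Keep only contacts that have not opted out of marketing emails."""
    opted_out = {customer_key(p["email"]) for p in preferences
                 if not p["marketing_opt_in"]}
    return [c["email"] for c in contacts
            if customer_key(c["email"]) not in opted_out]


print(campaign_recipients(crm_contacts, preference_center))
```

Without the shared key, the two records for the same customer never match, and the opt-out is silently ignored.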
13
7. “Job pipeline ran for 15 hours and now we detect a data
quality issue upon completion -- could we be proactive?”
Battlescar:
Data issues in a long-running, business-critical job lead to missing
insights. Only when the results don't look correct do we realize there is an
issue.
Metric:
Time to
Insights
Quality
Building a Self-Service data observability Service
Levels of
Automation
Verify accuracy of data
Detect anomalies
Avoid data
quality issues
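One way to be proactive is to validate each stage's output before the next stage starts, so a bad batch stops the pipeline early instead of after 15 hours. This is a minimal sketch under stated assumptions: the stage tuples, the field names, and the `DataQualityError` class are hypothetical and not taken from the talk.

```python
class DataQualityError(Exception):
    """Raised when a stage's output fails a check."""


def check_rows(rows, required_fields, min_rows):
    """Fail fast if the batch is too small or a required field is missing."""
    if len(rows) < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) is None]
        if missing:
            raise DataQualityError(f"row {i} is missing {missing}")
    return rows


def run_pipeline(rows, stages):
    """Run stages in order, validating each stage's output before the next one starts."""
    for name, transform, required_fields, min_rows in stages:
        try:
            rows = check_rows(transform(rows), required_fields, min_rows)
        except DataQualityError as err:
            raise DataQualityError(f"stage '{name}': {err}") from err
    return rows


# Hypothetical two-stage pipeline: parse raw events, then total the amounts.
stages = [
    ("parse",
     lambda rs: [{"user": r.get("user"), "amount": r.get("amt")} for r in rs],
     ["user", "amount"], 1),
    ("total",
     lambda rs: [{"total": sum(r["amount"] for r in rs)}],
     ["total"], 1),
]

raw = [{"user": "u1", "amt": 10}, {"user": "u2"}]  # second event lacks an amount
try:
    run_pipeline(raw, stages)
except DataQualityError as err:
    print(err)
```

The failure surfaces at the "parse" stage, before the aggregation runs on incomplete data.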
14
8. “Using the best polyglot datastores -- how do I now write
queries effectively across this data?”
Battlescar:
Significant time spent in planning, design, and writing queries that
process data across datastores
Metric:
Time to
Virtualize
Datastores
Building a Self-Service data virtualization Service
Levels of
Automation
Automatic query routing
Managing datastore
specific queries
Joining across
transactional
sources
15
9. “I ran an A/B experiment -- now I need to build time-consuming
data pipelines to analyze the data”
Battlescar:
Analyzing experimental results in a consistent fashion is a nightmare. There
are no consistent definitions between the metrics used for experimental
analysis and those used for business reporting.
Metric:
Time to A/B
Test
Building a Self-Service A/B Testing Service
Levels of
Automation
16
10. “Data processing jobs last week cost us 30% more. Why?”
Battlescar:
Especially in the cloud, cost scales linearly with usage. Tracking budgets
and spend well enough to optimize them requires non-trivial effort.
Metric:
Time to
Cost
Governance
Building a Self-Service cost governance Service
Levels of
Automation
Expenditure Observability
Matching
Supply-Demand
Continuous Cost
Optimization
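Answering "why did jobs cost 30% more last week?" starts with attributing the change to individual jobs. The sketch below assumes per-job weekly spend is already available; the job names and dollar figures are invented for illustration.

```python
def cost_change_by_job(last_week, this_week):
    """Return (job, delta) pairs sorted by largest cost increase first."""
    jobs = set(last_week) | set(this_week)
    deltas = {job: this_week.get(job, 0.0) - last_week.get(job, 0.0) for job in jobs}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly spend per job, in dollars.
last_week = {"etl_orders": 400.0, "train_model": 500.0, "reports": 100.0}
this_week = {"etl_orders": 420.0, "train_model": 780.0, "reports": 100.0}

total_before = sum(last_week.values())
total_after = sum(this_week.values())
print(f"total change: {(total_after - total_before) / total_before:+.0%}")
for job, delta in cost_change_by_job(last_week, this_week):
    print(f"{job}: {delta:+.2f}")
```

In this made-up data, nearly all of the increase traces back to a single job, which is the kind of signal expenditure observability is meant to surface.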
17
Wrap up: Advice on Managing your DataOps
18
People
Process Technology
DataOps hurdles vary and depend on...
19
Self-Service has levels (not binary)
20
Discover Prep Build Operationalize
TIME-TO-INSIGHT
Measuring Current DataOps:
Time-to-Insight Metric
DATA
21
Discover Prep Build Operationalize
Time-to-Insight Scorecard
22
Discover Prep Build Operationalize
Creating Your Time-to-Insight Scorecard
Legend: Weeks, Days, Hours
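A scorecard like this can be sketched as a mapping from each "Time to ..." metric to one of the legend buckets. The hour values below are invented. The bucket boundaries (24 hours, 7 days) and the choice to sum the metrics into an overall time-to-insight are assumptions for illustration, not definitions from the talk.

```python
def legend_bucket(hours):
    """Map a duration to the scorecard legend: Hours, Days, or Weeks."""
    if hours < 24:
        return "Hours"
    if hours < 24 * 7:
        return "Days"
    return "Weeks"


def scorecard(measured):
    """Classify each metric's measured duration into a legend bucket."""
    return {metric: legend_bucket(hours) for metric, hours in measured.items()}


def time_to_insight(measured):
    """Overall time-to-insight, assumed here to be the sum of the metrics."""
    return sum(measured.values())


# Hypothetical measurements in hours, keyed by metrics named in the deck.
measured_hours = {
    "Time to Interpret": 40,
    "Time to Find": 400,
    "Time to Move": 72,
    "Time to Orchestrate": 12,
}

for metric, bucket in scorecard(measured_hours).items():
    print(f"{metric}: {bucket}")
print(f"Time to insight: {time_to_insight(measured_hours)} hours")
```

Metrics that land in the "Weeks" bucket are natural candidates for the shortlist of one or two metrics to automate first.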
23
Call for Action: Making DataOps Self-Service
1. Measure
Create your
Time-to-Insight Scorecard
Self-Service
DataOps
2. Learn
Shortlist 1-2 scorecard
metrics to improve level
of automation
3. Build
Implement well-known
design patterns in your
data platform to make the
metrics self-service
24
Upcoming Book: The Self-Service Data Roadmap
Available Sept’20
Early Release Available on O’Reilly:
https://www.oreilly.com/library/view/the-self-service-data/9781492075240/
25
CONTACT US TO SCHEDULE A DATA OPERATIONS HEALTH CHECK TODAY
hello@unraveldata.com
