Engineering Machine Learning Data Pipelines
Tracking Data Lineage
Paige Roberts
Integrate Product Marketing Manager
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis and machine learning techniques, finding answers to business questions buried in datasets.
• Does NOT want to spend 50–90% of their time tinkering with data to get it into good shape to train models, but frequently does, especially if there’s no data engineer on the team.
• Once a machine learning model is trained, tested, and proven to accomplish the goal, turns it over to a data engineer to productionize. Not skilled at taking the model from a test sandbox into production, especially not at large scale.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data, but usually has more on their plate than they can keep up with. Anything that speeds up their work is helpful.
• In most successful companies, is involved from the beginning: first gathers, cleans, and standardizes data, helps the data scientist with feature engineering, and provides top-notch data, ready to train models.
• After the model is tested, builds robust, high-scale data pipelines that feed the models the data they need, in the correct format, in production to provide ongoing business value.
Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult to Access Datasets
Much of the necessary data is trapped in mainframes or streams in from point-of-sale (POS) systems, web clicks, and other sources, all in
incompatible formats, making it difficult to gather and prepare the data for model training.
2. Data Cleansing at Scale
Data quality cleansing and preparation routines have to be reproduced at scale. Most data quality tools
are not designed to work on that scale of data.
3. Entity Resolution
Distinguishing matches across massive datasets that indicate a single specific entity (person, company,
product, etc.) requires sophisticated multi-field matching algorithms and a lot of compute power.
Essentially everything has to be compared to everything else.
4. Tracking Lineage from the Source
Data changes made to help train models have to be exactly duplicated in production, both so that models
make accurate predictions on new data and to meet audit-trail requirements. Complete lineage must be
captured, from source to end point.
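Challenge 3 says that essentially everything has to be compared to everything else. The sketch below illustrates why that is expensive and one common way to reduce the work. It is only an illustration under stated assumptions: the record fields, the field weights, the 0.85 threshold, and all function names are hypothetical, and the "blocking" step (comparing only records that share a key such as ZIP code) is a widely used technique that the slides do not mention.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(r1: dict, r2: dict, weights: dict) -> float:
    """Multi-field match score: weighted sum of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in weights.items())


def naive_pairs(records: list) -> list:
    """Every record against every other record: n * (n - 1) / 2 pairs."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records: list, key) -> list:
    """Only pair records that share a blocking key (an assumed shortcut)."""
    blocks = defaultdict(list)
    for i, record in enumerate(records):
        blocks[key(record)].append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


def find_matches(records, pairs, weights, threshold=0.85):
    """Keep the candidate pairs whose weighted score reaches the threshold."""
    return [
        (i, j)
        for i, j in pairs
        if record_score(records[i], records[j], weights) >= threshold
    ]


# Hypothetical sample data, for illustration only.
records = [
    {"name": "Jon Smith", "city": "Boston", "zip": "02108"},
    {"name": "John Smith", "city": "Boston", "zip": "02108"},
    {"name": "Maria Garcia", "city": "Austin", "zip": "73301"},
    {"name": "Jon Smyth", "city": "Denver", "zip": "80202"},
]
weights = {"name": 0.6, "city": 0.2, "zip": 0.2}  # assumed weights

all_pairs = naive_pairs(records)
candidate_pairs = blocked_pairs(records, key=lambda r: r["zip"])
matches = find_matches(records, candidate_pairs, weights)

print(len(all_pairs), "pairs without blocking")
print(len(candidate_pairs), "pairs with blocking on zip")
print("matched record indexes:", matches)
```

With four records the naive approach already needs six comparisons, and the count grows quadratically with the number of records, which is where the compute cost comes from. Blocking trades some recall for speed: records that differ in the blocking key are never compared.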
End-to-End Data Lineage
Diagram: Data Sources feed the Data Lake, which feeds Analytics, Visualization, and Machine Learning with complete data.
• Access data from streaming and batch sources outside the cluster.
• Onboard data, modifying it on the fly to match the Hadoop storage model, or store it unchanged for archive and compliance.
• Transform, join, cleanse, and enhance data in the cluster with MapReduce or Spark.
• Pass source-to-cluster data lineage info to Navigator or Atlas via the data lineage REST API.
• Navigator or Atlas gathers any other changes made to data on the cluster by MapReduce, Spark, or HiveQL.
• Auditors get end-to-end data lineage.
• Analytics, visualizations, and machine learning algorithms get ALL necessary data.
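The flow above depends on recording each change made to the data so that the same changes can be replayed in production and shown to auditors. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the `LineageEvent` fields, the `TrackedPipeline` class, the step functions, and the JSON layout are hypothetical, and this is not the payload format of Cloudera Navigator or Apache Atlas. A real integration would translate such events into whatever the metadata service's REST API expects.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(rows: list) -> str:
    """Short, deterministic hash of a batch, used to link steps together."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageEvent:
    """One recorded change (hypothetical schema, not a Navigator/Atlas type)."""
    step: str
    source: str
    target: str
    input_fingerprint: str
    output_fingerprint: str


class TrackedPipeline:
    """Applies named steps in a fixed order and records an event per step."""

    def __init__(self, source_name: str, steps: list):
        self.source_name = source_name
        self.steps = steps  # list of (step_name, function) tuples

    def run(self, rows: list):
        events = []
        current = self.source_name
        for name, fn in self.steps:
            out = fn(rows)
            target = f"{self.source_name}.{name}"
            events.append(
                LineageEvent(name, current, target,
                             fingerprint(rows), fingerprint(out))
            )
            rows, current = out, target
        return rows, events


def trim_whitespace(rows: list) -> list:
    """Strip leading and trailing whitespace from every field."""
    return [{k: v.strip() for k, v in r.items()} for r in rows]


def uppercase_state(rows: list) -> list:
    """Standardize the state code to upper case."""
    return [{**r, "state": r["state"].upper()} for r in rows]


# The same pipeline object is used for training and production batches,
# so both receive exactly the same changes in the same order.
pipeline = TrackedPipeline(
    "crm_extract",  # hypothetical source name
    [("trim_whitespace", trim_whitespace), ("uppercase_state", uppercase_state)],
)

training_out, training_events = pipeline.run([{"name": " Ann ", "state": "tx"}])
production_out, production_events = pipeline.run([{"name": "Bob  ", "state": "ny"}])

# Serialized events that could be handed to a lineage service.
print(json.dumps([asdict(e) for e in training_events], indent=2))
```

Because each event stores the fingerprint of its input and output, an auditor can check that the output of one step is the input of the next, from the original source through to the data the model consumes.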
Syncsort Published Lineage in Cloudera Navigator
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source
