Observability for data pipelines
with Open Lineage
Julien Le Dem
CTO & Co-Founder Datakin
@J_
AGENDA
Open Lineage and Marquez
Why metadata?
Community
Why Metadata?
Need to create a healthy
data ecosystem
Team interdependencies
Team A, Team B, Team C
Today: Limited context
DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Quality
Data Freshness
Data Availability
Open Lineage
Problem Today:
● Duplication of effort: Each project has to instrument all jobs
● Integrations are external and can break with new versions
With Open Lineage
● Effort of integration is shared
● Integration can be pushed in each project: no need to play catch up
Purpose
- Open standard for metadata and lineage collection
- Instrument jobs as they are running
- Define a generic model of job/dataset/run entities
- Consistent naming strategies for jobs and datasets
- Define specific facets that can enrich those entities
Projects involved in Open Lineage (so far)
Open Lineage scope:
- Integrations: Warehouse, Schedulers, ...
- Metadata and lineage collection standard
Not in scope:
- Backend: HTTP client, Kafka client, GraphDB client, ...
- Kafka topic, Graph db, Consumers
Core Model
Consistent naming:
- Jobs:
Example: scheduler.job.task
- Datasets:
Example: instance.schema.table
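A minimal sketch of the naming patterns above, assuming names are simply dot-joined parts. The helper names and the sample identifiers are hypothetical and only illustrate the scheduler.job.task and instance.schema.table shapes.

```python
def job_name(scheduler: str, job: str, task: str) -> str:
    """Build a job name following the scheduler.job.task pattern."""
    return ".".join([scheduler, job, task])


def dataset_name(instance: str, schema: str, table: str) -> str:
    """Build a dataset name following the instance.schema.table pattern."""
    return ".".join([instance, schema, table])


# Hypothetical identifiers, chosen only for illustration.
example_job = job_name("airflow", "daily_bookings", "aggregate")
example_dataset = dataset_name("warehouse01", "public", "room_bookings")
print(example_job)
print(example_dataset)
```

Because every integration would derive names the same way, two tools that touch the same table end up referring to the same dataset.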
Facets
Facets are atomic pieces of metadata identified by a unique name that can be attached to the core entities.
Prefixes in facet names allow the definition of custom facets that can be promoted to the spec at a later point.
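One way to picture facets is as a name-to-body mapping on each entity. The sketch below assumes that representation; the `acme_` prefix, the facet bodies, and both helper functions are hypothetical and not taken from the spec.

```python
from typing import Any, Dict

Facets = Dict[str, Dict[str, Any]]


def attach_facet(entity: Dict[str, Any], name: str, body: Dict[str, Any]) -> None:
    """Attach one facet to an entity; a facet name may be used once per entity."""
    facets: Facets = entity.setdefault("facets", {})
    if name in facets:
        raise ValueError(f"facet {name!r} already attached")
    facets[name] = body


def is_custom(name: str, prefix: str) -> bool:
    """Treat a facet as custom when its name starts with the given prefix."""
    return name.startswith(prefix)


dataset: Dict[str, Any] = {"name": "instance.schema.table"}
attach_facet(dataset, "schema", {"fields": [{"name": "room_id", "type": "INTEGER"}]})
attach_facet(dataset, "acme_retention", {"days": 30})  # hypothetical custom facet

custom = sorted(n for n in dataset["facets"] if is_custom(n, "acme_"))
print(custom)
```

The prefix keeps an experimental facet from colliding with a standard name, which is what lets it be promoted later without renaming existing standard facets.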
Facet examples
Dataset:
- Stats
- Schema
- Version
- Column level lineage
Job:
- Source code
- Dependencies
- Parameters
- Source control
- Query plan
- Query profile
Run:
- Schedule time
- Batch id
Protocol
- Asynchronous events
- Unique ID for identifying a run and correlating events
- Configurable backend
- Kafka
- Http
- ...
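The protocol points above can be sketched with a queue and a worker thread. This is an illustration under assumptions: the `AsyncEmitter` class and the `run_id` / `event_type` field names are hypothetical, and an in-memory list stands in for a Kafka or HTTP backend.

```python
import queue
import threading
import uuid
from typing import Callable, Dict, List

Event = Dict[str, str]
Backend = Callable[[Event], None]  # stand-in for a Kafka producer or HTTP sender


class AsyncEmitter:
    """Queues events and delivers them to the backend on a worker thread."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, event: Event) -> None:
        self._queue.put(event)  # returns immediately; the job is not blocked

    def close(self) -> None:
        self._queue.put(None)  # sentinel: stop once earlier events are delivered
        self._worker.join()

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:
                return
            self._backend(event)


received: List[Event] = []
emitter = AsyncEmitter(received.append)  # in-memory backend for illustration
run_id = str(uuid.uuid4())  # shared by every event that belongs to this run
emitter.emit({"run_id": run_id, "event_type": "START"})
emitter.emit({"run_id": run_id, "event_type": "COMPLETE"})
emitter.close()
print(len(received))
```

Swapping the backend only means passing a different callable, and a consumer can group events by `run_id` to reconstruct each run.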
Lifecycle
- Create unique run id
- Run start event
- Send plan/profile info
- Run complete event
- Send output Dataset version updates
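The lifecycle steps above can be sketched as a wrapper around a job. The function name, event fields, event type labels, and sample dataset version are all hypothetical; the sketch only shows the ordering of the steps.

```python
import uuid
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def run_with_lineage(
    job_name: str,
    work: Callable[[], Dict[str, str]],
    emit: Callable[[Event], None],
    plan: str = "(plan or profile text)",
) -> str:
    """Run `work` and emit the lifecycle events in order; returns the run id."""
    run_id = str(uuid.uuid4())                                   # create unique run id
    emit({"run_id": run_id, "type": "START", "job": job_name})   # run start event
    emit({"run_id": run_id, "type": "PLAN", "plan": plan})       # plan/profile info
    output_versions = work()                                     # the job itself
    emit({"run_id": run_id, "type": "COMPLETE"})                 # run complete event
    emit({"run_id": run_id, "type": "OUTPUT_VERSIONS",           # output dataset
          "versions": output_versions})                          # version updates
    return run_id


events: List[Event] = []
run_id = run_with_lineage(
    "scheduler.job.task",
    lambda: {"instance.schema.table": "v2"},  # hypothetical output version
    events.append,
)
for event in events:
    print(event["type"])
```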
Join the conversation
Github: https://github.com/OpenLineage
Slack: OpenLineage.slack.com
Email: https://groups.google.com/g/openlineage
Open core summit: Observability for data pipelines with OpenLineage
Metadata:
● Data Operations
● Data Governance
● Data Discovery
http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf
● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram labels: Ingest, Storage, Compute; Streaming, Batch/ETL; Flink, Airflow, Kafka, Iceberg / S3, BI]
Marquez: Data model
[Diagram: Source, Dataset, Dataset Version, Job, Job Version, and Run entities linked by one-to-many relationships]
Source types:
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
Job types:
● BATCH
● STREAM
● SERVICE
Marquez: Data model
[Diagram: dataset versions linked to job versions v1 and v2]
● Debugging
○ What job version(s) produced and consumed dataset version X?
● Backfilling
○ Full / incremental processing
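The debugging question above becomes a simple lookup once each run records the dataset versions it read and wrote. This sketch assumes that record shape; the `Run` class, the helper functions, and the version labels are hypothetical, while the job and dataset names reuse the ones in this deck.

```python
from dataclasses import dataclass
from typing import List, Tuple

DatasetVersion = Tuple[str, str]  # (dataset name, version label)


@dataclass(frozen=True)
class Run:
    job: str
    job_version: str
    inputs: Tuple[DatasetVersion, ...]
    outputs: Tuple[DatasetVersion, ...]


def producers(runs: List[Run], target: DatasetVersion) -> List[Tuple[str, str]]:
    """Job versions whose runs wrote the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if target in r.outputs})


def consumers(runs: List[Run], target: DatasetVersion) -> List[Tuple[str, str]]:
    """Job versions whose runs read the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if target in r.inputs})


runs = [
    Run("room_bookings_7_days", "v1", (("room_bookings", "v1"),), ()),
    Run("room_bookings_7_days", "v2", (("room_bookings", "v2"),),
        (("room_bookings_aggs", "v1"),)),
]
print(producers(runs, ("room_bookings_aggs", "v1")))
print(consumers(runs, ("room_bookings", "v2")))
```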
Design benefits
● Centralized metadata management
○ Sources
○ Datasets
○ Jobs
● Modular framework
○ Data governance
○ Data lineage
○ Data discovery + exploration
Marquez: Design
[Architecture diagram labels: Metadata Service, Marquez, Core, Lineage, Search, REST API, Core API, Lineage collection APIs, Listener, Marquez UI, Integrations (ETL, Batch, Stream), Client-side, Extensions, datakin, Lineage analysis, Metadata, Core, DB, Graph, Storage]
Marquez: Metadata collection
01 Job v1
{
  "type":"BATCH",
  "name":"room_bookings_7_days",
  "inputs":[{
    "namespace":"datascience",
    "name":"room_bookings"
  }],
  "outputs":[],
  ...
}
02 Job v2
{
  "type":"BATCH",
  "name":"room_bookings_7_days",
  "inputs":[{
    "namespace":"datascience",
    "name":"room_bookings"
  }],
  "outputs":[{
    "namespace":"datascience",
    "name":"room_bookings_aggs"
  }],
  ...
}
[Diagram labels: Marquez API, LINEAGE, JOB, DATASET]
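The two payloads in this part of the deck differ only in their outputs, and that difference is what separates job v1 from job v2. One way to sketch this is to derive a version from the fields that define the job. The hashing scheme below is an assumption made for illustration and is not presented as how Marquez computes versions.

```python
import hashlib
import json
from typing import Any, Dict


def job_version(job: Dict[str, Any]) -> str:
    """Derive a version string from the fields that define the job's lineage."""
    relevant = {key: job[key] for key in ("type", "name", "inputs", "outputs")}
    canonical = json.dumps(relevant, sort_keys=True)  # stable key order
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


job_v1 = {
    "type": "BATCH",
    "name": "room_bookings_7_days",
    "inputs": [{"namespace": "datascience", "name": "room_bookings"}],
    "outputs": [],
}
# Same job, now declaring an output dataset.
job_v2 = dict(
    job_v1,
    outputs=[{"namespace": "datascience", "name": "room_bookings_aggs"}],
)

print(job_version(job_v1) == job_version(dict(job_v1)))
print(job_version(job_v1) == job_version(job_v2))
```

Resubmitting identical metadata maps to the same version, while a change to inputs or outputs yields a new one.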
Datakin leverages Marquez metadata
● Open Lineage and Marquez standardize metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram labels: Integrations, Graph, Lineage analysis, Datakin]
Community
https://marquezproject.github.io/marquez
Neutral
● Not controlled by a company
● Community driven
Community
● Build trust
● Grow adoption
● Everybody is on an equal footing
Governance
● Decision mechanisms
● Becoming a maintainer
● Code of Conduct
Part of the LF AI & Data foundation
github.com/MarquezProject/marquez
@MarquezProject
Thank You
