Open core summit: Observability for data pipelines with OpenLineage

Observability for data pipelines
with Open Lineage
Julien Le Dem
CTO & Co-Founder Datakin
@J_

AGENDA
Open Lineage and Marquez
Why metadata?
Community

Need to create a healthy
data ecosystem

Team interdependencies
Team A Team B
Team C

Today: Limited context
DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?

Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Quality
Data Freshness
Data Availability

Problem Today:
● Duplication of effort: Each project has to instrument all jobs
● Integrations are external and can break with new versions
With Open Lineage:
● Effort of integration is shared
● Integration can be pushed in each project: no need to play catch up

Purpose
- Open standard for metadata and lineage collection
- Instrument jobs as they are running
- Define a generic model of job, dataset, and run entities
- Consistent naming strategies for jobs and datasets
- Define specific facets that can enrich those entities

Projects involved in Open Lineage (so far)

Open Lineage scope / Not in scope
[Diagram labels]
Integrations: Warehouse, Schedulers, ...
Metadata and lineage collection standard
Backend: Kafka topic, Graph db, HTTP client
Consumers: Kafka client, GraphDB client, ...

Core Model
Consistent naming:
- Jobs:
Example: scheduler.job.task
- Datasets:
Example: instance.schema.table
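A minimal sketch of the naming pattern above, assuming a name is simply its parts joined with dots. The helper names `job_name` and `dataset_name` and the sample values are hypothetical; they are not part of any OpenLineage library.

```python
def job_name(scheduler: str, job: str, task: str) -> str:
    """Build a job name following the scheduler.job.task pattern."""
    return ".".join([scheduler, job, task])


def dataset_name(instance: str, schema: str, table: str) -> str:
    """Build a dataset name following the instance.schema.table pattern."""
    return ".".join([instance, schema, table])


# Sample values are made up for illustration.
print(job_name("airflow", "daily_etl", "load_bookings"))
print(dataset_name("warehouse", "datascience", "room_bookings"))
```

Deriving names from the same parts in every integration is what lets events from different tools refer to the same job or dataset.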

Facets
Facets are atomic pieces of metadata identified by a unique name
that can be attached to the core entities.
Prefixes in facet names allow the definition of Custom facets that
can be promoted to the spec at a later point.
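One way to picture facets is as a dictionary keyed by facet name on each core entity. The sketch below assumes that shape; the `Entity` class, the `acme_` prefix, and the facet payloads are hypothetical and do not come from the spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entity:
    """A core entity (job, dataset, or run) carrying named facets."""
    name: str
    facets: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def attach(self, facet_name: str, payload: Dict[str, Any]) -> None:
        # Facet names are unique per entity: attaching again replaces the facet.
        self.facets[facet_name] = payload


def is_custom(facet_name: str, prefix: str = "acme_") -> bool:
    # Hypothetical convention: custom facets carry a vendor prefix.
    return facet_name.startswith(prefix)


dataset = Entity("instance.schema.table")
dataset.attach("schema", {"fields": [{"name": "id", "type": "BIGINT"}]})
dataset.attach("acme_retention", {"days": 30})
custom = sorted(n for n in dataset.facets if is_custom(n))
print(custom)
```

Because a custom facet is only distinguished by its prefixed name, promoting it to the spec later amounts to agreeing on an unprefixed name for the same payload.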

Facet examples
Dataset:
- Stats
- Schema
- Version
- Column level lineage
Job:
- Source code
- Dependencies
- params
- Source control
- Query plan
- Query profile
Run:
- Schedule time
- Batch id

Protocol
- Asynchronous events
- Unique id to identify a run and correlate events
- Configurable backend
- Kafka
- HTTP
- ...
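The protocol points above can be sketched as events that share a run id and are handed to a pluggable backend. This is an illustration only: the `Backend` interface, the `InMemoryBackend` stand-in, and the `runId` and `eventType` field names are assumptions, not the OpenLineage wire format.

```python
import json
import uuid
from typing import List, Protocol


class Backend(Protocol):
    """Anything that can deliver a serialized event (Kafka, HTTP, ...)."""
    def send(self, payload: str) -> None: ...


class InMemoryBackend:
    """Stand-in for a Kafka or HTTP backend; it only records payloads."""
    def __init__(self) -> None:
        self.sent: List[str] = []

    def send(self, payload: str) -> None:
        self.sent.append(payload)


def emit(backend: Backend, run_id: str, event_type: str) -> None:
    # Every event carries the run id so a consumer can correlate them later.
    backend.send(json.dumps({"runId": run_id, "eventType": event_type}))


backend = InMemoryBackend()
run_id = str(uuid.uuid4())
emit(backend, run_id, "START")
emit(backend, run_id, "COMPLETE")
run_ids = {json.loads(p)["runId"] for p in backend.sent}
```

Swapping the backend changes where events go without touching the code that emits them, which is the point of keeping it configurable.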

Lifecycle
- Create unique run id
- Run start event
- Send plan/profile info
- Run complete event
- Send output Dataset version updates
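The lifecycle steps above can be sketched as a wrapper around a job. The event type names (`START`, `PLAN`, `COMPLETE`, `OUTPUTS`), the dictionary layout, and the `run_with_lineage` helper are hypothetical; the sketch only shows the ordering and the shared run id.

```python
import uuid
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def run_with_lineage(job: Callable[[], Dict[str, str]], sink: List[Event]) -> str:
    """Run `job` and append lifecycle events to `sink`, in order."""
    run_id = str(uuid.uuid4())                        # create unique run id
    sink.append({"run": run_id, "type": "START"})     # run start event
    sink.append({"run": run_id, "type": "PLAN",       # plan/profile info
                 "plan": "<placeholder plan>"})
    outputs = job()                                   # the actual work
    sink.append({"run": run_id, "type": "COMPLETE"})  # run complete event
    sink.append({"run": run_id, "type": "OUTPUTS",    # output dataset versions
                 "versions": outputs})
    return run_id


events: List[Event] = []
# The job here is a stub that reports one made-up output dataset version.
rid = run_with_lineage(lambda: {"instance.schema.table": "v2"}, events)
print([e["type"] for e in events])
```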

Join the conversation
GitHub: https://github.com/OpenLineage
Slack: OpenLineage.slack.com
Email: https://groups.google.com/g/openlineage


Data Operations
Data Governance
Data Discovery

http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf

Metadata:
● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram labels: Ingest, Storage, Compute, Streaming, Batch/ETL, Flink, Airflow, Kafka, Iceberg / S3, BI]

Marquez: Data model
[Diagram: Source, Dataset, Dataset Version, Job, Job Version, and Run, connected by one-to-many (1 to *) relationships]
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
● BATCH
● STREAM
● SERVICE

Marquez: Data model
[Diagram: dataset versions (v1, v2, v4) alongside job versions (v1, v2)]
Design benefits
● Debugging
○ What job version(s) produced and consumed dataset version X?
● Backfilling
○ Full / incremental processing
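The debugging question above becomes a simple lookup once each run records which dataset versions it read and wrote. The sketch below assumes such records exist in memory; the `Run` class, the `report` job, and the version labels are made up for illustration and are not the Marquez schema or API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

DatasetVersion = Tuple[str, str]  # (dataset name, version)


@dataclass(frozen=True)
class Run:
    job: str
    job_version: str
    inputs: Tuple[DatasetVersion, ...]   # dataset versions read
    outputs: Tuple[DatasetVersion, ...]  # dataset versions written


def producers(runs: List[Run], dataset: str, version: str) -> Set[Tuple[str, str]]:
    """Job versions whose runs wrote the given dataset version."""
    return {(r.job, r.job_version) for r in runs if (dataset, version) in r.outputs}


def consumers(runs: List[Run], dataset: str, version: str) -> Set[Tuple[str, str]]:
    """Job versions whose runs read the given dataset version."""
    return {(r.job, r.job_version) for r in runs if (dataset, version) in r.inputs}


runs = [
    Run("room_bookings_7_days", "v1",
        (("room_bookings", "v4"),), (("room_bookings_aggs", "v1"),)),
    Run("room_bookings_7_days", "v2",
        (("room_bookings", "v4"),), (("room_bookings_aggs", "v2"),)),
    Run("report", "v1", (("room_bookings_aggs", "v2"),), ()),
]
print(producers(runs, "room_bookings_aggs", "v2"))
print(consumers(runs, "room_bookings_aggs", "v2"))
```

The same records support backfilling: the consumers of a bad dataset version are the runs that need to be repeated.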

Marquez: Design
Metadata Service
● Centralized metadata management
○ Sources
○ Datasets
○ Jobs
● Modular framework
○ Data governance
○ Data lineage
○ Data discovery + exploration
[Diagram labels: Marquez, Core, Lineage, Search, REST API, ETL, Batch, Stream]
[Diagram labels: Extensions, datakin, Lineage analysis, Lineage collection, APIs, Integrations, Client-side, Metadata, Core, DB, Graph, Storage, Marquez UI, Listener, Core API]

Marquez: Metadata collection
01 Job v1
{
  "type":"BATCH",
  "name":"room_bookings_7_days",
  "inputs":[{
    "namespace":"datascience",
    "name":"room_bookings"
  }],
  "outputs":[],
  ...
}
[Diagram labels: LINEAGE, JOB, DATASET]

Marquez: Metadata collection
01 Job v1
{
  "type":"BATCH",
  "inputs":[{
  }],
  "outputs":[],
  ...
}
02 Job v2
{
  "type":"BATCH",
  "inputs":[{
  }],
  "outputs":[{
    "name":"room_bookings_aggs"
  }],
  ...
}
[Diagram labels: LINEAGE, JOB, DATASET]
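To show how lineage can be read off job metadata like the payloads above, here is a small sketch. It assumes job v2 keeps the same input as v1 (the slide elides it) and the `lineage_edges` helper is hypothetical; it is not the Marquez API.

```python
import json
from typing import Any, Dict, List, Tuple

# Job v1 as shown on the slide, with the elided fields left out.
job_v1: Dict[str, Any] = json.loads("""
{"type": "BATCH", "name": "room_bookings_7_days",
 "inputs": [{"namespace": "datascience", "name": "room_bookings"}],
 "outputs": []}
""")

# Assumption: v2 keeps the same input and adds the output shown on the slide.
job_v2: Dict[str, Any] = dict(job_v1, outputs=[{"name": "room_bookings_aggs"}])


def lineage_edges(job: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (source, target) edges: input datasets -> job -> output datasets."""
    edges = [(d["name"], job["name"]) for d in job["inputs"]]
    edges += [(job["name"], d["name"]) for d in job["outputs"]]
    return edges


print(lineage_edges(job_v1))
print(lineage_edges(job_v2))
```

Version 1 yields only the input edge; once version 2 declares an output, a second edge appears and the dataset-to-job-to-dataset chain is complete.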

Datakin leverages Marquez metadata
● Open Lineage and Marquez standardize metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram labels: Marquez API, Datakin, Lineage analysis, Graph, Integrations]

https://marquezproject.github.io/marquez

Part of the LF AI & Data foundation
Neutral
● Not controlled by a company
● Community driven
Community
● Build trust
● Grow adoption
● Everybody is on an equal footing
Governance
● Decision mechanisms
● Becoming a maintainer
● Code of Conduct

github.com/MarquezProject/marquez
@MarquezProject
