Data Lake Demonstration
Building Data Lakes with Apache Airflow
Gary A. Stafford
Twitter/LinkedIn: GaryStafford
Blog: garystafford.medium.com
Agenda
What is a Data Lake?
Dataset
Architecture
Source Code
Demonstration
What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.” - Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS
Dataset
TICKIT database
E-commerce platform
Brings together buyers and sellers of tickets to entertainment events
Designed to demonstrate the Amazon Redshift cloud data warehouse
Small database consisting of seven tables: two fact tables and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
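The fact/dimension split above can be sketched as a small lookup. This is only an illustration: it assumes, following the AWS TICKIT sample database documentation, that Sales and Listings are the two fact tables. The names `TABLE_ROLES` and `tables_with_role` are hypothetical and are not taken from the demo's source code.

```python
from collections import Counter

# Role of each TICKIT table, using the plural names from the slide.
# Assumption: Sales and Listings are the fact tables; the rest are dimensions.
TABLE_ROLES = {
    "Categories": "dimension",
    "Events": "dimension",
    "Venues": "dimension",
    "Users": "dimension",
    "Dates": "dimension",
    "Listings": "fact",
    "Sales": "fact",
}


def tables_with_role(role: str) -> list[str]:
    """Return the table names that have the given role, sorted alphabetically."""
    return sorted(name for name, r in TABLE_ROLES.items() if r == role)


# Count how many tables fall into each role.
role_counts = Counter(TABLE_ROLES.values())
```

Calling `tables_with_role("fact")` lists the tables whose rows record transactions, while the dimension tables describe the who, what, where, and when of those transactions.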
Table | Simulated Datasource | Demo Datasource
Category | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Event | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Venue | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Listing | COTS E-commerce Platform | Amazon RDS for MySQL
Sales | COTS E-commerce Platform | Amazon RDS for MySQL
Date | COTS E-commerce Platform | Amazon RDS for MySQL
Users | Custom Customer Relationship Management (CRM) | Amazon RDS for SQL Server
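Because each simulated source is backed by a different database engine, ingestion is naturally organized per engine. The sketch below encodes the table-to-datasource mapping from the slide and groups tables by demo datasource. The `TableSource` class and `tables_by_engine` helper are hypothetical names used for illustration only; they do not come from the demo repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSource:
    """One row of the slide's table: a TICKIT table and where it lives."""
    table: str
    simulated_source: str
    demo_engine: str


SAAS = "Software as a Service (SaaS) 3rd Party Provider"
ECOM = "COTS E-commerce Platform"
CRM = "Custom Customer Relationship Management (CRM)"

SOURCES = [
    TableSource("Category", SAAS, "Amazon RDS for PostgreSQL"),
    TableSource("Event", SAAS, "Amazon RDS for PostgreSQL"),
    TableSource("Venue", SAAS, "Amazon RDS for PostgreSQL"),
    TableSource("Listing", ECOM, "Amazon RDS for MySQL"),
    TableSource("Sales", ECOM, "Amazon RDS for MySQL"),
    TableSource("Date", ECOM, "Amazon RDS for MySQL"),
    TableSource("Users", CRM, "Amazon RDS for SQL Server"),
]


def tables_by_engine(sources: list[TableSource]) -> dict[str, list[str]]:
    """Group table names by the demo database engine that hosts them."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for source in sources:
        grouped[source.demo_engine].append(source.table)
    return dict(grouped)
```

A grouping like this could drive one connection and one set of extraction tasks per engine, rather than one per table.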
Architecture
Architecture: AWS Services Used
Amazon Simple Storage Service (Amazon S3)
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with AWS DMS or Kafka Connect)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Apache Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon Managed Workflows for Apache Airflow (MWAA) (alt. AWS Step Functions)
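One way to picture how an orchestrator such as Airflow ties these services together is as a dependency graph of stages. The slides do not spell out the exact task order, so the stage names and dependencies below are assumptions made for illustration. The sketch uses Python's standard-library `graphlib` instead of the Airflow API so that it runs without an Airflow installation.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline stages; each key maps to the set of stages it depends on.
# Comments note which AWS service from the list above might perform the stage.
PIPELINE = {
    "crawl_sources": set(),                     # AWS Glue Crawlers
    "ingest_to_s3": {"crawl_sources"},          # AWS Glue Jobs writing to Amazon S3
    "catalog_lake_data": {"ingest_to_s3"},      # AWS Glue Data Catalog
    "query_with_athena": {"catalog_lake_data"}, # Amazon Athena
}


def run_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for stage in run_order(PIPELINE):
        print(stage)
```

In an Airflow DAG the same idea is expressed by declaring tasks and their upstream/downstream relationships; the scheduler then runs each task only after its dependencies succeed.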
Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): Handling changes to systems of record (SoR)
Transactional Storage Layer: Managing changes to the SoR in the data lake
Streaming Data: Data continuously generated by different sources
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data’s lifecycle as it flows from sources to consumption
Data Discovery/Inspection: Scanning data for sensitive or unexpected content (PII)
DataOps: Automating testing, deployment, job execution
Infrastructure as Code (IaC): Infrastructure provisioning automation
Data Warehousing (Lake House architecture)
Data Lake Storage Tiering, Archival, and Backup
Source Code
github.com/garystafford/tickit-data-lake-demo
Demonstration
