Using Databricks as an Analysis Platform
Anup Segu
Agenda
Extending Databricks to provide a robust analytics platform
Why a platform?
What is in our platform?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
YipitData’s Platform
YipitData Answers Key Investor Questions
▪ 70+ research products covering U.S. and international companies
▪ Email reports, Excel files, and data downloads
▪ Transaction data, web data, app data, targeted interviews, and adding more
▪ Clients include over 200 investment funds and Fortune 500 companies
▪ 53 data analysts and 3 data engineers
▪ 22 engineers total
▪ We are rapidly growing and hiring!
About Me
▪ Senior Software Engineer
▪ Manage platform for ETL workloads
▪ Based out of NYC
▪ linkedin.com/in/anupsegu
We Want Analysts To Own The Product
Data Collection → Data Exploration → ETL Workflows → Report Generation
Analysts: Answering Questions
Engineers: Providing a Platform
Python Library Inside Notebooks
Ingesting data
Wide range of table sizes and schemas
▪ 1 PB of compressed Parquet
▪ 60K tables
▪ 1.7K databases
Readypipe: From URLs To Parquet
▪ Repeatedly capture a snapshot of the website
▪ Websites frequently change
▪ Makes data available quickly for analysis
Glue Metastore
Streaming As JSON Is Great
▪ Append only data in S3
▪ We don’t know the schema ahead of time
▪ Only flat column types
s3://{json_bucket}/{project_name}/{table}/...
JSON Bucket Parquet Bucket
Kinesis Firehose
Parquet Makes Data “Queryable”
▪ Create or update databases, tables, and schemas as needed
▪ Partitioned by the date of ingestion
▪ Spark cluster subscribed to write events
s3://{parquet_bucket}/{project_name}/{table}/dt={date}...
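The partition layout above can be sketched in a few lines of Python. This is only an illustration: the function names, the `YYYY-MM-DD` date format, and the separate `database` argument are assumptions, and the elided tail of the path is left out. The SQL string uses the standard Hive/Spark `ALTER TABLE ... ADD PARTITION` form.

```python
from datetime import date


def parquet_partition_path(parquet_bucket: str, project_name: str,
                           table: str, ingested_on: date) -> str:
    """Build the S3 prefix for one ingestion-date partition.

    Assumption: dt is formatted as an ISO date (YYYY-MM-DD).
    """
    return (f"s3://{parquet_bucket}/{project_name}/{table}"
            f"/dt={ingested_on.isoformat()}/")


def add_partition_sql(database: str, table: str, ingested_on: date,
                      location: str) -> str:
    """Render the statement that registers the partition in the metastore."""
    return (f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{ingested_on.isoformat()}') "
            f"LOCATION '{location}'")
```

In a notebook, the rendered statement could be passed to `spark.sql(...)` once the Parquet files for that date have been written.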
Compaction = Greater Performance
▪ Insert files into new S3 locations
▪ Update partitions in Glue
▪ Pick appropriate column lengths for optimal file counts
s3://{parquet_bucket}/{project_name}/{table}/compacted/dt={date}...
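A minimal sketch of the compaction bookkeeping, not the code from the appendix slides. The 256 MiB target file size, the function names, and the `database` argument are assumptions chosen for illustration. The sizing rule shown is a generic bytes-per-file heuristic, not the column-length approach named in the bullet above.

```python
import math

# Assumption: aim for roughly 256 MiB per compacted Parquet file.
DEFAULT_TARGET_FILE_BYTES = 256 * 1024 * 1024


def target_file_count(partition_bytes: int,
                      target_file_bytes: int = DEFAULT_TARGET_FILE_BYTES) -> int:
    """Number of output files for a partition, never fewer than one."""
    return max(1, math.ceil(partition_bytes / target_file_bytes))


def compacted_location(parquet_bucket: str, project_name: str,
                       table: str, dt: str) -> str:
    """New S3 prefix that receives the compacted files for one date."""
    return f"s3://{parquet_bucket}/{project_name}/{table}/compacted/dt={dt}/"


def repoint_partition_sql(database: str, table: str, dt: str,
                          location: str) -> str:
    """Statement that points the existing partition at the new location."""
    return (f"ALTER TABLE {database}.{table} PARTITION (dt='{dt}') "
            f"SET LOCATION '{location}'")
```

With Spark, the file count would typically feed `df.repartition(n).write.parquet(location)` before the partition is repointed, so readers only switch to the compacted files once they are fully written.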
With 3rd Party Data, We Strive for Uniformity
▪ Various file formats
▪ Permissions challenges (403 Access Denied)
▪ Data lineage
▪ Data refreshes
Databricks Helps Manage 3rd Party Data
▪ Upload files and convert to Parquet with additional metadata
▪ Configure data access by assuming IAM roles within notebooks
Table Utilities
Table: Database + Name + Data
Table Hygiene Pays Off
▪ Validate table naming conventions
▪ Keep storage layer organized
▪ Maintain prior versions of tables
▪ Automate table maintenance
However, Our Team Is Focused On Analysis
so best practices are built into “create_table”
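To make the idea concrete, here is a hedged sketch of the kind of checks a helper like “create_table” might run before writing anything. It is not YipitData's implementation: the naming rule (lowercase snake_case), the `version=` path segment, and both function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: names are lowercase snake_case and start with a letter.
_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_name(name: str) -> str:
    """Return the name unchanged, or raise if it breaks the convention."""
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid database/table name: {name!r}")
    return name


def versioned_location(bucket: str, database: str, table: str,
                       created_at: datetime) -> str:
    """Storage prefix for one version of a table.

    Writing each version to its own prefix keeps prior versions
    available and the storage layer predictable.
    """
    validate_name(database)
    validate_name(table)
    version = created_at.strftime("%Y%m%dT%H%M%S")
    return f"s3://{bucket}/{database}/{table}/version={version}/"
```

Because the checks live inside the helper, analysts get the conventions for free each time they create a table.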
Cluster Management
Wide Range Of Options For Spark Clusters
▪ Hardware: driver instance, worker instances, EBS volumes
▪ Permissions: metastore, S3 access, IAM roles
▪ Spark configuration: runtime, Spark properties, environment variables
T-Shirt Sizes For Clusters
“SMALL”
▪ 3 r5.xlarge instances
▪ Warm instance pool for fast starts
“MEDIUM”
▪ 10 r5.xlarge instances
▪ Larger EBS volumes available if needed
“LARGE”
▪ 30 r5.xlarge instances
▪ Larger EBS volumes for heavy workloads
Standard IAM Roles, Metastore, S3 access, and Environment Variables
Launch Spark Jobs With Ease
Names Map To Databricks Configurations
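One way to picture the mapping is a lookup from size name to a cluster specification. This is a sketch under assumptions: the instance counts are treated as worker counts, the function name is hypothetical, and the runtime version is supplied by the caller. The keys `spark_version`, `node_type_id`, and `num_workers` are fields of the Databricks cluster specification.

```python
# Assumption: the slide's instance counts are used as worker counts.
CLUSTER_SIZES = {"SMALL": 3, "MEDIUM": 10, "LARGE": 30}


def new_cluster_spec(size: str, spark_version: str) -> dict:
    """Translate a T-shirt size into a minimal cluster specification."""
    try:
        num_workers = CLUSTER_SIZES[size.upper()]
    except KeyError:
        raise ValueError(f"unknown cluster size: {size!r}") from None
    return {
        "spark_version": spark_version,  # runtime chosen by the platform team
        "node_type_id": "r5.xlarge",
        "num_workers": num_workers,
    }
```

Settings shared by every size (IAM roles, metastore, S3 access, environment variables) would be merged into this dictionary in one place.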
Databricks Does The Heavy Lifting
▪ Provisions compute resources via a REST API
▪ Scales instances for cluster load
▪ Applies a wide range of Spark optimizations
ETL Workflow Automation
Airflow Is Our Preferred ETL Tool
Requires someone to manage this code
We use the Databricks API to construct DAGs programmatically
1 DAG = 1 Folder, 1 Task = 1 Notebook
Templated Notebooks For DAGs
/folder
- commands
- notebook_a
- notebook_b
- notebook_c
Translate Notebooks Into DAG files
/api/2.0/workspace/list
/api/2.0/workspace/export
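A rough sketch of the translation step, working on the JSON that `/api/2.0/workspace/list` returns (an `objects` array whose entries carry `path` and `object_type`). The function names are hypothetical, and running tasks in alphabetical order is an assumption. The deck does not say how task order is defined or what role the `commands` notebook plays, so the sketch simply allows names to be excluded.

```python
from typing import Dict, Iterable, List, Tuple


def notebook_task_ids(list_response: Dict,
                      exclude: Iterable[str] = ()) -> List[str]:
    """One task id per notebook in the folder, sorted alphabetically."""
    excluded = set(exclude)
    names = [
        obj["path"].rsplit("/", 1)[-1]
        for obj in list_response.get("objects", [])
        if obj.get("object_type") == "NOTEBOOK"
    ]
    return sorted(name for name in names if name not in excluded)


def linear_dependencies(task_ids: List[str]) -> List[Tuple[str, str]]:
    """Pair each task with the task that should run after it."""
    return list(zip(task_ids, task_ids[1:]))
```

Each `(upstream, downstream)` pair would then become an Airflow dependency when the DAG file is generated.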
Automatically Create Workflows
▪ Pipelines are deployed without engineers
▪ Robust logging and error handling
▪ Easy to modify DAGs
▪ All happens within Databricks
Task A
Task B
Task C
Platform Visibility
Tailored Monitoring Solutions
Standardize Logs As Data
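As an illustration of treating logs as data, each job event could be written as one structured record with a fixed set of fields, so that the log can later be queried like any other table. The record shape below is entirely hypothetical; the deck does not list the fields it captures.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class JobLogRecord:
    """Hypothetical schema for one task run."""
    dag_id: str
    task_id: str
    status: str
    started_at: str  # ISO 8601 timestamp
    duration_seconds: float


def to_json_line(record: JobLogRecord) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

A fixed schema is what makes the records easy to load into a table and chart in a notebook.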
Visualize Logs In Notebooks
A Platform Invites New Solutions
▪ Establish standard queries and notebooks
▪ Trigger one DAG from another
▪ Trigger reporting processes after ETL jobs
Thank You
Interested in working with us?
We are hiring!
yipitdata.com/careers
Appendix I: Compaction Code
Compacting Partitions
Compacting Partitions (cont.)
Compacting Partitions (cont.)
Appendix II: Table Creation Code
Capturing metadata with source data
Creating a table
Creating a table (cont.)
Creating a table (cont.)
Creating a table (cont.)
Appendix III: Databricks Jobs Code
Create a Databricks Job
Create a Databricks Job (cont.)
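The code on these slides did not survive extraction. As a stand-in, here is a minimal sketch of a request body for the Databricks Jobs API 2.0 `jobs/create` endpoint that runs a single notebook. The function name is hypothetical; `name`, `new_cluster`, and `notebook_task` (with `notebook_path` and `base_parameters`) are fields of that API.

```python
from typing import Dict, Optional


def job_create_payload(name: str, notebook_path: str, new_cluster: Dict,
                       parameters: Optional[Dict[str, str]] = None) -> Dict:
    """Build the JSON body for a job that runs one notebook on a new cluster."""
    return {
        "name": name,
        "new_cluster": new_cluster,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": dict(parameters or {}),
        },
    }
```

The dictionary would be sent as JSON in a POST to `/api/2.0/jobs/create` on the workspace host, authenticated with a bearer token.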
Appendix IV: Airflow Code
Automatic DAG Creation
Automatic DAG Creation (cont.)
Automatic DAG Creation (cont.)
Automatic DAG Creation (cont.)