Managing & Scaling Data Pipelines with Databricks
Esha Shah, Senior Data Engineer
Richa Singhal, Senior Data Engineer
Atlassian, Go-To-Market Data Engineering
Agenda
Atlassian Overview
Summary
Adopting Databricks
Data Platform Challenges
Scaling and Modernizing Data Platform with Databricks
Growth over the last 5 years
Data volume grew 20x (now multiple petabytes)
5x growth in the number of internal users
5x growth in the number of events per day (now billions)
Atlassian Data Architecture (Before Databricks)
Key Challenges with Legacy Architecture
Development
Cross-team dependencies
Cluster management
Collaboration
Prepping for Scale
Self-service
Standardization
Automation
Agility
Cost Optimization
Current Atlassian Data Architecture
Our Success Story
Rapid Development: reduced development time
Collaboration: increased team and project efficiency with simplified sharing and co-authoring
Scaling: supported growth while reducing infrastructure cost
Self-Service: removed the data engineering dependency for Analytics and Data Science teams
Adopting Databricks at Atlassian
Building Data Pipelines
Orchestration
Leveraging Databricks Delta
Databricks for Analytics and Data Science
Building Data Pipelines
Data Pipelines with Databricks
Data Pipelines using Notebooks
Data Pipelines using DB-Connect
Development using Databricks Notebook
[Diagram, on AWS Cloud: Jira Ticket, Bitbucket Branch, Command Line (Import/Export), Databricks Workspace, Databricks Notebook, Databricks Cluster (Interactive Cluster, Ephemeral Cluster)]
Multi-stage Envs using Databricks Workspaces
[Diagram: Databricks Notebook, Databricks Workspace, Local/Development (Dev Folder), Bitbucket CICD Pipeline, Stage/Production (Stg Folder, Prod Folder, Stg Cluster, Prod Cluster)]
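The folder-per-environment layout on this slide could be driven by a small helper along these lines. This is a minimal sketch, not the presenters' script: the folder paths (`/Dev`, `/Stg`, `/Prod`), the `notebooks` source directory and both function names are hypothetical, and the command assumes the legacy Databricks CLI, which provides a `workspace import_dir` subcommand.

```python
from typing import List

# Hypothetical mapping from environment name to workspace folder.
WORKSPACE_FOLDERS = {
    "dev": "/Dev",
    "stg": "/Stg",
    "prod": "/Prod",
}


def workspace_folder(env: str) -> str:
    """Return the workspace folder for an environment, or raise on unknown names."""
    try:
        return WORKSPACE_FOLDERS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


def import_command(env: str, source_dir: str = "notebooks") -> List[str]:
    """Build (but do not run) a CLI call that copies local notebooks to the folder."""
    return [
        "databricks", "workspace", "import_dir",
        "--overwrite", source_dir, workspace_folder(env),
    ]


if __name__ == "__main__":
    # Show the command that a CI step might run for the staging environment.
    print(" ".join(import_command("stg")))
```

Building the argument list separately from running it keeps the environment mapping easy to check before anything touches a workspace.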
Bitbucket CICD Pipeline
branches:
  main:
    - step:
        name: Check configuration file
        deployment: test
        script:
          - pip install -r requirements.txt
          - 'yamllint -d "{extends: default, rules: {}}" config.yaml'
          - python databricks_cicd/check_duplicates.py
    - step:
        name: Move code to Databricks
        deployment: production
        caches:
          - pip
        script:
          - pip install -r requirements.txt
          - bash databricks_cicd/move_code_to_databricks.sh prod
    - step:
        name: Update the job in Databricks
        script:
          - pip install -r requirements.txt
          - python databricks_cicd/configure_job_in_databricks.py
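The first pipeline step runs `check_duplicates.py`, whose contents are not shown in the deck. One plausible reading is a guard against the same job being defined twice in `config.yaml`. The sketch below illustrates that idea only; the `jobs` list, the `name` key and the function name are assumptions, and the input is taken to be the already-parsed configuration.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_job_names(config: Dict) -> List[str]:
    """Return job names that appear more than once, in sorted order."""
    names = [job["name"] for job in config.get("jobs", [])]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Hypothetical parsed contents of config.yaml.
    sample = {
        "jobs": [
            {"name": "daily_sales"},
            {"name": "weekly_churn"},
            {"name": "daily_sales"},
        ]
    }
    print("duplicate job names:", find_duplicate_job_names(sample))
```

In a CI step, a non-empty result would typically be turned into a non-zero exit code so the deployment steps that follow do not run.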
Development using DB-Connect Library
[Diagram, on AWS Cloud: Jira Ticket, Bitbucket Branch, Local IDE, Pull Request/Merge, db-connect, Databricks Cluster (Interactive Cluster, Ephemeral Cluster)]
Multi-stage Envs using AWS S3
[Diagram: Local IDE, Databricks Cluster, Local/Development (Dev Bucket), Bitbucket CICD Pipeline, Docker, Stage/Production (Stg Bucket, Prod Bucket, Stg Cluster, Prod Cluster)]
Orchestration
Orchestration using Airflow
[Diagram: Airflow on Kubernetes, SparkSubmit Task, Notebook Task, Code on S3, Notebook, Databricks Workspace, YODA (In-house Data Quality Platform), SignalFx, Opsgenie On-Call, Slack Notification]
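The two task types on this slide correspond to the `notebook_task` and `spark_submit_task` fields of a Databricks one-time run request, which is also the payload shape that Airflow's `DatabricksSubmitRunOperator` accepts through its `json` argument. The sketch below only builds those payloads. The helper names, the run-name convention and all sample values are hypothetical, and pairing the SparkSubmit task with code on S3 is an assumption based on the slide labels.

```python
from typing import Dict, List, Optional


def notebook_run(notebook_path: str, cluster: Dict,
                 params: Optional[Dict[str, str]] = None) -> Dict:
    """Payload for a run that executes a notebook stored in the workspace."""
    return {
        "run_name": f"notebook:{notebook_path}",
        "new_cluster": cluster,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params or {},
        },
    }


def spark_submit_run(app_uri: str, cluster: Dict,
                     args: Optional[List[str]] = None) -> Dict:
    """Payload for a run that spark-submits an application, e.g. one stored on S3."""
    return {
        "run_name": f"spark-submit:{app_uri}",
        "new_cluster": cluster,
        "spark_submit_task": {"parameters": [app_uri, *(args or [])]},
    }


if __name__ == "__main__":
    # Placeholder cluster spec; real values depend on the workspace.
    cluster = {
        "spark_version": "<spark-version>",
        "node_type_id": "<node-type>",
        "num_workers": 2,
    }
    print(notebook_run("/Prod/example_notebook", cluster, {"run_date": "2021-01-01"}))
    print(spark_submit_run("s3://<bucket>/jobs/example_job.py", cluster, ["--env", "prod"]))
```

Each run gets its own `new_cluster` block, which matches the ephemeral-cluster pattern shown on the development slides.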
Tracking Resource Usage and Cost
Job Metadata
'custom_tags': {
    'business_unit': 'Data Engineering',
    'environment': cluster_env,
    'pipeline': 'Team_name',
    'user': 'airflow',
    'resource_owner': '<resource_owner>',
    'service_name': '<service-name>'
}
[Diagram: Databricks Job, Job Metadata, Data Lake, Ad Hoc Reporting]
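To illustrate how tags like these support cost tracking, the sketch below attaches them to a cluster specification through the `custom_tags` field and then totals cost per tag value. The function names, the usage-record layout (`custom_tags` plus a numeric `cost`) and the sample values are assumptions for illustration; the deck does not show how the metadata is actually aggregated.

```python
from collections import defaultdict
from typing import Dict, Iterable


def build_custom_tags(cluster_env: str, pipeline: str,
                      resource_owner: str, service_name: str) -> Dict[str, str]:
    """Assemble the tag set shown on the slide for one job cluster."""
    return {
        "business_unit": "Data Engineering",
        "environment": cluster_env,
        "pipeline": pipeline,
        "user": "airflow",
        "resource_owner": resource_owner,
        "service_name": service_name,
    }


def with_tags(cluster_spec: Dict, tags: Dict[str, str]) -> Dict:
    """Return a copy of the cluster spec with the tags merged into custom_tags."""
    merged = dict(cluster_spec)
    merged["custom_tags"] = {**cluster_spec.get("custom_tags", {}), **tags}
    return merged


def cost_by_tag(records: Iterable[Dict], tag: str) -> Dict[str, float]:
    """Sum the cost of hypothetical usage records, grouped by one tag's value."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["custom_tags"].get(tag, "untagged")] += record["cost"]
    return dict(totals)


if __name__ == "__main__":
    tags = build_custom_tags("prod", "team_a", "<resource_owner>", "<service-name>")
    print(with_tags({"num_workers": 2}, tags))
```

Because the tags travel with the cluster, a report can group spend by `pipeline`, `environment` or `resource_owner` without any change to the jobs themselves.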
Leveraging Databricks Delta
Delta: Time Travel, Merge, Auto-optimize
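The three Delta features named on this slide each have a SQL form on Databricks. The helpers below are a sketch that only builds the statements as strings, which could then be passed to `spark.sql(...)` on a cluster. The function names, table names and key column are hypothetical; the deck does not show the statements Atlassian runs.

```python
def time_travel_query(table: str, version: int) -> str:
    """Read a table as it was at an earlier version (time travel)."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def merge_statement(target: str, source: str, key: str) -> str:
    """Upsert rows from source into target, matching on a single key column."""
    return (
        f"MERGE INTO {target} AS t USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def auto_optimize_statement(table: str) -> str:
    """Enable optimized writes and auto compaction through table properties."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


if __name__ == "__main__":
    print(time_travel_query("sales", 3))
    print(merge_statement("sales", "sales_updates", "order_id"))
    print(auto_optimize_statement("sales"))
```

The identifiers are interpolated directly, so these helpers suit trusted, hard-coded names only and should not receive user input.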
Databricks for Analytics and Data Science
Analytics Use Cases
Exploratory and root cause analysis
Analysis for Strategic Decisions
POC for new metrics and business logic
Creating and refreshing ad-hoc datasets
Team Onboarding Templates
Big Wins: Analytics
Self-service, Collaboration
Data Science Use Cases
Exploration, Sizing
Feature generation
Model training
Scoring
Experiments
Analyzing results
Model serving
Big Wins: Data Science
Faster local stack to cloud cycle
No infrastructure overhead
Increased ML adoption across teams
Governance & Tracking
Summary
Key Takeaways
Delivery time reduced by 30%
Decreased infrastructure costs by 60%
Databricks used by 50% of all Atlassians
Reduced Data team dependencies by more than 70%
Thank you!