Scaling and Modernizing Data Platform with Databricks


This document summarizes Atlassian's adoption of Databricks to manage their growing data pipelines and platforms. It discusses the challenges they faced with their previous architecture around development time, collaboration, and costs. With Databricks, Atlassian was able to build scalable data pipelines using notebooks and connectors, orchestrate workflows with Airflow, and provide self-service analytics and machine learning to teams while reducing infrastructure costs and data engineering dependencies. The key benefits included reduced development time by 30%, decreased infrastructure costs by 60%, and increased adoption of Databricks and self-service across teams.


Managing & Scaling Data Pipelines with Databricks
Esha Shah
Senior Data Engineer
ATLASSIAN
Go-To-Market Data Engineering
Richa Singhal
Senior Data Engineer

Agenda
Atlassian Overview
Data Platform Challenges
Adopting Databricks
Summary


Growth over the last 5 years
Data volume is now 20x larger (multiple petabytes)
5x growth in the number of internal users
5x the number of events per day (billions)

Atlassian Data Architecture (Before Databricks)

Key Challenges with Legacy Architecture
Development
Cross-team dependencies
Cluster management
Collaboration

Prepping for Scale
Self-service
Standardization
Automation
Agility
Cost Optimization

Our Success Story
Rapid Development: Reduced development time
Collaboration: Increased team and project efficiency with simplified sharing and co-authoring
Scaling: Supported growth while reducing infrastructure cost
Self Service: Removed the data engineering dependency for Analytics and Data Science teams

Adopting Databricks at Atlassian
Building Data Pipelines
Orchestration
Leveraging Databricks Delta
Databricks for Analytics and Data Science

Data Pipelines with Databricks
Data Pipelines using Notebooks
Data Pipelines using DB-Connect

Development using Databricks Notebook
[Diagram. Labels: Jira Ticket, Bitbucket Branch, Command Line Import/Export, Databricks Workspace, Databricks Notebook, Databricks Cluster (Interactive Cluster, Ephemeral Cluster), AWS Cloud]

Multi-stage Envs using Databricks Workspaces
[Diagram. Labels: Local/Development, Stage/Production, Databricks Notebook, Databricks Workspace (Dev Folder, Stg Folder, Prod Folder), Bitbucket CICD Pipeline, Stg Cluster, Prod Cluster]

Bitbucket CICD Pipeline

    branches:
      main:
        - step:
            name: Check configuration file
            deployment: test
            script:
              - pip install -r requirements.txt
              - 'yamllint -d "{extends: default, rules: {}}" config.yaml'
              - python databricks_cicd/check_duplicates.py
        - step:
            name: Move code to Databricks
            deployment: production
            caches:
              - pip
            script:
              - pip install -r requirements.txt
              - bash databricks_cicd/move_code_to_databricks.sh prod
        - step:
            name: Update the job in Databricks
            script:
              - pip install -r requirements.txt
              - python databricks_cicd/configure_job_in_databricks.py
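The deck does not show what `check_duplicates.py` does. As a minimal sketch, assuming `config.yaml` parses into a list of job entries that each carry a `name` key (a hypothetical layout, as is the function name), a duplicate check could look like this:

```python
from collections import Counter


def find_duplicate_names(jobs):
    """Return job names that appear more than once, in sorted order.

    `jobs` is assumed to be a list of dicts with a "name" key; this layout
    is hypothetical and not taken from the slides.
    """
    counts = Counter(job["name"] for job in jobs)
    return sorted(name for name, count in counts.items() if count > 1)


# Example entries standing in for the parsed contents of config.yaml.
example_jobs = [
    {"name": "daily_revenue"},
    {"name": "user_events"},
    {"name": "daily_revenue"},
]
duplicates = find_duplicate_names(example_jobs)
print(duplicates)
```

In a CI step, a script like this would exit with a non-zero status when the list is non-empty, which fails the pipeline before any code is moved to Databricks.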

Development using DB-Connect Library
[Diagram. Labels: Jira Ticket, Bitbucket Branch, Pull Request/Merge, Local IDE, db-connect, Databricks Cluster (Interactive Cluster, Ephemeral Cluster), AWS Cloud]

Multi-stage Envs using AWS S3
[Diagram. Labels: Local/Development, Stage/Production, Local IDE, Databricks Cluster, Dev Bucket, Bitbucket CICD Pipeline, Docker, Stg Bucket, Prod Bucket, Stg Cluster, Prod Cluster]
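One way to keep the three environments apart is to resolve the target bucket from a single environment name, so the same deployment code serves dev, stg, and prod. The sketch below assumes one bucket per environment; the bucket names and the `artifact_uri` helper are hypothetical, since the slides name only the roles of the buckets.

```python
# Hypothetical bucket names; the slides only name the roles
# (dev, stg, and prod bucket), not the actual identifiers.
BUCKETS = {
    "dev": "example-dev-bucket",
    "stg": "example-stg-bucket",
    "prod": "example-prod-bucket",
}


def artifact_uri(env, artifact_path):
    """Build the S3 URI where a code artifact is published for `env`."""
    if env not in BUCKETS:
        raise ValueError(f"Unknown environment: {env!r}")
    return f"s3://{BUCKETS[env]}/{artifact_path.lstrip('/')}"


print(artifact_uri("stg", "jobs/etl.py"))
```

Failing loudly on an unknown environment keeps a typo in a pipeline variable from silently publishing code to the wrong bucket.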

Orchestration using Airflow
[Diagram. Labels: Airflow on Kubernetes, SparkSubmit Task, Code on S3, Notebook Task, Notebook, Databricks Workspace, YODA (In-house Data Quality Platform), SignalFx, Opsgenie On-Call, Slack Notification]
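The diagram labels suggest two task types: notebook tasks that run a notebook from the Databricks workspace, and SparkSubmit tasks that run code stored on S3. A minimal sketch of the request bodies such tasks could send to the Databricks Jobs runs-submit endpoint follows. The helper names, cluster values, and paths are illustrative assumptions, not Atlassian's configuration.

```python
def notebook_run(notebook_path, cluster_spec):
    """Payload for a one-time run that executes a workspace notebook."""
    return {
        "new_cluster": cluster_spec,
        "notebook_task": {"notebook_path": notebook_path},
    }


def spark_submit_run(script_uri, cluster_spec, args=()):
    """Payload for a one-time run that spark-submits code stored on S3."""
    return {
        "new_cluster": cluster_spec,
        "spark_submit_task": {"parameters": [script_uri, *args]},
    }


# Placeholder cluster settings; real values depend on the workspace.
example_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

notebook_payload = notebook_run("/Prod/example_notebook", example_cluster)
submit_payload = spark_submit_run(
    "s3://example-prod-bucket/jobs/etl.py", example_cluster, ["--date", "2021-01-01"]
)
```

With the Airflow Databricks provider, a dict of this shape is typically passed as the `json` argument of `DatabricksSubmitRunOperator`, so each Airflow task starts its own ephemeral cluster.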

Tracking Resource Usage and Cost
Job Metadata

    'custom_tags': {
        'business_unit': 'Data Engineering',
        'environment': cluster_env,
        'pipeline': 'Team_name',
        'user': 'airflow',
        'resource_owner': '<resource_owner>',
        'service_name': '<service-name>'
    }

[Diagram. Labels: Databricks Job, Data Lake, Ad Hoc Reporting]
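Tagging every job cluster the same way is what makes cost reporting possible later: usage can be grouped by any tag key. The sketch below builds the tag block from the slide and sums cost per tag value. The `cost_by_tag` helper and the shape of the usage records are hypothetical; the deck does not describe how the usage data is stored in the data lake.

```python
from collections import defaultdict


def build_custom_tags(cluster_env, pipeline, resource_owner, service_name):
    """Return the custom_tags block from the slide for one job cluster."""
    return {
        "business_unit": "Data Engineering",
        "environment": cluster_env,
        "pipeline": pipeline,
        "user": "airflow",
        "resource_owner": resource_owner,
        "service_name": service_name,
    }


def cost_by_tag(usage_records, tag):
    """Sum the cost of usage records, grouped by the value of one tag.

    Each record is assumed to look like {"custom_tags": {...}, "cost": 1.5};
    this record shape is hypothetical.
    """
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["custom_tags"].get(tag, "untagged")] += record["cost"]
    return dict(totals)


# Made-up usage records for illustration only.
records = [
    {"custom_tags": build_custom_tags("prod", "team_a", "owner_a", "svc_a"), "cost": 1.5},
    {"custom_tags": build_custom_tags("prod", "team_a", "owner_a", "svc_b"), "cost": 2.5},
    {"custom_tags": build_custom_tags("stg", "team_b", "owner_b", "svc_c"), "cost": 1.0},
]
print(cost_by_tag(records, "pipeline"))
```

Grouping by `pipeline` gives a per-team view, while grouping by `environment` separates production spend from staging.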

Databricks for Analytics and Data Science

Analytics Use Cases
Exploratory and root cause analysis
Analysis for Strategic Decisions
POC for new metrics and business logic
Creating and refreshing ad-hoc datasets
Team Onboarding Templates

Big Wins: Analytics
Self-service
Collaboration

Data Science Use Cases
Exploration, Sizing
Feature generation
Model training
Scoring
Experiments
Analyzing results
Model serving

Big Wins: Data Science
Faster local stack to cloud cycle
No infrastructure overhead
Increased ML adoption across teams
Governance & Tracking

Key Takeaways
Delivery time reduced by 30%
Decreased infrastructure costs by 60%
Databricks used by 50% of all Atlassians
Reduced Data team dependencies by more than 70%




Future of Data Platform in Cloud Native worldSrivatsan Srinivasan

The document discusses the future of data platforms in a cloud-native era, questioning whether Hadoop has become obsolete due to its performance issues and complex maintenance. It outlines the challenges faced by big data platforms, such as infrastructure management, dependency and version management, and the need for hybrid cloud portability. The document concludes with proposed solutions, including decoupling compute and storage, utilizing Kubernetes operators, and establishing a common runtime layer across environments to enhance efficiency and address current data needs.

2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08

The document discusses the challenges and opportunities associated with large distributed computing infrastructures and big data management. It highlights the exponential growth of digital data, the distinction between structured and unstructured data, and the necessity for companies to rethink their data handling strategies using cloud computing and advanced analytics. Key recommendations include treating big data projects as business mandates, leveraging human intelligence, and integrating various technologies and systems to improve data analysis while addressing the skills shortage in the analytics field.

Scaling and Modernizing Data Platform with Databricks

1. Managing & Scaling Data Pipelines with Databricks. Esha Shah, Senior Data Engineer, and Richa Singhal, Senior Data Engineer; Go-To-Market Data Engineering, Atlassian

2. Agenda: Atlassian Overview; Data Platform Challenges; Adopting Databricks; Summary

4. Growth over the last 5 years: data volume is now 20x (multi-petabyte); 5x growth in the number of internal users; 5x the number of events per day (billions)

5. Atlassian Data Architecture (Before Databricks)

6. Key Challenges with Legacy Architecture: Development; Cross-team dependencies; Cluster management; Collaboration

7. Prepping for Scale: Self-service; Standardization; Automation; Agility; Cost Optimization

8. Current Atlassian Data Architecture

9. Our Success Story. Rapid Development: reduced development time. Collaboration: increased team and project efficiency with simplified sharing and co-authoring. Scaling: supported growth while reducing infrastructure cost. Self-service: removed the data engineering dependency for Analytics and Data Science teams.

10. Adopting Databricks at Atlassian: Building Data Pipelines; Orchestration; Leveraging Databricks Delta; Databricks for Analytics and Data Science

11. Building Data Pipelines

12. Data Pipelines with Databricks: Data Pipelines using Notebooks; Data Pipelines using DB-Connect

13. Development using Databricks Notebook (diagram labels: AWS Cloud; Interactive Cluster; Ephemeral Cluster; Bitbucket Branch; Databricks Workspace; Import/Export; Jira Ticket; Command Line; Databricks Notebook; Databricks Cluster)

14. Multi-stage Envs using Databricks Workspaces (diagram labels: Databricks Notebook; Databricks Workspace; Dev Folder; Local/Development; Stage/Production; Bitbucket CICD Pipeline; Stg Folder; Prod Folder; Stg Cluster; Prod Cluster)

15. Bitbucket CICD Pipeline

    pipelines:
      branches:
        main:
          - step:
              name: Check configuration file
              deployment: test
              script:
                - pip install -r requirements.txt
                - 'yamllint -d "{extends: default, rules: {}}" config.yaml'
                - python databricks_cicd/check_duplicates.py
          - step:
              name: Move code to Databricks
              deployment: production
              caches:
                - pip
              script:
                - pip install -r requirements.txt
                - bash databricks_cicd/move_code_to_databricks.sh prod
          - step:
              name: Update the job in Databricks
              script:
                - pip install -r requirements.txt
                - python databricks_cicd/configure_job_in_databricks.py
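
The slides do not show what the "Move code to Databricks" step does internally. As a rough sketch, assuming the script pushes notebook source into a per-environment workspace folder through the Databricks Workspace API (`POST /api/2.0/workspace/import`), the request body could be built as follows. The folder paths in `ENV_FOLDERS` and the helper name `build_import_payload` are hypothetical; only the Dev, Stg, and Prod folder names come from the deck.

```python
import base64

# Hypothetical mapping from deployment stage to workspace folder.
# The slides name Dev, Stg, and Prod folders but not their actual paths.
ENV_FOLDERS = {"dev": "/Dev", "stg": "/Stg", "prod": "/Prod"}


def build_import_payload(env: str, notebook_name: str, source: str) -> dict:
    """Build the JSON body for POST /api/2.0/workspace/import."""
    if env not in ENV_FOLDERS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "path": f"{ENV_FOLDERS[env]}/{notebook_name}",
        "format": "SOURCE",    # import the notebook as plain source code
        "language": "PYTHON",
        "overwrite": True,     # replace whatever the last deployment left
        # The API expects the file content base64-encoded.
        "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
    }


payload = build_import_payload("prod", "daily_load", "print('hello')")
print(payload["path"])
```

Keeping the payload construction separate from the HTTP call makes it possible to check the target path for each stage without touching a real workspace.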

16. Development using DB-Connect Library (diagram labels: AWS Cloud; Interactive Cluster; Ephemeral Cluster; Bitbucket Branch; Local IDE; Pull Request/Merge; db-connect; Jira Ticket; Databricks Cluster)

17. Multi-stage Envs using AWS S3 (diagram labels: Local IDE; Databricks Cluster; Dev Bucket; Local/Development; Stage/Production; Bitbucket CICD Pipeline; Docker; Stg Bucket; Prod Bucket; Stg Cluster; Prod Cluster)

18. Orchestration

19. Orchestration using Airflow (diagram labels: Airflow on Kubernetes; SparkSubmit Task; YODA, in-house Data Quality Platform; SignalFx; Opsgenie On-Call; Notebook Task; Slack Notification; Code on S3; Notebook; Databricks Workspace)
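
The diagram names two task types triggered from Airflow: a notebook task that runs a notebook from the Databricks workspace, and a spark-submit task that runs code stored on S3. A minimal sketch of the corresponding request bodies for the Databricks Jobs `runs/submit` endpoint is given below. The cluster values, paths, and function names are illustrative assumptions, not Atlassian's actual configuration.

```python
def base_cluster(num_workers: int = 2) -> dict:
    """Spec for an ephemeral job cluster; all values are placeholders."""
    return {
        "spark_version": "<spark-version>",  # placeholder runtime version
        "node_type_id": "<node-type>",       # placeholder instance type
        "num_workers": num_workers,
    }


def notebook_run(run_name: str, notebook_path: str, params: dict) -> dict:
    """Request body that runs a workspace notebook on a new cluster."""
    return {
        "run_name": run_name,
        "new_cluster": base_cluster(),
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params,
        },
    }


def spark_submit_run(run_name: str, s3_script: str, args: list) -> dict:
    """Request body that runs code stored on S3 through spark-submit."""
    return {
        "run_name": run_name,
        "new_cluster": base_cluster(),
        "spark_submit_task": {"parameters": [s3_script, *args]},
    }


print(notebook_run("nightly", "/Prod/daily_load", {"run_date": "2021-01-01"}))
```

In Airflow, a body of this shape can be handed to `DatabricksSubmitRunOperator` through its `json` argument, so the DAG code stays a thin wrapper around these dictionaries.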

20. Tracking Resource Usage and Cost (diagram labels: Job Metadata; Data Lake; Ad Hoc Reporting; Databricks Job)

    'custom_tags': {
        'business_unit': 'Data Engineering',
        'environment': cluster_env,
        'pipeline': 'Team_name',
        'user': 'airflow',
        'resource_owner': '<resource_owner>',
        'service_name': '<service-name>'
    }
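
One way tags like these could support ad hoc cost reporting is to roll up usage records by tag once the job metadata lands in the data lake. The sketch below assumes a hypothetical record shape (a `custom_tags` dict plus a `cost_usd` field) and made-up sample values; the slides do not describe the actual schema.

```python
from collections import defaultdict


def make_custom_tags(cluster_env: str, pipeline: str, owner: str, service: str) -> dict:
    """Fill in the tag template from the slide for one job cluster."""
    return {
        "business_unit": "Data Engineering",
        "environment": cluster_env,
        "pipeline": pipeline,
        "user": "airflow",
        "resource_owner": owner,
        "service_name": service,
    }


def cost_by_tag(records: list, tag: str) -> dict:
    """Sum the (hypothetical) cost_usd field per value of one tag."""
    totals = defaultdict(float)
    for rec in records:
        key = rec["custom_tags"].get(tag, "untagged")
        totals[key] += rec["cost_usd"]
    return dict(totals)


# Made-up sample records, for illustration only.
records = [
    {"custom_tags": make_custom_tags("prod", "gtm", "team-a", "svc-1"), "cost_usd": 12.5},
    {"custom_tags": make_custom_tags("prod", "finance", "team-b", "svc-2"), "cost_usd": 7.5},
    {"custom_tags": make_custom_tags("stg", "gtm", "team-a", "svc-1"), "cost_usd": 2.0},
]
print(cost_by_tag(records, "pipeline"))
```

Because every cluster carries the same tag keys, the same function can group by `environment`, `resource_owner`, or `service_name` without changes.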

21. Leveraging Databricks Delta

22. Delta: Time Travel; Merge; Auto-optimize
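
The slide lists these three Delta features without examples. As an illustration only, the helpers below assemble the kind of Delta Lake SQL each feature corresponds to on Databricks. The table names, key column, and helper names are hypothetical, and on a cluster the resulting strings would be passed to `spark.sql(...)`.

```python
def time_travel_query(table: str, version: int) -> str:
    """Read a Delta table as it was at an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def merge_statement(target: str, source: str, key: str) -> str:
    """Upsert rows from source into target, matching on one key column."""
    return (
        f"MERGE INTO {target} AS t USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def enable_auto_optimize(table: str) -> str:
    """Turn on optimized writes and auto compaction for one table."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


# Hypothetical table names, for illustration only.
print(time_travel_query("sales.orders", 3))
print(merge_statement("sales.orders", "staging.orders_updates", "order_id"))
print(enable_auto_optimize("sales.orders"))
```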

23. Databricks for Analytics and Data Science

24. Analytics Use Cases: Exploratory and root cause analysis; Analysis for Strategic Decisions; POC for new metrics and business logic; Creating and refreshing ad-hoc datasets; Team Onboarding Templates

25. Big Wins, Analytics: Self-service; Collaboration

26. Data Science Use Cases: Exploration, Sizing; Feature generation; Model training; Scoring; Experiments; Analyzing results; Model serving

27. Big Wins, Data Science: Faster local stack to cloud cycle; No infrastructure overhead; Increased ML adoption across teams; Governance & Tracking

28. Summary

29. Key Takeaways: Delivery time reduced by 30%; Infrastructure costs decreased by 60%; Databricks used by 50% of all Atlassians; Data team dependencies reduced by more than 70%

30. Thank you!

31. Feedback: Your feedback is important to us. Don't forget to rate and review the sessions.
