Building Modern Data Applications Using Databricks Lakehouse: Develop, optimize, and monitor data pipelines on Databricks

eBook: €23.99 (list price €26.99)
Paperback: €33.99
Subscription: Free Trial, renews at €18.99 p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want
  • AI Assistant (beta) to help accelerate your learning


An Introduction to Delta Live Tables

In this chapter, we will examine how the data industry has evolved over the last several decades. We’ll also look at why real-time data processing is closely tied to how quickly a business can react to the latest signals in its data. We’ll address why building your own streaming solution from scratch may not be sustainable, and why its maintenance does not scale easily over time. By the end of the chapter, you should have a clear understanding of the types of problems the Delta Live Tables (DLT) framework solves and the value the framework brings to data engineering teams.

In this chapter, we’re going to cover the following main topics:

  • The emergence of the lakehouse
  • The importance of real-time data in the lakehouse
  • The maintenance predicament of a streaming application
  • What is the Delta Live Tables framework?
  • How are Delta Live Tables related to Delta Lake?
  • An introduction to Delta Live Tables concepts...

Technical requirements

It’s recommended to have access to a Databricks premium workspace to follow along with the code examples at the end of the chapter, as well as workspace permissions to create an all-purpose cluster and a DLT pipeline using a cluster policy. You will create and attach a notebook to a cluster and execute the notebook cells. All code samples can be downloaded from this chapter’s GitHub repository, located at https://github.com/PacktPublishing/Building-Modern-Data-Applications-Using-Databricks-Lakehouse/tree/main/chapter01. This chapter creates and runs a new DLT pipeline using the Core product edition; as a result, the pipeline is estimated to consume around 5–10 Databricks Units (DBUs).

The emergence of the lakehouse

During the early 1980s, the data warehouse was a great tool for processing structured data. Combined with the right indexing methods, data warehouses allowed us to serve business intelligence (BI) reports at blazing speeds. However, after the turn of the century, data warehouses could not keep up with newer data formats such as JSON, as well as new data modalities such as audio and video. Simply put, data warehouses struggled to process the semi-structured and unstructured data that most businesses used. Additionally, data warehouses struggled to scale to the millions or billions of rows that became common in the new information era of the early 2000s. Overnight batch data processing jobs soon began running into BI reports scheduled to refresh during the early morning business hours.

At the same time, cloud computing became a popular choice among organizations because it provided enterprises with an elastic computing capacity that could quickly grow or shrink, based on the current...

The maintenance predicament of a streaming application

Spark Structured Streaming provides near-real-time stream processing with fault tolerance and exactly-once processing guarantees, through a DataFrame API that is nearly identical to the one used for batch processing in Spark. Because the DataFrame API is shared, data engineering teams can convert existing batch Spark workloads to streaming with minimal effort. However, as the volume of data increases and the number of ingestion sources and data pipelines naturally grows over time, data engineering teams face the burden of augmenting existing data pipelines to keep up with new data transformations or changing business logic. In addition, Structured Streaming brings extra configuration to maintain, such as updating checkpoint locations, managing watermarks and triggers, and even backfilling tables when a significant data change or data correction occurs. Advanced data engineering teams may even be expected to build data validation and...
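
To make that maintenance burden concrete, here is a minimal sketch of a hand-rolled Structured Streaming job, assuming hypothetical paths, a hypothetical schema, and a hypothetical output table; every checkpoint location, watermark, and trigger setting shown is configuration the team has to track and evolve by hand.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("orders-stream").getOrCreate()

    # Hypothetical landing path and schema for incoming order events.
    orders = (
        spark.readStream
        .format("json")
        .schema("order_id STRING, amount DOUBLE, event_time TIMESTAMP")
        .load("/data/incoming/orders")
    )

    # The watermark bounds how long aggregation state is kept for late-arriving events.
    hourly_revenue = (
        orders
        .withWatermark("event_time", "10 minutes")
        .groupBy(window(col("event_time"), "1 hour"))
        .sum("amount")
    )

    # Checkpoint location and trigger interval are per-query settings; changing the
    # transformation logic often means migrating or resetting the checkpoint, and
    # backfills must be orchestrated separately.
    (
        hourly_revenue.writeStream
        .outputMode("append")
        .option("checkpointLocation", "/chk/hourly_revenue")   # hypothetical path
        .trigger(processingTime="1 minute")
        .toTable("analytics.hourly_revenue")                   # hypothetical table
    )

Contrast this with the DLT approach introduced in the next section, where these operational details are managed by the framework.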

What is the DLT framework?

DLT is a declarative framework that aims to simplify the development and maintenance of a data pipeline by abstracting away much of the boilerplate complexity. For example, rather than specifying how to transform, enrich, and validate data step by step, data engineers declare what transformations to apply to newly arriving data. Furthermore, DLT provides support for enforcing data quality, preventing a data lake from becoming a data swamp. DLT gives data teams the ability to choose how to handle poor-quality data, whether that means printing a warning message to the system logs, dropping invalid data, or failing a data pipeline run altogether. Lastly, DLT automatically handles the mundane data engineering tasks of maintaining optimized data file sizes for the underlying tables, as well as cleaning up obsolete data files that are no longer referenced in the Delta transaction log (Optimize and Vacuum operations are covered later in the A quick Delta Lake primer...
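
As a rough illustration of this declarative style, the sketch below defines a single DLT dataset with data quality expectations; the dataset and column names are invented for the example rather than taken from the book.

    import dlt

    @dlt.table(comment="Orders cleaned and validated on arrival")
    @dlt.expect("non_negative_amount", "amount >= 0")                  # log violations, keep rows
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")      # drop offending rows
    @dlt.expect_or_fail("valid_event_time", "event_time IS NOT NULL")  # fail the pipeline update
    def orders_clean():
        # 'orders_raw' is assumed to be another dataset defined in the same pipeline.
        return dlt.read("orders_raw").select("order_id", "amount", "event_time")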

How is DLT related to Delta Lake?

The DLT framework relies heavily on the Delta Lake format to incrementally process data at every step. For example, streaming tables and materialized views defined in a DLT pipeline are each backed by a Delta table. Features that make Delta Lake an ideal storage format for a streaming pipeline include support for Atomicity, Consistency, Isolation, and Durability (ACID) transactions, so that concurrent data modifications such as inserts, updates, and deletions can be incrementally applied to a streaming table. Delta Lake also features scalable metadata handling, allowing it to easily scale to petabytes and beyond. If data is computed incorrectly, Delta Lake offers time travel – the ability to restore a table to a previous snapshot. Lastly, Delta Lake inherently tracks audit information in each table’s transaction log. Provenance information such as what type of operation modified the table, by what cluster...
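
For instance, the audit and time travel features described above can be exercised with a few lines of Delta Lake SQL from a Databricks notebook (where spark is predefined); the table name is hypothetical and the version numbers are purely illustrative.

    # Audit/provenance: every commit is recorded in the table's transaction log.
    spark.sql("DESCRIBE HISTORY demo.sales.orders").show(truncate=False)

    # Time travel: query the table as it looked at an earlier version.
    spark.sql("SELECT * FROM demo.sales.orders VERSION AS OF 5").show()

    # Recover from an incorrect computation by restoring a previous snapshot.
    spark.sql("RESTORE TABLE demo.sales.orders TO VERSION AS OF 5")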

Introducing DLT concepts

The DLT framework automatically manages task orchestration, cluster creation, and exception handling, allowing data engineers to focus on defining transformations, data enrichment, and data validation logic. Data engineers define a data pipeline using one or more dataset types, and under the hood, the DLT system determines how to keep those datasets up to date. A data pipeline using the DLT framework is made up of the streaming table, materialized view, and view dataset types, which we’ll discuss in detail in the following sections. We’ll also briefly discuss how to visualize the pipeline, view its triggering method, and look at the entire pipeline data flow from a bird’s-eye view, and we’ll briefly cover the different types of Databricks compute and runtimes, and Unity Catalog. Let’s go ahead and get started.

Streaming tables

Streaming tables leverage the benefits of Delta Lake and Spark Structured Streaming...
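
As a minimal sketch, a streaming table can be declared by returning a streaming read from a DLT dataset function; the landing path below is a hypothetical example, and Auto Loader (the cloudFiles source) is used to pick up new files incrementally.

    import dlt

    @dlt.table(comment="Raw orders ingested incrementally as new files land")
    def orders_raw():
        return (
            spark.readStream
            .format("cloudFiles")                  # Databricks Auto Loader
            .option("cloudFiles.format", "json")
            .load("/Volumes/demo/landing/orders")  # hypothetical landing path
        )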

A quick Delta Lake primer

Delta Lake is a big data storage protocol built around a multi-version transaction log that provides ACID transactions, schema enforcement, time travel, data file management, and other performance features on top of existing data files in a lakehouse.

Originally, big data architectures had many concurrent processes that both read and modified data, leading to data corruption and even data loss. As previously mentioned, a two-pronged Lambda architecture was created, providing a layer of isolation between processes that applied streaming updates to data and downstream processes that needed a consistent snapshot of the data, such as BI workloads that generated daily reports or refreshed dashboards. However, these Lambda architectures duplicated data to support these batch and streaming workloads, leading to inconsistent data changes that needed to be reconciled at the end of each business day.

Fortunately, the Delta Lake format provides a...

A hands-on example – creating your first Delta Live Tables pipeline

In this section, we’ll use an NYC taxi sample dataset to declare a data pipeline using the DLT framework and apply a basic transformation to enrich the data.

Important note

To get the most value out of this section, it’s recommended to have Databricks workspace permissions to create an all-purpose cluster and a DLT pipeline using a cluster policy. In this section, you will attach a notebook to a cluster, execute notebook cells, and create and run a new DLT pipeline.

Let’s start by creating a new all-purpose cluster. Navigate to the Databricks Compute UI by selecting the Compute button from the sidebar navigation on the left side.

Figure 1.7 – Navigate to the Compute UI from the left-hand sidebar

Click the button titled Create compute at the top right. Next, provide a name for the cluster. For this exercise, the cluster can be a small...
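
To give a sense of where the walkthrough is headed, here is a hedged sketch of a two-step DLT pipeline over NYC taxi data; it assumes the samples.nyctaxi.trips sample table and its trip_distance and fare_amount columns, and the enrichment logic and expectation are illustrative rather than the book’s exact code.

    import dlt
    from pyspark.sql.functions import col

    @dlt.table(comment="NYC taxi trips read from the Databricks sample catalog")
    def taxi_raw():
        return spark.read.table("samples.nyctaxi.trips")

    @dlt.table(comment="Trips enriched with a simple fare-per-mile metric")
    @dlt.expect_or_drop("positive_distance", "trip_distance > 0")
    def taxi_enriched():
        return (
            dlt.read("taxi_raw")
            .withColumn("fare_per_mile", col("fare_amount") / col("trip_distance"))
        )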

Summary

In this chapter, we examined how and why the data industry has settled on the lakehouse architecture, which aims to merge the scalability of ETL processing with the speed of data warehousing for BI workloads under a single, unified architecture. We learned how real-time data processing is essential to uncovering value from the latest data as soon as it arrives, but also how real-time data pipelines can drag down the productivity of data engineering teams as complexity grows over time. Finally, we learned the core concepts of the Delta Live Tables framework and how, with just a few lines of PySpark code and function decorators, we can quickly declare a real-time data pipeline that is capable of incrementally processing data with high throughput and low latency.

In the next chapter, we’ll take a deep dive into the advanced settings of Delta Live Tables pipelines and how the framework will optimize the underlying datasets for us. Then, we’ll look at more advanced data transformations...


Key benefits

  • Learn how to work with real-time data using Delta Live Tables
  • Unlock insights into the performance of data pipelines using Delta Live Tables
  • Apply your knowledge to Unity Catalog for robust data security and governance
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

The sheer number of tools in today’s data engineering development stack, combined with growing operational complexity, often overwhelms data engineers, causing them to spend less time gleaning value from their data and more time maintaining complex data pipelines. Guided by a lead specialist solutions architect at Databricks with 10+ years of experience in data and AI, this book shows you how the Delta Live Tables framework simplifies data pipeline development by allowing you to focus on defining input data sources, transformation logic, and output table destinations. This book gives you an overview of the Delta Lake format, the Databricks Data Intelligence Platform, and the Delta Live Tables framework. It teaches you how to apply data transformations by implementing the Databricks medallion architecture and continuously monitor the data quality of your pipelines. You’ll learn how to handle incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks workflows. You’ll master how to recover from runtime errors automatically. By the end of this book, you’ll be able to build a real-time data pipeline from scratch using Delta Live Tables, leverage CI/CD tools to deploy data pipeline changes automatically across deployment environments, and monitor, control, and optimize cloud costs.

Who is this book for?

This book is for data engineers looking to streamline data ingestion, transformation, and orchestration tasks. Data analysts responsible for managing and processing lakehouse data for analysis, reporting, and visualization will also find this book beneficial. Additionally, DataOps/DevOps engineers will find this book helpful for automating the testing and deployment of data pipelines, optimizing table tasks, and tracking data lineage within the lakehouse. Beginner-level knowledge of Apache Spark and Python is needed to make the most out of this book.

What you will learn

  • Deploy near-real-time data pipelines in Databricks using Delta Live Tables
  • Orchestrate data pipelines using Databricks workflows
  • Implement data validation policies and monitor/quarantine bad data
  • Apply slowly changing dimension (SCD) Type 1 and Type 2 updates to lakehouse tables
  • Secure data access across different groups and users using Unity Catalog
  • Automate continuous data pipeline deployment by integrating Git with build tools such as Terraform and Databricks Asset Bundles

Product Details

Publication date: Oct 31, 2024
Length: 246 pages
Edition: 1st
Language: English
ISBN-13: 9781804612873
Vendor: Databricks



Table of Contents

Part 1: Near-Real-Time Data Pipelines for the Lakehouse
Chapter 1: An Introduction to Delta Live Tables
Chapter 2: Applying Data Transformations Using Delta Live Tables
Chapter 3: Managing Data Quality Using Delta Live Tables
Chapter 4: Scaling DLT Pipelines
Part 2: Securing the Lakehouse Using the Unity Catalog
Chapter 5: Mastering Data Governance in the Lakehouse with Unity Catalog
Chapter 6: Managing Data Locations in Unity Catalog
Chapter 7: Viewing Data Lineage Using Unity Catalog
Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring
Chapter 8: Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform
Chapter 9: Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment
Chapter 10: Monitoring Data Pipelines in Production
Index
Other Books You May Enjoy

Customer reviews

Rating distribution: 5.0 out of 5 (1 rating)
5 star: 100% | 4 star: 0% | 3 star: 0% | 2 star: 0% | 1 star: 0%

Joao Almeida, Jul 25, 2025 – 5 stars (Feefo verified review)
Great book - covers a lot of the essentials

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, save the PDF file to your machine and download Adobe Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment can be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, please contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book, go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.