Exploring SQL Aspects of Spark
Raviyanshu, Ayush
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
• Punctuality
Join the session 5 minutes before the start time. We start on time and conclude on time!
• Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
• Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session if you need to attend an urgent call.
• Avoid Disturbance
Avoid unwanted chit-chat during the session.
1. Apache Spark
2. Spark SQL Intro
3. Spark SQL Architecture
4. Spark SQL Features
5. Challenges & Limitations with Spark SQL
6. Advanced SQL Operations
   • Window
   • JOINS
   • UNION
   • PIVOT
   • UDF
7. Demo
What is Apache Spark?
• Open-source, distributed computing system designed for processing big data
• Core abstractions: Resilient Distributed Datasets (RDDs), DataFrames, Datasets
• Widely adopted for its speed, scalability, and ease of use
Spark SQL in Spark
• Module in Spark that provides a unified SQL interface to large-scale datasets
• Works with structured and semi-structured data
• Allows developers to issue ANSI SQL:2003-compatible queries
• Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors
Spark SQL Architecture
• At the core of Spark SQL are:
  − Catalyst optimizer
  − Project Tungsten
• Together they support the high-level APIs:
  − DataFrames
  − Datasets
  − SQL queries
• The Execution Engine is responsible for executing the optimized SQL queries on the cluster.
• The Result Handler collects the results of the executed queries and returns them to the user.
Spark SQL Arch...
The Catalyst Optimizer
Spark SQL Arch...
The Tungsten Execution Engine
• Tungsten is an execution engine in Apache Spark that focuses on improving the memory and CPU
efficiency of Spark applications.
• Responsible for efficiently executing the physical execution plan generated by Catalyst.
• Key features of the Tungsten execution engine:
  − Memory Management
  − Code Generation
  − Whole-Stage Code Generation
Spark SQL Features
• Joins: Combining two or more datasets on a common set of columns. E.g. inner, outer, left, and broadcast joins.
• Aggregations: Summarizing and condensing large datasets into manageable and insightful forms.
• Window Functions: Functions that operate on a set of rows and return a single value for each row. E.g. ranking, analytic, and aggregate functions.
• DDL: Data Definition Language, used to create, drop, and describe tables.
Tackling Challenges & Limitations
• Data Skew: A common problem in Spark SQL where data is not evenly distributed across partitions, leading to performance issues. It can be addressed by repartitioning the data, which triggers a shuffle.
• Performance: Certain queries, such as joins and aggregations, can be slow. This can be handled by using broadcast joins or coalesce.
• Streaming Data Challenges: Handling event time versus processing time, late-arriving data, and watermarking to ensure accurate results.
• Out-of-memory errors: These can happen when the data is too large or when the queries are too complex. To address them, you can use a smaller data set or use a distributed file system.
• Resource Usage: Spark SQL can consume significant memory and CPU resources, especially for large-scale data processing. It's important to configure cluster resources properly.
• Debugging Complexity: Debugging SQL queries in Spark can be challenging, especially when dealing with complex transformations.
Advanced SQL Operations
JOINS
A join in SQL is an operation that combines rows from two or more tables based on a related column
between them. It is used to retrieve data from multiple tables in a single result set, allowing you to
combine information from different sources.
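As a rough illustration of the join behaviour described above, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a Spark session, so it runs without a cluster. The `employees` and `departments` tables and their rows are hypothetical. In Spark, the same SQL text would typically be passed to `spark.sql(...)` after registering the DataFrames as temporary views.

```python
import sqlite3

# Stand-in for a Spark session: an in-memory SQLite database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('Asha', 1), ('Ben', 2), ('Chen', NULL);
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Platform'), (3, 'QA');
""")

# INNER JOIN keeps only rows that have a match on both sides.
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.name
""").fetchall()

# LEFT JOIN keeps every employee; unmatched rows get NULL for dept_name.
left_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.name
""").fetchall()

print(inner_rows)
print(left_rows)
```

The employee without a department drops out of the inner join but survives the left join with a NULL department name.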
Advanced SQL Operations
UNION
In Spark SQL, you can use the UNION operation to combine the results of two or more SELECT
statements or DataFrames into a single result set.
Conditions for performing a union:
• Column Count and Data Types: The SELECT statements or DataFrames being combined with
UNION must have the same number of columns, and the corresponding columns must have
compatible data types. Spark SQL performs type checking to ensure that the columns align.
• Column Order: The columns must be in the same order in all SELECT statements or DataFrames,
because UNION matches columns by position, not by name. The first column of the first SELECT
statement or DataFrame is combined with the first column of each subsequent one, and so on.
• No Duplicate Rows: In SQL, UNION removes duplicate rows from the result set. To keep duplicate
rows, use UNION ALL instead of UNION. Note that the DataFrame union() method behaves like
UNION ALL and keeps duplicates; follow it with distinct() to remove them.
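The difference between UNION and UNION ALL can be sketched as follows. This again uses Python's built-in sqlite3 module as a stand-in for Spark, and the `orders_2023` and `orders_2024` tables are hypothetical. The SQL semantics shown here (deduplication for UNION, none for UNION ALL) are standard SQL behaviour.

```python
import sqlite3

# Stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2023 (customer TEXT, amount INTEGER);
    CREATE TABLE orders_2024 (customer TEXT, amount INTEGER);
    INSERT INTO orders_2023 VALUES ('Asha', 100), ('Ben', 250);
    INSERT INTO orders_2024 VALUES ('Ben', 250), ('Chen', 75);
""")

# UNION: same column count, columns matched by position, duplicates removed.
union_rows = conn.execute("""
    SELECT customer, amount FROM orders_2023
    UNION
    SELECT customer, amount FROM orders_2024
    ORDER BY customer
""").fetchall()

# UNION ALL: identical inputs, but the duplicate ('Ben', 250) row is kept.
union_all_rows = conn.execute("""
    SELECT customer, amount FROM orders_2023
    UNION ALL
    SELECT customer, amount FROM orders_2024
    ORDER BY customer
""").fetchall()

print(union_rows)
print(union_all_rows)
```

The row that appears in both tables is returned once by UNION and twice by UNION ALL.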
Advanced SQL Operations
WINDOW FUNCTIONS
Window functions allow you to perform calculations over a "window" of rows related to the current row.
The window is defined by an optional partitioning, an ordering, and a frame of rows within the result set.
These functions are particularly useful for tasks like calculating moving averages, rankings, percentiles, and more.
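A ranking and a moving average over a window might look like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for Spark and assumes the bundled SQLite is version 3.25 or later (needed for window functions). The `sales` table is hypothetical. The OVER clause syntax is standard SQL.

```python
import sqlite3

# Stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 1, 10), ('east', 2, 30), ('east', 3, 20),
        ('west', 1, 5),  ('west', 2, 15);
""")

rows = conn.execute("""
    SELECT region, day, amount,
           -- Rank each day's amount within its region, highest first.
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
           -- Two-row moving average: the previous day and the current day.
           AVG(amount) OVER (
               PARTITION BY region ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

Unlike a GROUP BY aggregation, every input row is kept, and each one carries its own rank and moving average.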
Advanced SQL Operations
PIVOT
In Spark SQL, "pivoting" refers to a transformation that restructures data from a long format to a
wide format: the distinct values of one column become separate columns in the result.
When working with your data, you will sometimes need to swap rows and columns in this way, i.e. pivot
your data.
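The long-to-wide reshaping can be sketched with conditional aggregation, which is one portable way to express a pivot. The example uses Python's built-in sqlite3 module as a stand-in for Spark, and SQLite has no PIVOT clause, so this is a swapped-in technique rather than Spark's own syntax. The `sales` table is hypothetical.

```python
import sqlite3

# Stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('laptop', 'Q1', 100), ('laptop', 'Q2', 150), ('laptop', 'Q1', 50),
        ('phone',  'Q1', 80),  ('phone',  'Q2', 120);
""")

# Long format: one row per (product, quarter) sale.
# Wide format: one row per product, one column per quarter.
# Each CASE expression picks out the amounts for a single quarter before summing.
wide_rows = conn.execute("""
    SELECT product,
           SUM(CASE WHEN quarter = 'Q1' THEN amount END) AS Q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount END) AS Q2
    FROM sales
    GROUP BY product
    ORDER BY product
""").fetchall()

print(wide_rows)
```

Spark SQL also offers a dedicated PIVOT clause, and DataFrames offer `groupBy(...).pivot(...)`, which express the same reshaping more directly.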
Advanced SQL Operations
UDF
In Spark SQL, "UDF" stands for "User-Defined Function". UDFs allow you to define your own custom
functions, which can then be used in SQL queries and DataFrame operations. They are especially useful
when you need to perform custom transformations on your data that aren't easily achievable with
built-in Spark functions.
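The idea of registering a custom function and then calling it from SQL can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for Spark, where `conn.create_function` plays the role that `spark.udf.register` would typically play. The `mask_email` function and the `users` table are hypothetical.

```python
import sqlite3

def mask_email(email):
    """Hide most of the local part of an email address; pass NULLs through."""
    if email is None:
        return None
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

# Stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL-visible name, taking one argument.
conn.create_function("mask_email", 1, mask_email)

conn.executescript("""
    CREATE TABLE users (name TEXT, email TEXT);
    INSERT INTO users VALUES ('Asha', 'asha@example.com'), ('Ben', NULL);
""")

# Once registered, the UDF is called like any built-in SQL function.
rows = conn.execute(
    "SELECT name, mask_email(email) FROM users ORDER BY name"
).fetchall()

print(rows)
```

Handling NULL input explicitly inside the function is a deliberate choice here, since a custom function receives missing values as-is.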
DEMO