Understanding the SQL aspects of Spark - Spark SQL.pptx

Exploring SQL Aspects of Spark
Raviyanshu, Ayush

1. Apache Spark
2. Spark SQL Intro
3. Spark SQL Architecture
4. Spark SQL Features
5. Challenges & Limitations with Spark SQL
6. Advanced SQL Operations
• Window
• JOINS
• UNION
• PIVOT
• UDF
7. Demo

What is Apache Spark?
• Open-source, distributed computing system designed for processing big data
• Core data abstractions: Resilient Distributed Dataset (RDD), DataFrames, Datasets
• Widely adopted for its speed, scalability, and ease of use

Spark SQL in Spark
• Module in Spark that provides a unified SQL interface to large-scale datasets
• Works with both structured and semi-structured data
• Spark SQL allows developers to issue ANSI SQL:2003-compatible queries
• Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors

Spark SQL Architecture
• At the core of Spark SQL are:
− Catalyst optimizer
− Project Tungsten
• Together they support the high-level APIs:
− DataFrame
− Datasets
− SQL queries
• The execution engine is responsible for executing the optimized SQL queries on the cluster.
• The result handler collects the results of the executed queries and returns them to the user.

Spark SQL Arch...
The Catalyst Optimizer

Spark SQL Arch...
The Tungsten Execution Engine
• Tungsten is an execution engine in Apache Spark that focuses on improving the memory and CPU
efficiency of Spark applications.
• Responsible for efficiently executing the physical execution plan generated by Catalyst.
• Key features of the Tungsten execution engine are:
− Memory Management
− Code Generation
− Whole-Stage Code Generation

Spark SQL Features
• Joins: Combining two or more datasets on a common set of columns. E.g. Inner, Outer, Left, Broadcast, etc.
• Aggregations: Summarizing and condensing large datasets into manageable and insightful forms.
• Window Functions: Functions that operate on a set of rows and return a single value for each row. E.g. Ranking, Analytic, Aggregate.
• DDL: Data Definition Language, used to create, drop, and describe tables.

Tackling Challenges & Limitations
• Data Skew: A common problem in Spark SQL where data is not evenly distributed across partitions, leading to performance issues. It can be addressed by repartitioning and shuffling the data.
• Performance: Certain queries, such as joins and aggregations, can be slow. This can be handled by using broadcast joins or coalesce.
• Streaming Data Challenges: Handling event time versus processing time, late data arrival, and watermarking to ensure accurate results.
• Out-of-memory errors: These can happen when the data is too large or when the queries are too complex. To address them, you can work on a smaller dataset or use a distributed file system.
• Debugging Complexity: Debugging SQL queries in Spark can be challenging, especially when dealing with complex transformations.
• Resource Usage: Spark SQL can consume significant memory and CPU resources, especially for large-scale data processing. It's important to properly configure cluster resources.

Advanced SQL Operations
JOINS
A join in SQL is an operation that combines rows from two or more tables based on a related column
between them. It is used to retrieve data from multiple tables in a single result set, allowing you to
combine information from different sources.
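
The difference between an inner and a left join can be sketched with a small example. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for a Spark session (an assumption made so the snippet runs without a cluster), and the employees/departments tables are hypothetical. The SELECT statements use standard join syntax, so the same text could be passed to spark.sql(...) once the tables are registered as views.

```python
import sqlite3

# SQLite is only a stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ben', 20), (3, 'Chen', NULL);
    INSERT INTO departments VALUES (10, 'Engineering'), (20, 'Sales');
""")

# INNER JOIN keeps only the rows that have a match on both sides.
INNER_JOIN = """
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
"""

# LEFT JOIN keeps every employee; unmatched rows get NULL for dept_name.
LEFT_JOIN = """
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
"""

inner_rows = conn.execute(INNER_JOIN).fetchall()
left_rows = conn.execute(LEFT_JOIN).fetchall()
print(inner_rows)
print(left_rows)
```

The employee without a department disappears from the inner join but survives the left join with a NULL department name.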

UNION
In Spark SQL, you can use the UNION operation to combine the results of two or more SELECT
statements or DataFrames into a single result set.
Conditions for performing a union:
• Column Count and Data Types: The SELECT statements or DataFrames being combined with
UNION must have the same number of columns, and the corresponding columns must have
compatible data types. Spark SQL performs type checking to ensure that the columns align.
• Column Order: The columns must be in the same order in all SELECT statements or DataFrames.
This means that the first column in the first SELECT statement or DataFrame should correspond to the
first column in the subsequent SELECT statements or DataFrames.
• No Duplicate Rows: In SQL, UNION removes duplicate rows from the result set by default. If you want to
include duplicate rows, use UNION ALL instead of UNION. Note that the DataFrame union() method
behaves like UNION ALL and keeps duplicates; call distinct() on the result to remove them.
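
A minimal sketch of the UNION versus UNION ALL behaviour follows. It again uses Python's sqlite3 module as a stand-in for Spark (an assumption so the example runs locally), and the orders_2022/orders_2023 tables are hypothetical. Both statements are plain set-operation SQL that Spark SQL also accepts.

```python
import sqlite3

# SQLite is only a stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2022 (customer TEXT, city TEXT);
    CREATE TABLE orders_2023 (customer TEXT, city TEXT);
    INSERT INTO orders_2022 VALUES ('Asha', 'Pune'), ('Ben', 'Delhi');
    INSERT INTO orders_2023 VALUES ('Ben', 'Delhi'), ('Chen', 'Noida');
""")

# UNION: same column count and order on both sides, duplicates removed.
UNION_QUERY = """
    SELECT customer, city FROM orders_2022
    UNION
    SELECT customer, city FROM orders_2023
    ORDER BY customer
"""

# UNION ALL: same shape requirement, duplicates kept.
UNION_ALL_QUERY = """
    SELECT customer, city FROM orders_2022
    UNION ALL
    SELECT customer, city FROM orders_2023
    ORDER BY customer
"""

union_rows = conn.execute(UNION_QUERY).fetchall()
union_all_rows = conn.execute(UNION_ALL_QUERY).fetchall()
print(union_rows)
print(union_all_rows)
```

The row ('Ben', 'Delhi') exists in both tables, so it appears once under UNION and twice under UNION ALL.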

Windows
Window functions allow you to perform calculations over a "window" of rows, defined by an optional
partition and an ordered range of rows within the result set. These functions are particularly useful for
tasks like calculating moving averages, rankings, percentiles, and more.
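
The ranking and moving-average use cases can be sketched as follows. As an assumption for illustration, the query is run through Python's sqlite3 module (which needs SQLite 3.25 or newer for window functions) rather than a Spark session, and the sales table is hypothetical. The OVER (PARTITION BY ... ORDER BY ...) syntax is standard and is also what Spark SQL accepts.

```python
import sqlite3

# SQLite is only a stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, sale_day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 1, 100), ('North', 2, 300), ('North', 3, 200),
        ('South', 1, 50),  ('South', 2, 150);
""")

WINDOW_QUERY = """
    SELECT region, sale_day, amount,
           -- rank each sale within its region, highest amount first
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
           -- two-day moving average: the previous row and the current row
           AVG(amount) OVER (
               PARTITION BY region ORDER BY sale_day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM sales
    ORDER BY region, sale_day
"""

window_rows = conn.execute(WINDOW_QUERY).fetchall()
for row in window_rows:
    print(row)
```

Unlike a GROUP BY aggregation, every input row is preserved; the window function only adds a computed value to each row.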

PIVOT
In Spark SQL, "pivoting" refers to a transformation operation that restructures or reorganizes data from a
long format to a wide format.
When working with your data, sometimes you will need to swap the columns for the rows i.e. pivot your
data.
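
The long-to-wide reshaping can be sketched with conditional aggregation, which is the portable way to express a pivot in SQL. Spark SQL additionally offers a PIVOT clause and the DataFrame groupBy(...).pivot(...) method for the same purpose; they are not used here because this sketch assumes Python's sqlite3 module as a local stand-in for Spark. The quarterly_sales table is hypothetical.

```python
import sqlite3

# SQLite is only a stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quarterly_sales (product TEXT, qtr TEXT, revenue INTEGER);
    INSERT INTO quarterly_sales VALUES
        ('Laptop', 'Q1', 100), ('Laptop', 'Q2', 120),
        ('Phone',  'Q1', 80),  ('Phone',  'Q2', 90);
""")

# Long format: one row per (product, qtr).
# Wide format: one row per product, one column per quarter.
PIVOT_QUERY = """
    SELECT product,
           SUM(CASE WHEN qtr = 'Q1' THEN revenue END) AS q1,
           SUM(CASE WHEN qtr = 'Q2' THEN revenue END) AS q2
    FROM quarterly_sales
    GROUP BY product
    ORDER BY product
"""

wide_rows = conn.execute(PIVOT_QUERY).fetchall()
print(wide_rows)
```

The distinct values of the qtr column become the column names of the result, which is the swap of rows for columns described above.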

UDF
In Spark SQL, "UDF" typically refers to "User-Defined Function." User-Defined Functions allow you to
define your custom functions in Spark SQL, which can then be used in SQL queries and DataFrame
operations. UDFs are especially useful when you need to perform custom transformations on your data
that aren't easily achievable with built-in Spark functions.
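
The idea of registering a custom function and then calling it from SQL can be sketched as below. This is an assumption-laden illustration: it uses Python's sqlite3 module and its create_function call as a local stand-in, whereas in PySpark the registration step would go through spark.udf.register instead. The initials function and the people table are hypothetical.

```python
import sqlite3


def initials(full_name):
    """Return the upper-cased initials of a name, or None for NULL input."""
    if full_name is None:
        return None
    return "".join(part[0].upper() for part in full_name.split())


# SQLite is only a stand-in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT);
    INSERT INTO people VALUES
        (1, 'ada lovelace'), (2, 'grace brewster hopper'), (3, NULL);
""")

# Register the Python function under a SQL-visible name taking one argument.
conn.create_function("initials", 1, initials)

# Once registered, the function is used like any built-in SQL function.
udf_rows = conn.execute(
    "SELECT name, initials(name) FROM people ORDER BY id"
).fetchall()
print(udf_rows)
```

Handling NULL input explicitly inside the function is worth doing in any engine, since a user-defined function receives missing values just like any other row.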
