SPARK SQL
Xinh Huynh
Women in Big Data training workshop
August, 2016
Audience poll
Image: https://commons.wikimedia.org/wiki/File:PEO-happy_person_raising_one_hand.svg
Outline
• Part 1: Spark SQL Overview, SQL Queries
• Part 2: DataFrame Queries
• Part 3: Additional DataFrame Functions
Why learn Spark SQL?
• Most popular component in Spark
• Spark Survey 2015
• Use cases
• ETL
• Analytics
• Feature Extraction for machine learning
[Bar chart: % of users, axis from 0 to 70; bars for Spark SQL, DataFrames, MLlib, GraphX, Streaming]
Use case: ETL & analytics
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Analytics
• What time of day do users use the app?
• What is the most popular restaurant type in San Jose, CA?
Pipeline: Logs -> ETL (Spark SQL) -> Analytics (Spark SQL)
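The ETL and analytics steps above can be sketched in a few lines of Spark SQL. This is only an illustration under stated assumptions: the rows are made up, the column names (`ts`, `userId`, `lon`, `lat`, `restaurantType`) and the bounding box used for "San Jose, CA" are hypothetical, and the two location numbers in the sample record are assumed to be longitude followed by latitude. It is written in notebook/script style and assumes the spark-sql library is available; on Databricks a `spark` session already exists.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; getOrCreate() reuses an existing session if there is one.
val spark = SparkSession.builder().master("local[*]").appName("etl-analytics-sketch").getOrCreate()
import spark.implicits._

// Hypothetical raw log records: (timestamp, userId, longitude, latitude, restaurantType).
val rawLogs = Seq(
  ("4/24/2014 6:22:51 PM", 1000618L, -85.5750, 42.2959, "Pizza"),
  ("4/24/2014 6:40:03 PM", 1000619L, -121.8863, 37.3382, "Thai"),
  ("4/25/2014 12:05:10 PM", 1000620L, -121.8900, 37.3300, "Pizza"),
  ("4/25/2014 12:30:45 PM", 1000621L, -121.9000, 37.3400, "Pizza")
).toDF("ts", "userId", "lon", "lat", "restaurantType")
rawLogs.createOrReplaceTempView("raw_logs")

// ETL: parse the timestamp string into a real timestamp column.
val logs = spark.sql("""
  SELECT CAST(unix_timestamp(ts, 'M/d/yyyy h:mm:ss a') AS TIMESTAMP) AS eventTime,
         userId, lon, lat, restaurantType
  FROM raw_logs""")
logs.createOrReplaceTempView("logs")

// Analytics 1: what time of day do users use the app?
val usageByHour = spark.sql("""
  SELECT hour(eventTime) AS hourOfDay, COUNT(*) AS events
  FROM logs
  GROUP BY hour(eventTime)
  ORDER BY hourOfDay""")

// Analytics 2: most popular restaurant type inside an assumed bounding box around San Jose, CA.
val popularInSanJose = spark.sql("""
  SELECT restaurantType, COUNT(*) AS events
  FROM logs
  WHERE lat BETWEEN 37.2 AND 37.5 AND lon BETWEEN -122.1 AND -121.7
  GROUP BY restaurantType
  ORDER BY events DESC""")

usageByHour.show()
popularInSanJose.show()
```

A real application would more likely match locations against city boundaries than a fixed box; the box keeps the sketch short.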
How Spark SQL fits into Spark (2.0)
• Base layer: Spark Core (RDD)
• Spark SQL, on top of Spark Core: the Catalyst optimizer, with SQL and DataFrame / Dataset interfaces
• Built on Spark SQL: ML Pipelines, Structured Streaming, GraphFrames
Source: http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
Spark SQL programming interfaces
• Spark SQL = Catalyst, accessed through SQL or through DataFrame / Dataset
• SQL interface: SQL
• DataFrame interface: Scala, Java, R, Python
• Dataset interface: Scala, Java
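To make the two interfaces concrete, here is a small sketch that expresses the same query once as SQL text and once with the DataFrame API. The `visits` data and column names are invented for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().master("local[*]").appName("interfaces-sketch").getOrCreate()
import spark.implicits._

// Made-up data: one row per app visit.
val visits = Seq(
  (1000618L, "Pizza"),
  (1000619L, "Thai"),
  (1000620L, "Pizza")
).toDF("userId", "restaurantType")
visits.createOrReplaceTempView("visits")

// SQL interface: the query is a string handed to spark.sql.
val viaSql = spark.sql(
  "SELECT restaurantType, COUNT(*) AS n FROM visits GROUP BY restaurantType ORDER BY n DESC")

// DataFrame interface: the same query built from method calls.
val viaDataFrame = visits
  .groupBy("restaurantType")
  .count()
  .withColumnRenamed("count", "n")
  .orderBy(desc("n"))

viaSql.show()
viaDataFrame.show()
```

Both forms go through the same Catalyst optimizer, so the choice is mostly about which style fits the surrounding code.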
SQL or DataFrame?
• Use SQL if you are already familiar with SQL
• Use DataFrame
• To write queries in a general-purpose programming language
(Scala, Python, …).
• Use DataFrame to catch syntax errors earlier:

                        SQL                        DataFrame
  Syntax error example  "SELEECT id FROM table"    df.seleect("id")
  Caught at             Runtime                    Compile time
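The difference in error timing can be observed directly. In the sketch below (a local SparkSession and a hypothetical view named `t` are assumed), the misspelled SQL keyword is accepted by the Scala compiler because it is just a string, and it fails only when Spark parses it. The misspelled DataFrame method is left as a comment because it would stop the program from compiling at all.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("error-timing-sketch").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("id")
df.createOrReplaceTempView("t")

// SQL: the typo lives inside a string, so it is detected only at runtime,
// when Spark tries to parse the statement.
val sqlAttempt = Try(spark.sql("SELEECT id FROM t"))
println(s"SQL typo failed at runtime: ${sqlAttempt.isFailure}")

// DataFrame (Scala): the typo is a method name, so the compiler rejects it.
// Uncommenting the next line makes the whole program fail to compile:
// df.seleect("id")

// The correctly spelled call compiles and runs.
val ok = df.select("id")
ok.show()
```

The compile-time check applies to compiled languages such as Scala and Java; in Python a misspelled method name also surfaces at runtime.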
Loading and examining a table, Query with SQL
• See Notebook: http://tinyurl.com/spark-nb1
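For readers without access to the notebook, the following sketch shows one way to load a table, examine it, and query it with SQL. It is not the notebook's code: the CSV rows, the `State` column, and the view name `markets` are made up, and the file is written to a temporary location so that the example is self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("load-and-query-sketch").getOrCreate()

// Write a tiny CSV file with invented rows.
val csvPath = Files.createTempFile("markets", ".csv")
val csvText =
  """FMID,MarketName,State
    |1001,Downtown Market,California
    |1002,Riverside Market,Oregon
    |1003,Harbor Market,California
    |""".stripMargin
Files.write(csvPath, csvText.getBytes(StandardCharsets.UTF_8))

// Load: take column names from the header row and let Spark infer column types.
val markets = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toUri.toString)

// Examine: print the inferred schema and the first few rows.
markets.printSchema()
markets.show(5)

// Query with SQL: register the DataFrame as a temporary view, then query it by name.
markets.createOrReplaceTempView("markets")
val perState = spark.sql(
  "SELECT State, COUNT(*) AS numMarkets FROM markets GROUP BY State ORDER BY numMarkets DESC")
perState.show()
```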
Setup for Hands-on Training
1. Sign on to WiFi with your assigned access code
   1. See slip of paper in front of your seat
2. Sign in to https://community.cloud.databricks.com/
3. Go to "Clusters" and create a Spark 2.0 cluster
   1. This may take a minute.
4. Go to "Workspace" -> Users -> Home -> Create -> Notebook
   1. Select Language = Scala
   2. Create
Part 2: DataFrame Queries
DataFrame API
• See notebook: http://tinyurl.com/spark-nb2
Lazy Execution
• DataFrame operations are lazy
• Work is delayed until the last possible moment
• Transformations: DF -> DF
• select, groupBy; no computation done
• Actions: DF -> console or disk output
• show, collect, count, write; computation is done
Image: https://www.flickr.com/photos/mtch3l/24491625352
Lazy Execution Example
1. val df1 = df.select(…)        // Transformation: no computation done
2. val df2 = df1.groupBy(…)      // Transformation: no computation done
3.              .sum()
4. if (cond)
5.   df2.show()                  // Action: performs the select, groupBy at this time, then shows the results
• Benefits of laziness
• Query optimization across lines 1-3
• If step 5 is not executed, then no unnecessary work is done
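A runnable version of the lazy execution example might look like the sketch below. The input rows, the column names, and the value of `cond` are placeholders chosen for illustration; a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lazy-execution-sketch").getOrCreate()
import spark.implicits._

// Made-up input data.
val df = Seq(("Pizza", 2), ("Thai", 1), ("Pizza", 3)).toDF("restaurantType", "visits")

// Transformations: each call only extends the query plan; nothing is computed yet.
val df1 = df.select("restaurantType", "visits")
val df2 = df1.groupBy("restaurantType").sum("visits")

// The plan built so far can be inspected without running the aggregation.
df2.explain()

val cond = true // hypothetical condition
if (cond) {
  // Action: the select and groupBy/sum are executed now, as one optimized query.
  df2.show()
}
```

Because Spark sees the select and the aggregation together before running anything, it can optimize them as a single query rather than step by step.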
Caching
• When querying the same data set over and over, caching it
in memory may speed up queries.
• Back to notebook …
Without caching: Disk -> Memory -> Results
With caching: Memory -> Results
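As a rough sketch of the caching pattern (the data and column names are invented, and a local SparkSession is assumed), a DataFrame is marked for caching, materialized by a first action, and then reused by later queries:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("caching-sketch").getOrCreate()
import spark.implicits._

// Made-up data standing in for a larger table read from disk.
val logs = Seq(("Pizza", 18), ("Thai", 12), ("Pizza", 12)).toDF("restaurantType", "hourOfDay")

// cache() only marks the DataFrame for caching; like a transformation, it is lazy.
val cached = logs.cache()

// The first action computes the data and stores it
// (in memory, spilling to disk if needed under the default storage level).
cached.count()

// Later queries on the same DataFrame can be served from the cache.
val pizzaEvents = cached.filter($"restaurantType" === "Pizza").count()
val noonEvents = cached.filter($"hourOfDay" === 12).count()
println(s"pizza=$pizzaEvents noon=$noonEvents")

// Release the cached data once it is no longer needed.
cached.unpersist()
```

With a data set this small the speedup is not measurable; the benefit shows up when the source is large or expensive to recompute.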
Part 3: Additional DataFrame Functions
Use case: Feature Extraction for ML
• Example: restaurant finder app
• Log data: Timestamp, UserID, Location, RestaurantType
• [ 4/24/2014 6:22:51 PM, 1000618, -85.5750, 42.2959, Pizza ]
• Machine Learning to train a model of user preferences
• Use Spark SQL to extract features for the model
• Example features: hour of day, distance to a restaurant, restaurant type
Pipeline: Logs -> ETL (Spark SQL) -> Features (Spark SQL) -> ML Training
See Notebook …
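The three example features can be derived with DataFrame functions roughly as follows. This is a sketch, not the notebook's code: the restaurant coordinates are hypothetical, the log's two location numbers are assumed to be longitude then latitude, and the haversine formula is one common choice for distance that the slides do not specify.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{hour, lit, udf, unix_timestamp}

val spark = SparkSession.builder().master("local[*]").appName("feature-extraction-sketch").getOrCreate()
import spark.implicits._

// Great-circle (haversine) distance in kilometers between two (lat, lon) points.
val haversineKm: (Double, Double, Double, Double) => Double = (lat1, lon1, lat2, lon2) => {
  val earthRadiusKm = 6371.0
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  2 * earthRadiusKm * math.asin(math.sqrt(a))
}
// Wrap the plain Scala function so it can be applied to DataFrame columns.
val distanceKm = udf(haversineKm)

// The sample record from the slide, with assumed column names.
val logs = Seq(
  ("4/24/2014 6:22:51 PM", 1000618L, -85.5750, 42.2959, "Pizza")
).toDF("ts", "userId", "lon", "lat", "restaurantType")

// Hypothetical restaurant location.
val restaurantLat = 42.2917
val restaurantLon = -85.5872

val features = logs.select(
  $"userId",
  // Feature 1: hour of day, from the parsed timestamp.
  hour(unix_timestamp($"ts", "M/d/yyyy h:mm:ss a").cast("timestamp")).as("hourOfDay"),
  // Feature 2: distance from the user to the restaurant.
  distanceKm($"lat", $"lon", lit(restaurantLat), lit(restaurantLon)).as("distanceKm"),
  // Feature 3: restaurant type, passed through unchanged.
  $"restaurantType"
)
features.show()
```

Built-in functions such as `hour` are generally preferable to UDFs where they exist, since Catalyst can optimize them; the UDF is used here only because the distance calculation has no single built-in equivalent.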
Functions for DataFrames
• See notebook: http://tinyurl.com/spark-nb3
Dataset (new in 2.0)
• DataFrames are untyped
• df.select($"col1" + 3): numerical type assumed, but not checked at compile time
• Useful when exploring new data
• Datasets are typed
• Dataset[T]
• Associates an object of type T with each row
• Catches type mismatches at compile time
• DataFrame = Dataset[Row]
• A DataFrame is one specific type of Dataset[T]
• Example of a typed Dataset:
case class FarmersMarket(FMID: Int, MarketName: String)
val ds : Dataset[FarmersMarket] …
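The contrast between typed and untyped access can be tried out with the `FarmersMarket` case class from the slide. The rows below are made up, a local SparkSession is assumed, and the code assumes a Scala 2 build of Spark, where `spark.implicits._` supplies encoders for case classes.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level (not inside a method) so that Spark can derive an encoder for it.
case class FarmersMarket(FMID: Int, MarketName: String)

val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

// Made-up rows.
val ds: Dataset[FarmersMarket] = Seq(
  FarmersMarket(1001, "Downtown Market"),
  FarmersMarket(1002, "Riverside Market")
).toDS()

// Typed: each row is a FarmersMarket object, so field names and types are checked by the compiler.
val names = ds.map(m => m.MarketName.toUpperCase).collect()
// ds.map(m => m.MarketName - 1)   // would not compile: String has no `-` method

// Untyped: columns are looked up by name, and their types are not known to the compiler.
val df = ds.toDF()
val shifted = df.select($"FMID" + 3)
// df.select($"MarketName" + 3) also compiles; whether adding 3 to a string column
// makes sense is discovered only when Spark analyzes or runs the query.

names.foreach(println)
shifted.show()
```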
Review
• Part 1: Spark SQL Overview, SQL Queries √
• Part 2: DataFrame Queries √
• Part 3: Additional DataFrame Functions √
References
• Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html
• Spark Scala API docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
• Overview of DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html
• Questions, comments:
• Spark user list: user@spark.apache.org
• Xinh’s contact: https://www.linkedin.com/in/xinh-huynh-317608
• Women in Big Data: https://www.womeninbigdata.org/

Introduction to Spark SQL training workshop
