Basic: Using Spark DataFrame for SQL
charsyam@naver.com
Create DataFrame From File
val path = "abc.txt"
val df = spark.read.text(path)
Create DataFrame From Kafka
import spark.implicits._ // required for .toDF
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF
Spark DataFrame Column
1) col("column name") // needs import org.apache.spark.sql.functions.col
2) $"column name" // needs import spark.implicits._
1) and 2) are equivalent.
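A minimal sketch of that equivalence, assuming a local SparkSession and a tiny hand-built DataFrame. The session settings, the app name, and the sample rows are illustrative and not taken from the slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("column-sketch").getOrCreate()
import spark.implicits._ // enables $"..." and .toDF

// Hypothetical sample rows reusing two of the iris column names.
val df = Seq((0, 2), (1, 24)).toDF("Type", "PetalWidth")

val viaCol = df.select(col("PetalWidth"))
val viaDollar = df.select($"PetalWidth")

// Compare the collected rows of both selections.
val sameRows = viaCol.collect().toSeq == viaDollar.collect().toSeq
println(sameRows)
```

Both forms produce a Column, so either can be passed to select, filter, sort, and similar methods.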
Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58
Load TSV with StructType
import org.apache.spark.sql.types._
val irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))
Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)
val irisSchema = Encoders.product[IrisSchema].schema
Load TSV
val irisDf = spark.read.format("csv"). // Use "csv" for both TSV and CSV files.
  option("header", "true"). // The file has a header line.
  option("delimiter", "\t"). // Tab for TSV; use "," for CSV.
  schema(irisSchema). // Schema that was built above.
  load("Fisher.txt")
irisDf.show(5)
Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
|   0|         2|         14|        33|         50|
|   1|        24|         56|        31|         67|
|   1|        23|         51|        31|         69|
|   0|         2|         10|        36|         46|
|   1|        20|         52|        30|         65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
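The load steps can be tried end to end without downloading the data. The sketch below is only an illustration: it writes three of the sample rows to a temporary file (a hypothetical stand-in for Fisher.txt), assumes a local SparkSession, and relies on the CSV reader applying the supplied schema instead of the short header names.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("tsv-sketch").getOrCreate()

// Hypothetical stand-in for Fisher.txt: a header line plus three tab-separated sample rows.
val tsvPath = Files.createTempFile("iris-sample", ".tsv")
val sampleTsv = "Type\tPW\tPL\tSW\tSL\n0\t2\t14\t33\t50\n1\t24\t56\t31\t67\n1\t23\t51\t31\t69\n"
Files.write(tsvPath, sampleTsv.getBytes("UTF-8"))

val irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))

// The delimiter is an escaped tab character, "\t".
val irisDf = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "\t").
  schema(irisSchema).
  load(tsvPath.toString)

irisDf.show()
val rowCount = irisDf.count()
val firstPetalWidth = irisDf.head().getInt(1)
```

The column names in the resulting DataFrame come from the schema (PetalWidth, PetalLength, and so on), not from the PW/PL/SW/SL header in the file.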
Using sqlContext sql
Super easy way
df.createOrReplaceTempView("tmp_iris") // returns Unit, so there is nothing to assign
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris")
Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
val allDF = sumDF.selectExpr("*") // select *
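As an alternative to withColumn, selectExpr accepts SQL expressions directly, so the sum can be computed in the same call. The sketch below assumes a local SparkSession and a small hypothetical DataFrame with only the columns it needs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (Type, PetalWidth, SepalWidth).
val df = Seq((0, 2, 33), (1, 24, 31)).toDF("Type", "PetalWidth", "SepalWidth")

// The SQL expression is evaluated per row, as in the SELECT statement above.
val resultDF = df.selectExpr("Type", "PetalWidth + SepalWidth as sum_width")
resultDF.show()

val sums = resultDF.collect().map(_.getInt(1)).toSeq
```

This form stays closest to the original SQL text; the withColumn form is handier when the new column is reused in later steps.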
Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
val whereDF = df.filter($"petalwidth" > 10)
// Equivalent: val whereDF = df.where($"petalwidth" > 10)
// where is an alias for filter.
val resultDF = whereDF.selectExpr("Type", "petalwidth")
Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc)
2) val sortDF = df.sort($"petalwidth", desc("sepalwidth"))
3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth"))
1), 2), and 3) are equivalent; orderBy is an alias for sort.
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
                                      min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
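To check that the SQL statement and the DataFrame calls agree, both can be run against the same data and compared. This is a sketch under stated assumptions: a local SparkSession, five hypothetical rows with only the needed columns, and the temp view name tmp_iris reused from the earlier slide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("groupby-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (Type, PetalWidth, SepalWidth).
val df = Seq(
  (0, 2, 33), (1, 24, 31), (1, 23, 31), (0, 2, 36), (1, 20, 30)
).toDF("Type", "PetalWidth", "SepalWidth")

// SQL version, run against a temporary view.
df.createOrReplaceTempView("tmp_iris")
val sqlDF = spark.sql(
  "select type, max(petalwidth) A, min(sepalwidth) B from tmp_iris group by type")

// DataFrame API version.
val apiDF = df.groupBy($"type").agg(max($"petalwidth").as("A"), min($"sepalwidth").as("B"))

// Sort both results so the comparison does not depend on group order.
val sqlRows = sqlDF.orderBy("type").collect().toSeq
val apiRows = apiDF.orderBy("type").collect().toSeq
println(sqlRows == apiRows)
```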
Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
Hive supports str_to_map, but Spark does not provide it as a DataFrame function
(Spark supports str_to_map in HiveQL).
A UDF solves this.
val string_line = "A=1,B=2,C=3"
val df = logsDF.withColumn("type", str_to_map(lit(string_line))) // a UDF takes Columns, so wrap the string with functions.lit
UDF - str_to_map
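import org.apache.spark.sql.functions.udf
// Split on both delimiters, then pair up the tokens as key -> value.
// For the sample line above, the pattern would be ",|=".
val str_to_map = udf {
  text: String =>
    val pairs = text.split("delimiter1|delimiter2").grouped(2)
    pairs.map { case Array(k, v) => k -> v }.toMap
}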
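Because str_to_map is registered as a Spark SQL built-in function, another option is to call it from the DataFrame API through expr, which swaps the UDF above for the SQL built-in. The sketch below assumes a local SparkSession, a Spark version whose SQL function registry includes str_to_map, and a hypothetical one-row logsDF with a string column named value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("str-to-map-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the Kafka-backed logsDF: one string column named "value".
val logsDF = Seq("A=1,B=2,C=3").toDF("value")

// str_to_map(text, pairDelimiter, keyValueDelimiter) evaluated as a SQL expression.
val mapped = logsDF.withColumn("type", expr("str_to_map(value, ',', '=')"))
mapped.printSchema()

val parsed = mapped.select("type").head().getMap[String, String](0)
println(parsed)
```

A built-in expression avoids UDF serialization overhead and stays visible to the query optimizer; the UDF remains useful when the parsing rules go beyond two fixed delimiters.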
Thank you.