Basic: Using Spark DataFrame for SQL
charsyam@naver.com
Create DataFrame From File
val path = "abc.txt"
val df = spark.read.text(path)
Create DataFrame From Kafka
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF
Spark DataFrame Column
1) col("column name")
2) $"column name"
1) and 2) are the same.
Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58
Load TSV with StructType
import org.apache.spark.sql.types._
val irisSchema = StructType(Array(
StructField("Type", IntegerType, true),
StructField("PetalWidth", IntegerType, true),
StructField("PetalLength", IntegerType, true),
StructField("SepalWidth", IntegerType, true),
StructField("SepalLength", IntegerType, true)
))
Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
SepalWidth: Int, SepalLength: Int)
val irisSchema = Encoders.product[IrisSchema].schema
Load TSV
val irisDf = spark.read.format("csv"). // Use "csv" for both TSV and CSV.
option("header", "true"). // Does the file have a header line?
option("delimiter", "\t"). // "\t" for TSV, "," for CSV.
schema(irisSchema). // Schema that was built above.
load("Fisher.txt")
irisDf.show(5)
Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
| 0| 2| 14| 33| 50|
| 1| 24| 56| 31| 67|
| 1| 23| 51| 31| 69|
| 0| 2| 10| 36| 46|
| 1| 20| 52| 30| 65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
Using sqlContext sql
A super-easy way:
df.createOrReplaceTempView("tmp_iris") // returns Unit, so no val needed
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris")
Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
val resultDF = sumDF.selectExpr("*") // select *
Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
val whereDF = df.filter($"petalwidth" > 10)
val whereDF = df.where($"petalwidth" > 10)
//filter and where are the same
val resultDF = whereDF.selectExpr("Type", "petalwidth")
Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc)
2) val sortDF = df.sort($"petalwidth", desc("sepalwidth"))
3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth"))
1), 2) and 3) are the same.
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
Hive supports str_to_map, but Spark does not support it in the DataFrame API
(Spark does support str_to_map in HiveQL).
A UDF can solve this.
val string_line = "A=1,B=2,C=3"
val df = logsDF.withColumn("type", str_to_map(lit(string_line))) // lit() turns the literal into a Column
UDF - str_to_map
val str_to_map = udf {
text: String =>
// Split on "=" and ",": "A=1,B=2,C=3" -> Array("A", "1", "B", "2", "C", "3")
val pairs = text.split("=|,").grouped(2)
pairs.map { case Array(k, v) => k -> v }.toMap
}
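The parsing logic inside the UDF is plain Scala, so it can be sanity-checked without a Spark session. A minimal sketch (the strToMap helper name is ours, not from the slides):

```scala
// Plain-Scala version of the UDF body; no Spark needed to test it.
def strToMap(text: String): Map[String, String] =
  text.split("=|,")   // "A=1,B=2,C=3" -> Array("A", "1", "B", "2", "C", "3")
    .grouped(2)       // -> Array("A", "1"), Array("B", "2"), Array("C", "3")
    .collect { case Array(k, v) => k -> v }
    .toMap

// strToMap("A=1,B=2,C=3") == Map("A" -> "1", "B" -> "2", "C" -> "3")
```

Once this behaves as expected, wrapping the same body in udf { ... } gives the DataFrame version above.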
Thank you.
