WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Oliver Țupran
From HelloWorld to
Configurable and Reusable
Apache Spark Applications
In Scala
#UnifiedDataAnalytics #SparkAISummit
whoami
3 #UnifiedDataAnalytics #SparkAISummit
Oliver Țupran
Software Engineer
Aviation, Banking, Telecom...
Scala Enthusiast
Apache Spark Enthusiast
olivertupran
tupol
@olivertupran
Intro
Audience
● Professionals starting with Scala and Apache Spark
● Basic Scala knowledge is required
● Basic Apache Spark knowledge is required
4
Intro
#UnifiedDataAnalytics #SparkAISummit
Agenda
● Hello, World!
● Problems
● Solutions
● Summary
5
Intro
#UnifiedDataAnalytics #SparkAISummit
Hello, World!
6
Hello, World!
./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
scala> linesWithSpark.count() // How many lines contain "Spark"?
res3: Long = 15
Source spark.apache.org/docs/latest/quick-start.html
#UnifiedDataAnalytics #SparkAISummit
Hello, World!
7
Hello, World!
Source spark.apache.org/docs/latest/quick-start.html
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
#UnifiedDataAnalytics #SparkAISummit
Problems
8
Problems
● Configuration mixed with the application logic
● IO can be much more complex than it looks
● Hard to test
#UnifiedDataAnalytics #SparkAISummit
Solutions
9
Solutions
● Clean separation of the business logic
● Spark session out of the box
● Configuration and validation support
● Encourage and facilitate testing
#UnifiedDataAnalytics #SparkAISummit
tupol/spark-utils
Business Logic Separation
10
Solutions
/**
* @tparam Context The type of the application context class.
* @tparam Result The output type of the run function.
*/
trait SparkRunnable[Context, Result] {
/**
* @param context context instance containing all the application specific configuration
* @param spark active spark session
* @return An instance of type Result
*/
def run(implicit spark: SparkSession, context: Context): Result
}
Source github.com/tupol/spark-utils
#UnifiedDataAnalytics #SparkAISummit
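To make the contract concrete, here is a minimal, hypothetical implementation of the SparkRunnable trait shown above: a word count whose entire business logic lives in run(). WordCountContext and WordCountRunnable are illustrative names for this sketch, not part of spark-utils.

  import org.apache.spark.sql.{Dataset, SparkSession}

  // Hypothetical context: the only application-specific setting is the input path.
  case class WordCountContext(inputPath: String)

  // All the business logic sits inside run(); no session management, no configuration parsing.
  object WordCountRunnable extends SparkRunnable[WordCountContext, Long] {
    override def run(implicit spark: SparkSession, context: WordCountContext): Long = {
      import spark.implicits._
      val lines: Dataset[String] = spark.read.textFile(context.inputPath)
      lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).count()
    }
  }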
Stand-Alone App Blueprint
11
Solutions
trait SparkApp[Context, Result] extends SparkRunnable[Context, Result] with Logging {
def appName: String = . . .
private def applicationConfiguration(implicit spark: SparkSession, args: Array[String]):
com.typesafe.config.Config = . . .
def createSparkSession(runnerName: String): SparkSession =
. . .
def createContext(config: com.typesafe.config.Config): Context
def main(implicit args: Array[String]): Unit = {
// Create a SparkSession, initialize a Typesafe Config instance,
// validate and initialize the application context,
// execute the run() function, close the SparkSession and
// return the result or throw an Exception
. . .
}
}
Source github.com/tupol/spark-utils
#UnifiedDataAnalytics #SparkAISummit
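The main() comments above describe the whole life cycle; the sketch below spells those steps out in plain Scala. It is a simplification for illustration only, not the actual spark-utils code (the real implementation also folds command-line arguments into the configuration).

  import scala.util.Try
  import com.typesafe.config.{Config, ConfigFactory}
  import org.apache.spark.sql.SparkSession

  object MainFlowSketch {
    def execute[Context, Result](createContext: Config => Context)(
        run: (SparkSession, Context) => Result): Result = {
      val spark = SparkSession.builder.appName("SketchApp").getOrCreate() // create a SparkSession
      val result = Try {
        val config  = ConfigFactory.load()   // initialize a Typesafe Config instance
        val context = createContext(config)  // validate and initialize the application context
        run(spark, context)                  // execute the business logic
      }
      spark.stop()                           // close the SparkSession
      result.get                             // return the result or throw the captured exception
    }
  }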
Back to SimpleApp
12
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
Solutions
Source spark.apache.org/docs/latest/quick-start.html
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
13
Source github.com/tupol/spark-utils-demos/
object SimpleApp extends SparkApp[Unit, Unit] {
override def createContext(config: Config): Unit = ()
override def run(implicit spark: SparkSession, context: Unit): Unit = {
val logFile = "YOUR_SPARK_HOME/README.md"
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
}
Solutions
1
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
14
object SimpleApp extends SparkApp[Unit, Unit] {
override def createContext(config: Config): Unit = ()
override def run(implicit spark: SparkSession, context: Unit): Unit = {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val logData = spark.read.textFile(logFile).cache()
val (numAs, numBs) = appLogic(logData)
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
def appLogic(data: Dataset[String]): (Long, Long) = {
val numAs = data.filter(line => line.contains("a")).count()
val numBs = data.filter(line => line.contains("b")).count()
(numAs, numBs)
}
}
Solutions
2
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
SimpleApp as SparkApp
15
Solutions
3
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)
object SimpleApp extends SparkApp[SimpleAppContext, Unit] {
override def createContext(config: Config): SimpleAppContext = ???
override def run(implicit spark: SparkSession, context: SimpleAppContext): Unit = {
val logData = spark.source(context.input).read.as[String].cache
val (numAs, numBs) = appLogic(logData, context)
println(s"Lines with a: $numAs, Lines with b: $numBs")
}
def appLogic(data: Dataset[String], context: SimpleAppContext): (Long, Long) = {
val numAs = data.filter(line => line.contains(context.filterA)).count()
val numBs = data.filter(line => line.contains(context.filterB)).count()
(numAs, numBs)
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Why Typesafe Config?
16
Solutions
● supports files in three formats: Java properties, JSON, and a human-friendly JSON superset
● merges multiple files across all formats
● can load from files, URLs or classpath
● users can override the config with Java system properties, java -Dmyapp.foo.bar=10
● supports configuring an app, with its framework and libraries, all from a single file such as
application.conf
● extracts typed properties
● JSON superset features:
○ comments
○ includes
○ substitutions ("foo" : ${bar}, "foo" : Hello ${who})
○ properties-like notation (a.b=c)
○ less noisy, more lenient syntax
○ substitute environment variables (logdir=${HOME}/logs)
○ lists
Source github.com/lightbend/config
#UnifiedDataAnalytics #SparkAISummit
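A small stand-alone sketch (plain Typesafe Config, no spark-utils) exercising a few of the features listed above: comments, substitutions and typed property access. The keys and values are made up for the example.

  import com.typesafe.config.ConfigFactory

  object TypesafeConfigDemo extends App {
    val hocon =
      """
        |who = "World"                  // comments are allowed
        |greeting = "Hello, "${who}     // substitution
        |myapp {
        |  threshold = 0.75             // read back as a Double below
        |  labels = [a, b, c]           // lists
        |}
        |""".stripMargin

    val config = ConfigFactory.parseString(hocon).resolve() // resolve() applies the substitutions

    println(config.getString("greeting"))         // Hello, World
    println(config.getDouble("myapp.threshold"))  // 0.75
    println(config.getStringList("myapp.labels")) // [a, b, c]
  }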
Application Configuration File
17
Solutions
HOCON
SimpleApp {
input {
format: text
path: SPARK_HOME/README.md
}
filterA: A
filterB: B
}
Java Properties
SimpleApp.input.format=text
SimpleApp.input.path=SPARK_HOME/README.md
SimpleApp.filterA=A
SimpleApp.filterB=B
#UnifiedDataAnalytics #SparkAISummit
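The two files above are equivalent as far as Typesafe Config is concerned: both resolve to the same configuration paths. A quick sketch that checks this (parseString accepts both notations, since properties-style keys are valid HOCON):

  import com.typesafe.config.ConfigFactory

  object SameConfigCheck extends App {
    val hocon = ConfigFactory.parseString(
      """SimpleApp {
        |  input { format: text, path: SPARK_HOME/README.md }
        |  filterA: A
        |  filterB: B
        |}""".stripMargin)

    val props = ConfigFactory.parseString(
      """SimpleApp.input.format=text
        |SimpleApp.input.path=SPARK_HOME/README.md
        |SimpleApp.filterA=A
        |SimpleApp.filterB=B""".stripMargin)

    assert(hocon.getString("SimpleApp.input.path") == props.getString("SimpleApp.input.path"))
    assert(hocon.getString("SimpleApp.filterA") == props.getString("SimpleApp.filterA"))
  }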
Configuration and Validation
18
Solutions
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)
object SimpleAppContext extends Configurator[SimpleAppContext] {
import org.tupol.utils.config._
override def validationNel(config: com.typesafe.config.Config):
scalaz.ValidationNel[Throwable, SimpleAppContext] = {
config.extract[FileSourceConfiguration]("input")
.ensure(new IllegalArgumentException(
"Only 'text' format files are supported").toNel)(_.format == FormatType.Text) |@|
config.extract[String]("filterA") |@|
config.extract[String]("filterB") apply
SimpleAppContext.apply
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Configurator Framework?
19
Solutions
● DSL for easy definition of the context
○ config.extract[Double]("parameter.path")
○ |@| operator to compose the extracted parameters
○ apply to build the configuration case class
● Type-based configuration parameters extraction
○ extract[Double]("parameter.path")
○ extract[Option[Seq[Double]]]("parameter.path")
○ extract[Map[Int, String]]("parameter.path")
○ extract[Either[Int, String]]("parameter.path")
● Implicit Configurators can be used as extractors in the DSL
○ config.extract[SimpleAppContext]("configuration.path")
● The ValidationNel contains either a list of exceptions or the application context (see the sketch below)
#UnifiedDataAnalytics #SparkAISummit
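A small sketch of the DSL in action, using made-up parameter names; it assumes the spark-utils config DSL (org.tupol.utils.config) and scalaz are on the classpath, as in the SimpleAppContext example above.

  import com.typesafe.config.ConfigFactory
  import org.tupol.utils.config._          // extract[T] and the configuration DSL
  import scalaz.ValidationNel
  import scalaz.syntax.applicative._       // the |@| composition operator

  // Illustrative context; the parameter names are invented for this sketch.
  case class ThresholdContext(threshold: Double, label: String)

  object ThresholdContextDemo extends App {
    val config = ConfigFactory.parseString(
      """threshold = 0.75
        |label = demo""".stripMargin)

    val validated: ValidationNel[Throwable, ThresholdContext] =
      config.extract[Double]("threshold") |@|
      config.extract[String]("label") apply
      ThresholdContext.apply

    // Either a non-empty list of exceptions or the fully built context.
    validated.fold(
      errors => println(s"Configuration errors: $errors"),
      context => println(s"Loaded context: $context")
    )
  }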
SimpleApp as SparkApp
20
Solutions
4
case class SimpleAppContext(input: FileSourceConfiguration, filterA: String, filterB: String)
object SimpleApp extends SparkApp[SimpleAppContext, (Long, Long)] {
override def createContext(config: Config): SimpleAppContext = SimpleAppContext(config).get
override def run(implicit spark: SparkSession, context: SimpleAppContext): (Long, Long) = {
val logData = spark.source(context.input).read.as[String].cache
val (numAs, numBs) = appLogic(logData, context)
println(s"Lines with a: $numAs, Lines with b: $numBs")
(numAs, numBs)
}
def appLogic(data: Dataset[String], context: SimpleAppContext): (Long, Long) = {
. . .
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Data Sources and Data Sinks
21
Solutions
Common to all formats: format, URI (connection URL, file path…), schema
CSV: sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode
JSON: primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZeros, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat
XML: rowTag, samplingRatio, excludeAttribute, treatEmptyValuesAsNulls, mode, columnNameOfCorruptRecord, attributePrefix, valueTag, charset, ignoreSurroundingSpaces
JDBC: table, columnName, lowerBound, upperBound, numPartitions, connectionProperties
Source spark.apache.org/docs/latest/
#UnifiedDataAnalytics #SparkAISummit
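For contrast, wiring even a handful of these options by hand with the plain Spark reader looks like the sketch below; the path, schema and option values are illustrative.

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.types._

  object CsvReadSketch {
    def readEvents(spark: SparkSession): DataFrame = {
      // Illustrative schema; in practice this often comes from configuration.
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("name", StringType),
        StructField("created_at", TimestampType)))

      spark.read
        .format("csv")
        .option("sep", ",")
        .option("header", "true")
        .option("dateFormat", "yyyy-MM-dd")
        .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
        .option("mode", "PERMISSIVE")
        .schema(schema)
        .load("/path/to/events.csv")
    }
  }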
Data Sources and Data Sinks
22
Solutions
import org.tupol.spark.implicits._
import org.tupol.spark.io._
import spark.implicits._
. . .
val input = config.extract[FileSourceConfiguration]("input").get
val lines = spark.source(input).read.as[String]
// org.tupol.spark.io.FileDataSource(input).read
// spark.read.format(...).option(...).option(...).schema(...).load()
val output = config.extract[FileSinkConfiguration]("output").get
lines.sink(output).write
// org.tupol.spark.io.FileDataSink(output).write(lines)
// lines.write.format(...).option(...).option(...).partitionBy(...).mode(...)
#UnifiedDataAnalytics #SparkAISummit
Data Sources and Data Sinks
23
Solutions
● Very concise and intuitive DSL
● Support for multiple formats: text, csv, json, xml, avro, parquet, orc, jdbc, ...
● Specify a schema on read
● Schema is passed as a full JSON structure, as serialised by StructType (see the sketch below)
● Specify the partitioning and bucketing for writing the data
● Structured streaming support
● Delta Lake support
● . . .
#UnifiedDataAnalytics #SparkAISummit
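The JSON schema round trip mentioned above relies on Spark's own StructType serialisation; a stand-alone sketch, independent of spark-utils:

  import org.apache.spark.sql.types.{DataType, StructType}

  object SchemaJsonSketch extends App {
    val schema = new StructType()
      .add("id", "long")
      .add("name", "string")

    val asJson: String = schema.json                              // the full JSON structure of the schema
    val restored = DataType.fromJson(asJson).asInstanceOf[StructType]

    assert(restored == schema)                                    // the round trip is lossless
    println(asJson)
  }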
Test! Test! Test!
24
A World of Opportunities
class SimpleAppSpec extends FunSuite with Matchers with SharedSparkSession {
. . .
val DummyInput = FileSourceConfiguration("no path", TextSourceConfiguration())
val DummyContext = SimpleAppContext(input = DummyInput, filterA = "", filterB = "")
test("appLogic should return 0 counts of a and b for an empty DataFrame") {
val testData = spark.emptyDataset[String]
val result = SimpleApp.appLogic(testData, DummyContext)
result shouldBe (0, 0)
}
. . .
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Test! Test! Test!
25
Solutions
class SimpleAppSpec extends FunSuite with Matchers with SharedSparkSession {
. . .
test("run should return (1, 2) as count of a and b for the given data") {
val inputSource = FileSourceConfiguration("src/test/resources/input-test-01",
TextSourceConfiguration())
val context = SimpleAppContext(input = inputSource, filterA = "a", filterB = "b")
val result = SimpleApp.run(spark, context)
result shouldBe (1, 2)
}
. . .
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Format Converter
26
Solutions
case class MyAppContext(input: FormatAwareDataSourceConfiguration,
output: FormatAwareDataSinkConfiguration)
object MyAppContext extends Configurator[MyAppContext] {
import scalaz.ValidationNel
import scalaz.syntax.applicative._
def validationNel(config: Config): ValidationNel[Throwable, MyAppContext] = {
config.extract[FormatAwareDataSourceConfiguration]("input") |@|
config.extract[FormatAwareDataSinkConfiguration]("output") apply
MyAppContext.apply
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Format Converter
27
Solutions
object MyApp extends SparkApp[MyAppContext, DataFrame] {
override def createContext(config: Config): MyAppContext = MyAppContext(config).get
override def run(implicit spark: SparkSession, context: MyAppContext): DataFrame = {
val data = spark.source(context.input).read
data.sink(context.output).write
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
Beyond Format Converter
28
Solutions
object MyApp extends SparkApp[MyAppContext, DataFrame] {
override def createContext(config: Config): MyAppContext = MyAppContext(config).get
override def run(implicit spark: SparkSession, context: MyAppContext): DataFrame = {
val inputData = spark.source(context.input).read
val outputData = transform(inputData)
outputData.sink(context.output).write
}
def transform(data: DataFrame)(implicit spark: SparkSession, context: MyAppContext) = {
data // Transformation logic here
}
}
Source github.com/tupol/spark-utils-demos/
#UnifiedDataAnalytics #SparkAISummit
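As an illustration only, the transform() placeholder above could host something as simple as the following; the column names are assumptions for this sketch and not part of the demo project.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._

  object TransformSketch {
    // Drop rows without an id and stamp each record with the load time.
    def transform(data: DataFrame): DataFrame =
      data
        .filter(col("id").isNotNull)
        .withColumn("loaded_at", current_timestamp())
  }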
Summary
29
Summary
● Write Apache Spark applications with minimal ceremony
○ batch
○ structured streaming
● IO and general application configuration support
● Facilitates testing
● Increases productivity
tupol/spark-utils spark-tools spark-utils-demos spark-apps.seed.g8
#UnifiedDataAnalytics #SparkAISummit
What’s Next?
30
What’s Next?
● Support for more source types
● Improvements of the configuration framework
● Feedback is welcome!
● Help is welcome!
#UnifiedDataAnalytics #SparkAISummit
olivertupran
tupol
@olivertupran
References
Presentation https://tinyurl.com/yxuneqcs
spark-utils https://github.com/tupol/spark-utils
spark-utils-demos https://github.com/tupol/spark-utils-demos
spark-apps.seed.g8 https://github.com/tupol/spark-apps.seed.g8
spark-tools https://github.com/tupol/spark-tools
Lightbend Config https://github.com/lightbend/config
Giter8 http://www.foundweekends.org/giter8/
Apache Spark http://spark.apache.org/
ScalaZ https://github.com/scalaz/scalaz
Scala https://scala-lang.org
31
References
#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT