DIY Analytics with Apache Spark

DIY ANALYTICS WITH APACHE SPARK
ADAM ROBERTS
London, 22nd June 2017: originally presented at Geecon

Important disclaimers
Copyright © 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written
permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial
publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. This document is distributed "AS IS"
without any warranty, either express or implied. In no event shall IBM be liable for any damage arising from the use of this information, including but not limited to, loss of data,
business interruption, loss of profit or loss of opportunity. IBM products and services are warranted according to the terms and conditions of the agreements under which they
are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers
have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all
countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and
discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their
specific situation. It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification
and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such
laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Information within this presentation is accurate to the best of the author's knowledge as of the 4th of June 2017

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or
other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or
the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed
or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM
patents, copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™,
Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG, LinuxONE™, Maximo®,
MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™,
PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®,
Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System
z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other
product and service names might be trademarks of IBM or other companies. Oracle and Java are registered trademarks of
Oracle and/or its affiliates. Other names may be trademarks of their respective owners: and a current list of IBM trademarks is
available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Apache Spark,
Apache Cassandra, Apache Hadoop, Apache Maven, Apache Kafka and any other Apache project mentioned here and the
Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.

Stick around for...
● Showing you how to get started from scratch: going from “I’ve heard about Spark” to “I can use it for...”
● Worked examples aplenty: lots of code
● Not intended to be scientifically accurate! Sharing ideas
● Useful reference material
● Slides will be hosted

Motivation!
✔ Doing stuff yourself (within your timeframe and rules)
✔ Findings can be subject to bias: yours don’t have to be
✔ Trust the data instead

Cool projects involving Spark
✔ Finding aliens with the SETI Institute
✔ Genomics projects (GATK, Bluemix Genomics)
✔ IBM Watson services

Your DIY analytics toolkit
✔ Powerful machine(s)
✔ Apache Spark and a JDK
✔ Scala (recommended)
✔ Optional: visualisation library for Spark output e.g. Python with
  ✔ bokeh
  ✔ pandas
✔ Optional but not covered here: a notebook bundled with Spark like Zeppelin, or use Jupyter
Toolbox from wikimedia: Tanemori derivative work: ‫י‬‫ק‬‫נ‬‫א‬'‫ג‬‫י‬‫ק‬‫יו‬

Why listen to me?
● Worked on Apache Spark since 2014
● Helping IBM customers use Spark for the first time
● Resolving problems, educating service teams
● Testing on lots of IBM platforms since Spark 1.2: x86, Power, Z systems, all Java 8 deliverables...
● Fixing bugs in Spark/Java: contributing code and helping others to do so
● Working with performance tuning pros
● Code provided here has an emphasis on readability!

What I'll be covering today
● What is it (why the hype)?
● How to answer questions with Spark
● Core Spark functions (the “bread and butter” stuff), plotting, correlations, machine learning
● Built-in utility functions to make our lives easier (labels, features, handling nulls)
● Examples using data from wearables: two years of activity

Ask me later if you're interested in...
● Spark on IBM hardware
● IBM SDK for Java specifics
● Notebooks
● Spark using GPUs/GPUs from Java
● Performance tuning
● Comparison with other projects
● War stories fixing Spark/Java bugs

What I assume...
● You know how to write Java or Scala
● You’ve heard about Spark but never used it
● You have something to process!

This talk won’t make you a superhero!

But you will...
● Know more about Spark – what it can/can’t do
● Know more about machine learning in Spark
● Know that machine learning’s still hard but in different ways

Open source project (the most active for big data) offering distributed...
● Machine learning
● Graph processing
● Core operations (map, reduce, joins)
● SQL syntax with DataFrames/Datasets

✔ Build it yourself from source (requiring Git, Maven, a JDK) or
✔ Download a community built binary or
✔ Download our free Spark development package (includes IBM's SDK for Java)

Things you can process...
● File formats you could use with Hadoop
● Anything there’s a Spark package for
● json, csv, parquet...
Things you can use with it...
● Kafka for streaming
● Hive tables
● Cassandra as a database
● Hadoop (using HDFS with Spark)
● DB2!

“What’s so good about it then?”

● Offers scalability and resiliency
● Auto-compression, fast serialisation, caching
● Python, R, Scala and Java APIs: eligible for Java based optimisations
● Distributed machine learning!

“Why isn’t everyone using it?”

Not every problem is a Spark one!
● Can you get away with using spreadsheet software?
● Have you really got a large amount of data?
● Data preparation is very important! How will you properly handle negative, null, or otherwise strange values in your data?
● Will you benefit from massive concurrency?
● Is the data in a format you can work with?
● Needs transforming first (and is it worth it)?

Implementation details...
● Not really real-time streaming (“micro-batching”)
● Debugging in a largely distributed system with many moving parts can be tough
● Security: not really locked down out of the box (extra steps required by knowledgeable users: whole disk encryption or using other projects, SSL config to do...)

Getting something up and running quickly

Running something simple
Run any Spark example in “local mode” first (from “spark”)
bin/run-example org.apache.spark.examples.SparkPi 100
Then run it on a cluster you can set up yourself:
Add hostnames in conf/slaves
sbin/start-all.sh
bin/run-example --master spark://<your_master>:7077 ...
Check for running Java processes: looking for workers/executors coming and going
Spark UI (default port 8080 on the master)
See: http://spark.apache.org/docs/latest/spark-standalone.html
lib is only with the IBM package

And you can use Spark's Java/Scala APIs with
bin/spark-shell (a REPL!)
bin/spark-submit
java/scala -cp "$SPARK_HOME/jars/*"
PySpark not covered in this presentation – but fun to experiment with and lots of good docs online for you

Increasing the number of threads available for Spark processing in local mode (5.2 GB text file) – actually works?
--master local[1]: real 3m45.328s
--master local[4]: real 1m31.889s
time {
  echo "--master local[1]"
  $SPARK_HOME/bin/spark-submit --master local[1] --class MyClass WordCount.jar
}
time {
  echo "--master local[4]"
  $SPARK_HOME/bin/spark-submit --master local[4] --class MyClass WordCount.jar
}
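To put a number on "actually works?", the two timings can be turned into a speedup figure. This is only a back-of-the-envelope sketch using the wall-clock times quoted on this slide; the value names are made up for illustration, and it can be pasted into the Scala REPL or spark-shell.

```scala
// Wall-clock times quoted on the slide, converted to seconds.
val oneThreadSecs  = 3 * 60 + 45.328 // --master local[1]
val fourThreadSecs = 1 * 60 + 31.889 // --master local[4]

// Speedup = old time / new time; efficiency = speedup / number of threads.
val speedup    = oneThreadSecs / fourThreadSecs
val efficiency = speedup / 4

println(f"Speedup: $speedup%.2fx, efficiency per thread: ${efficiency * 100}%.0f%%")
```

That works out at roughly 2.45 times faster on four threads, so well short of a perfect 4x: worth having, but not free scaling.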

“Anything else good about Spark?”

● Resiliency by replication and lineage tracking
● Distribution of processing via (potentially many) workers that can spawn (potentially many) executors
● Caching! Keep data in memory, reuse later
● Versatility and interoperability: APIs include Spark core, ML, DataFrames and Datasets, Streaming and GraphX ...
● Read up on RDDs and ML material by Andrew Ng, Spark Summit videos, deep dives on Catalyst/Tungsten if you want to really get stuck in! This is a DIY talk

Recap – we know what it is now...and want to do some analytics!

● Data I’ll process here is for educational purposes only: road_accidents.csv
● Kaggle is a good place to practice – lots of datasets available for you
● Data I'm using is licensed under the Open Government Licence for public sector information

Features of the data (“columns”)
"accident_index","vehicle_reference","vehicle_type","towing_and_articulation",
"vehicle_manoeuvre","vehicle_location","restricted_lane","junction_location",
"skidding_and_overturning","hit_object_in_carriageway","vehicle_leaving_carriageway",
"hit_object_off_carriageway","1st_point_of_impact","was_vehicle_left_hand_drive?",
"journey_purpose_of_driver","sex_of_driver","age_of_driver","age_band_of_driver",
"engine_capacity_(cc)","propulsion_code","age_of_vehicle","driver_imd_decile",
"driver_home_area_type","vehicle_imd_decile","NUmber_of_Casualities_unique_to_accident_index",
"No_of_Vehicles_involved_unique_to_accident_index","location_easting_osgr","location_northing_osgr",
"longitude","latitude","police_force","accident_severity","number_of_vehicles","number_of_casualties",
"date","day_of_week","time","local_authority_(district)","local_authority_(highway)",
"1st_road_class","1st_road_number","road_type","speed_limit","junction_detail","junction_control",
"2nd_road_class","2nd_road_number","pedestrian_crossing-human_control",
"pedestrian_crossing-physical_facilities","light_conditions","weather_conditions",
"road_surface_conditions","special_conditions_at_site","carriageway_hazards","urban_or_rural_area",
"did_police_officer_attend_scene_of_accident","lsoa_of_accident_location","casualty_reference",
"casualty_class","sex_of_casualty","age_of_casualty","age_band_of_casualty","casualty_severity",
"pedestrian_location","pedestrian_movement","car_passenger","bus_or_coach_passenger",
"pedestrian_road_maintenance_worker","casualty_type","casualty_home_area_type","casualty_imd_decile"

Values (“rows”)
"201506E098757",2,9,0,18,0,8,0,0,0,0,3,1,6,1,45,7,1794,1,11,-1,1,-1,1,2,384980,394830,-2.227629,53.450014,6,3,2,1,"42250",2,1899-12-30 12:56:00,102,"E08000003",5,0,6,30,3,4,6,0,0,0,1,1,1,0,0,1,2,"E01005288",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"201506E098766",1,9,0,9,0,8,0,0,0,0,4,1,6,2,25,5,1582,2,1,-1,-1,-1,1,2,383870,394420,-2.244322,53.446296,6,3,2,1,"14/03/2015",7,1899-12-30 15:55:00,102,"E08000003",3,5103,3,40,6,2,5,0,0,5,1,1,1,0,0,1,1,"E01005178",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

Type of vehicle involved in the most accidents?
Spark way to figure this out?
groupBy* vehicle_type
sort** the results on count
vehicle_type maps to a code
First place: car
Distant second: pedal bike
Close third: van/goods HGV <= 3.5 T
Distant last: electric motorcycle

What weather should I be avoiding?
Different column name this time, weather_conditions maps to a code again
groupBy* weather_conditions
First place: fine with no high winds
Second: raining, no high winds
Distant third: fine, with high winds
Distant last: snowing, high winds

Which manoeuvres should I be careful with?
Spark way...
groupBy* manoeuvre
manoeuvre maps to a code
First place: going ahead (!)
Distant second: turning right
Distant third: slowing or stopping
Last: reversing

“Why * and **?”
org.apache.spark functions that can run in a distributed manner
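As a rough illustration of the groupBy* and sort** pattern behind these three questions, here is a minimal sketch on a tiny in-memory DataFrame. The rows, the accident ids and the name `session` are invented for the example and stand in for the real CSV; it assumes Spark 2.x is on the classpath (for instance pasted into spark-shell). Note that `sort("count")` on its own orders ascending, so `desc` is used here to put "first place" at the top.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val session = SparkSession.builder().appName("GroupCountSketch").master("local[*]").getOrCreate()

// Invented rows: (accident id, vehicle_type code). The codes are illustrative only.
val accidents = session.createDataFrame(Seq(
  ("A1", 9), ("A2", 9), ("A3", 9), ("A4", 1), ("A5", 1), ("A6", 19)
)).toDF("accident_index", "vehicle_type")

val mostInvolved = accidents
  .groupBy("vehicle_type") // * grouping runs across the executors
  .count()                 // adds a "count" column per group
  .sort(desc("count"))     // ** largest count first

mostInvolved.show()
```

Swapping the column name for weather_conditions or the manoeuvre column gives the other two answers with the same three calls.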

Which age group gets in the most accidents?
Spark code example – I'm using Scala
● Forced mutability consideration (val or var)
● Not mandatory to declare types (or “return ...”)
● Check out “Scala for the Intrigued” on YouTube
● JVM based
Scala main method I’ll be using
object AccidentsExample {
  def main(args: Array[String]): Unit = {
  }
}

Spark entrypoint
val session = SparkSession.builder().appName("Accidents").master("local[*]")
val sqlContext = session.getOrCreate().sqlContext
Creating a DataFrame: API we’ll use to interact with data as though it’s in an SQL table
val allAccidents = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
  load(myHome + "/datasets/road_accidents.csv")
allAccidents.show would give us a table like...
accident_index  vehicle_reference  vehicle_type  towing_and_articulation
201506E098757   2                  9             0
201506E098766   1                  9             0
201506E098766 1 9 0

Group our data and save the result
...
val myAgeDF = groupCountSortAndShow(allAccidents, "age_of_casualty", true)
myAgeDF.coalesce(1).write.option("header", "true").format("csv").save("victims")
Runtime.getRuntime().exec("python plot_me.py")

def groupCountSortAndShow(df: DataFrame, columnName: String, toShow: Boolean): DataFrame = {
  val ourSortedData = df.groupBy(columnName).count().sort("count")
  if (toShow)
    ourSortedData.show()
  ourSortedData
}

“Hold on... what’s that getRuntime().exec stuff?!”

It’s calling my Python code to plot the CSV file
import glob, os, pandas
from bokeh.plotting import figure, output_file, show
path = r'victims'
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pandas.read_csv(f) for f in all_files)
df = pandas.concat(df_from_each_file, ignore_index=True)
plot = figure(plot_width=640,plot_height=640,title='Accident victims by age',
x_axis_label='Age of victim', y_axis_label='How many')
plot.title.text_font_size = '16pt'
plot.xaxis.axis_label_text_font_size = '16pt'
plot.yaxis.axis_label_text_font_size = '16pt'
plot.scatter(x=df.age_of_casualty, y=df['count'])
output_file('victims.html')
show(plot)

Bokeh gives us a graph like this

You’ve got some JSON files...
{"id":"2","name":"Louder Bill","average_rating":"4.1","genre":"ambient"}
{"id":"3","name":"Prey Fury","average_rating":"2","genre":"pop"}
{"id":"4","name":"Unbranded Newsroom","average_rating":"4","genre":"rap"}
{"id":"5","name":"Bugle Infantry","average_rating":"5", "genre": "doom_metal"}
{"id":"1","name":"Into Latch","average_rating":"4.9","genre":"doom_metal"}
“Best doom metal band please”
import org.apache.spark.sql.functions._
val bandsDF = sqlContext.read.json(myHome + "/datasets/bands.json")
bandsDF.createGlobalTempView("bands")
// Global temp views live in the global_temp database, so the query has to qualify the name
sqlContext.sql("SELECT name, average_rating from global_temp.bands WHERE " +
  "genre == 'doom_metal'").sort(desc("average_rating")).show(1)
+--------------+--------------+
|          name|average_rating|
+--------------+--------------+
|Bugle Infantry|             5|
+--------------+--------------+
only showing top 1 row
Randomly generated band names as of May the 18th 2017, zero affiliation on my behalf or IBM’s for any of these names...entirely coincidental if they do exist

“Great, but you mentioned some data collected with wearables and machine learning!”

Anonymised data gathered from Automatic, Apple Health, Withings, Jawbone Up
● Car journeys
● Sleeping activity (start and end time)
● Daytime activity (calories consumed, steps taken)
● Weight and heart rate
● Several CSV files
● Anonymised by subject gatherer before uploading anywhere! Nothing identifiable

Exploring the datasets: driving activity
val autoData = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema", "true").
load(myHome + "/datasets/geecon/automatic.csv").
withColumnRenamed("End Location Name", "Location").
withColumnRenamed("End Time", "Time")

Checking our data is sensible...
val colsWeCareAbout = Array(
  "Distance (mi)",
  "Duration (min)",
  "Fuel Cost (USD)")
for (col <- colsWeCareAbout) {
  summarise(autoData, col)
}
def summarise(df: DataFrame, columnName: String): Unit = {
  averageByCol(df, columnName)
  minByCol(df, columnName)
  maxByCol(df, columnName)
}
def averageByCol(df: DataFrame, columnName: String): Unit = {
  println("Printing the average " + columnName)
  df.agg(avg(df.col(columnName))).show()
}
def minByCol(df: DataFrame, columnName: String): Unit = {
  println("Printing the minimum " + columnName)
  df.agg(min(df.col(columnName))).show()
}
def maxByCol(df: DataFrame, columnName: String): Unit = {
  println("Printing the maximum " + columnName)
  df.agg(max(df.col(columnName))).show()
}
Average distance (in miles): 6.88, minimum: 0.01, maximum: 187.03
Average duration (in minutes): 14.87, minimum: 0.2, maximum: 186.92
Average fuel Cost (in USD): 0.58, minimum: 0.0, maximum: 14.35
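For a quick sanity check like this, DataFrames also have a built-in `describe` call that reports count, mean, standard deviation, minimum and maximum in one go. The sketch below is not the code used in the talk: the journeys, the simplified column names and the name `session` are invented for illustration, and it assumes Spark 2.x on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().appName("DescribeSketch").master("local[*]").getOrCreate()

// Made-up journeys, standing in for automatic.csv
val journeys = session.createDataFrame(Seq(
  (0.27, 1.52),
  (0.10, 0.71),
  (6.50, 14.0)
)).toDF("distance_mi", "duration_min")

// One row each for count, mean, stddev, min and max of the named columns
val summary = journeys.describe("distance_mi", "duration_min")
summary.show()
```

Writing the helpers by hand, as above, is still useful when you only want one statistic or want to print your own message alongside it.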

Looks OK - what’s the rate of Mr X visiting a certain place? Got a favourite gym day? Slacking on certain days?
● Using Spark to determine chance of the subject being there
● Timestamps (the “Time” column) need to become days of the week instead
● The start of a common theme: data preparation!

Explore the data first
autoData.show() ...
|Vehicle|Start Location Name|Start Time|Location|Time|Distance (mi)|Duration (min)|Fuel Cost (USD)|Average MPG|Fuel Volume (gal)|Hard Accelerations|Hard Brakes|Duration Over 70 mph (secs)|Duration Over 75 mph (secs)|Duration Over 80 mph (secs)|Start Location Accuracy (meters)|End Location Accuracy (meters)|Tags|
...
|2005 Nissan Sentra|PokeStop 12|4/3/2016 15:06|PokeStop 12|4/3/2016 15:07|0.27|1.52|0.04|13.64|0.03|0|0|0|0|0|5.0|5.0|null|
|2005 Nissan Sentra|PokeStop 12|4/3/2016 15:17|PokeStop 12|4/3/2016 15:18|0.1|0.71|0.01|17.64|0.0|0|0|0|0|0|5.0|5.0|null|

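Timestamp fun: 4/03/2016 15:06 is no good!
// Assumed step, not shown on the slide: register the DataFrame so SQL can see it
autoData.createOrReplaceTempView("auto_data")
val preparedAutoData = sqlContext.sql(
  "SELECT TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)) as Date, Location, " +
  "date_format(TO_DATE(CAST(UNIX_TIMESTAMP(Time, 'MM/dd/yyyy') AS TIMESTAMP)), 'EEEE') as Day FROM auto_data")
preparedAutoData.show()
// Assumed step, not shown on the slide: the later queries read from this view name
preparedAutoData.createOrReplaceTempView("prepared_auto_data")
+----------+-----------+------+
|      Date|   Location|   Day|
+----------+-----------+------+
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|PokeStop 12|Sunday|
|2016-04-03|   Michaels|Sunday|
...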

Scala function: give us all of the rows where the day is what we specified
def printChanceLocationOnDay(
    sqlContext: SQLContext, day: String, location: String): Unit = {
  val allDatesAndDaysLogged = sqlContext.sql(
    "SELECT Date, Day " +
    "FROM prepared_auto_data " +
    "WHERE Day = '" + day + "'").distinct()
  allDatesAndDaysLogged.show()
+----------+------+
|      Date|   Day|
+----------+------+
|2016-10-17|Monday|
|2016-10-24|Monday|
|2016-04-25|Monday|
|2017-03-27|Monday|
|2016-08-15|Monday|
...

Rows where the location and day matches our query (passed in as parameters)
  val visits = sqlContext.sql(
    "SELECT * FROM prepared_auto_data " +
    "WHERE Location = '" + location + "' AND Day = '" + day + "'")
  visits.show()
+----------+--------+------+
|      Date|Location|   Day|
+----------+--------+------+
|2016-04-04|     Gym|Monday|
|2016-11-14|     Gym|Monday|
|2017-01-09|     Gym|Monday|
|2017-02-06|     Gym|Monday|
  // Visits divided by days logged: 4 Gym Mondays out of 51 logged Mondays gives 7%
  val rate = Math.floor((visits.count().toDouble / allDatesAndDaysLogged.count().toDouble) * 100).toInt
  println(rate + "% rate of being at the location '" + location + "' on " + day +
    ", activity logged for " + allDatesAndDaysLogged.count() + " " + day + "s")
}

val days = Array("Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday")
for (day <- days) {
  printChanceLocationOnDay(sqlContext, day, "Gym")
}
● 7% rate of being at the location 'Gym' on Monday, activity logged for 51 Mondays
● 1% rate of being at the location 'Gym' on Tuesday, activity logged for 51 Tuesdays
● 2% rate of being at the location 'Gym' on Wednesday, activity logged for 49 Wednesdays
● 6% rate of being at the location 'Gym' on Thursday, activity logged for 47 Thursdays
● 7% rate of being at the location 'Gym' on Saturday, activity logged for 41 Saturdays
● 9% rate of being at the location 'Gym' on Sunday, activity logged for 41 Sundays
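The percentage itself is plain arithmetic once Spark has produced the two counts. A minimal sketch, with no Spark needed: the function name `visitRate` and the guard for a day with nothing logged are my own additions for illustration, and the numbers are the four Monday Gym visits and 51 logged Mondays from the slides.

```scala
// Percentage of logged days on which the location was visited, rounded down.
// Returns 0.0 when nothing was logged, to avoid dividing by zero.
def visitRate(visits: Long, daysLogged: Long): Double =
  if (daysLogged == 0) 0.0
  else Math.floor(visits.toDouble / daysLogged * 100)

// 4 Gym visits on Mondays out of 51 Mondays with any activity logged
println(visitRate(4, 51) + "% rate")
```

Keeping the visits on top of the fraction matters: flipping the division would report how many logged days there are per visit, not the chance of a visit.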

How about correlations?
Which feature(s) are closely related to another - e.g. the time spent asleep?
Dataset has these features from Jawbone
● s_duration (the sleep time as well...)
● m_active_time
● m_calories
● m_distance
● m_steps
● m_total_calories
● n_bedtime (hmm)
● n_awake_time

val compareToCol = "s_duration"
for (col <- sleepData.columns) {
  if (!col.equals(compareToCol)) { // don’t compare to itself...
    val corr = sleepData.stat.corr(col, compareToCol)
    if (corr > 0.8) {
      println("Very strong positive correlation for " + col + " and " + compareToCol)
    } else if (corr >= 0.5) {
      println("Positive correlation for " + col + " and " + compareToCol)
    }
  }
}
Very strong positive correlation for n_bedtime and s_asleep_time
And something we know isn’t related?
val shouldBeLow = sleepData.stat.corr("goal_body_weight", "s_duration")
println("Correlation between goal body weight and sleep duration: " + shouldBeLow)
Correlation between goal_body_weight and s_asleep time: -0.02
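Spark's `stat.corr` computes the Pearson correlation coefficient, a value between -1 and 1. To show what that number means, here is a small plain-Scala sketch of the same formula on two short sequences. The function name and the sample values are invented for illustration, and this is not how Spark computes it internally.

```scala
// Pearson correlation coefficient of two equally long, non-empty sequences.
// Returns NaN if either sequence is constant (zero spread).
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "need two non-empty sequences of equal length")
  val meanX = xs.sum / xs.length
  val meanY = ys.sum / ys.length
  val sumOfProducts = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val spreadX = math.sqrt(xs.map(x => math.pow(x - meanX, 2)).sum)
  val spreadY = math.sqrt(ys.map(y => math.pow(y - meanY, 2)).sum)
  sumOfProducts / (spreadX * spreadY)
}

// Made-up values: steps and distance rise together, so the result is close to 1
val steps = Seq(1000.0, 4000.0, 8000.0, 12000.0)
val distanceKm = Seq(0.8, 3.1, 6.2, 9.5)
println(pearson(steps, distanceKm))
```

Values near 1 mean the two features rise together, values near -1 mean one falls as the other rises, and values near 0 (like the -0.02 above) mean no linear relationship.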

“...can Spark help me to get a good sleep?”

Need to define a good sleep first
8 hours for this test subject
If duration is > 8 hours
good sleep = true, else false
I’m using 1 for true and 0 for false
We will label this data soon so remember this
Then we’ll determine the most influential features on the value being true or false. This can reveal the interesting stuff!

Sanity check first: any good sleeps for Mr X?
val onlyDurations = sleepData.select("s_duration")
// Don't care if the sleep duration wasn't even recorded or it's 0
val onlyRecordedSleeps = onlyDurations.filter($"s_duration" > 0)
val NUM_HOURS = 8
val THRESHOLD = 60 * 60 * NUM_HOURS
val onlyGoodSleeps = onlyDurations.filter($"s_duration" >= THRESHOLD)
println("Found " + onlyRecordedSleeps.count() + " valid recorded " +
  "sleep times and " + onlyGoodSleeps.count() + " were " + NUM_HOURS + " or more hours in duration")
Found 538 valid recorded sleep times and 129 were 8 or more hours in duration

We will use machine learning: but first...
1) What do we want to find out?
Main contributing factors to a good sleep
2) Pick an algorithm
3) Prepare the data
4) Separate into training and test data
5) Build a model with the training data (in parallel using Spark!)
6) Use that model on the test data
7) Evaluate the model
8) Experiment with parameters until reasonably accurate e.g. N iterations

Which algorithms might be of use?
Clustering algorithms such as K-means (unsupervised learning (no labels, cheap))
● Produce n clusters from data to determine which cluster a new item can be categorised as
● Identify anomalies: transaction fraud, erroneous data
Recommendation algorithms such as Alternating Least Squares
● Movie recommendations on Netflix?
● Recommended purchases on Amazon?
● Similar songs with Spotify?
● Recommended videos on YouTube?
Classification algorithms such as Logistic regression
● Create model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit)
● Healthcare: predict adverse drug reactions based on known interactions with similar drugs
● Spam filter (binomial classification)
● Naive Bayes

What does “Naive Bayes” have to do with my sleep quality?
Using evidence provided, guess what a label will be (1 or 0) for us: easy to use with some training data
bayes_data.txt (libSVM format)
0 = the label (category 0 or 1)
e.g. 0 = low scoring athlete, 1 = high scoring
1:x = the score for a sporting event 1
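Each line of a libSVM file is a label followed by index:value pairs separated by spaces. To make that concrete, here is a small plain-Scala sketch that parses one such line. The case class, the function name and the sample line are all invented for illustration (the real bayes_data.txt is not shown in the slides); Spark's own "libsvm" reader does this for you.

```scala
// One parsed line: the label plus (1-based feature index -> value) pairs
case class LabelledPoint(label: Double, features: Map[Int, Double])

def parseLibSvmLine(line: String): LabelledPoint = {
  val tokens = line.trim.split("\\s+")
  val features = tokens.tail.map { token =>
    val Array(index, value) = token.split(":")
    index.toInt -> value.toDouble
  }.toMap
  LabelledPoint(tokens.head.toDouble, features)
}

// Hypothetical line: label 1 (high scoring athlete) with scores for events 1, 2 and 3
val parsed = parseLibSvmLine("1 1:9.5 2:8.0 3:7.25")
println(parsed)
```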

Read it in, split it, fit it, transform and evaluate – all on one slide with Spark!
val bayesData = sqlContext.read.format("libsvm").load("bayes_data.txt")
val Array(trainingData, testData) = bayesData.randomSplit(Array(0.7, 0.3))
val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy = " + accuracy)
Test set accuracy = 0.82
https://spark.apache.org/docs/2.1.0/mllib-naive-bayes.html
“Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive
Bayes can be trained very efficiently. Within a single pass to the training data, it computes the conditional probability distribution of each
feature given label, and then it applies Bayes’ theorem to compute the conditional probability distribution of label given an observation
and use it for prediction.”

Naive Bayes correctly classifies the data (giving it the right labels)
Feed some new data in for the model...

“Can I just use Naive Bayes on all of the sleep data?”

1) didn’t label each row in the DataFrame yet
2) Naive Bayes can’t handle our data in the current form
3) too many useless features

Labelling each row according to our “good sleep” criteria
Possibilities – bear in mind that DataFrames are immutable, can't modify elements directly...
1) Spark has a .map function, how about that?
“map is a transformation that passes each dataset element through a function and returns a new RDD representing the results” - http://spark.apache.org/docs/latest/programming-guide.html
● Removes all other columns in my case... (new DataFrame with just the labels!)
2) Running a user defined function on each row?
● Maybe, but can Spark’s internal SQL optimiser “Catalyst” see and optimise it? Probably slow

“If duration is > 8 hours
good sleep = true, else false
I’m using 1 for true and 0 for false”
Preparing the labels
val labelledSleepData = sleepData.
  withColumn("s_duration", when(col("s_duration") > THRESHOLD, 1).
  otherwise(0))
Preparing the features is easier
val assembler = new VectorAssembler()
  .setInputCols(sleepData.columns)
  .setOutputCol("features")
val preparedData = assembler.transform(labelledSleepData).
  withColumnRenamed("s_duration", "good_sleep")

Trying to fit a model to the DataFrame now leads to...

The problem...
s_asleep_time and n_bedtime (integers)
API docs: “Time user fell asleep. Seconds to/from midnight. If negative, subtract from midnight. If positive, add to midnight”
Solution in this example?
Change to positives only
Add the number of seconds in a day to whatever s_asleep_time's value is. Think it through properly when you try this if you’re done experimenting and want something reliable to use!

New DataFrame where negative values are handled
val preparedSleepAsLabel = preparedData.withColumnRenamed("good_sleep", "label")
val secondsInDay = 24 * 60 * 60
val toModel = preparedSleepAsLabel.
  withColumn("s_asleep_time", (col("s_asleep_time")) + secondsInDay).
  withColumn("s_bedtime", (col("s_bedtime")) + secondsInDay)
toModel.createOrReplaceTempView("to_model_table")
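If you do want to think it through further, one alternative to adding a whole day to every value is to wrap the offset into a time of day with modular arithmetic. This is a different technique from the one used in the talk, shown only as a plain-Scala sketch; the function name is invented for illustration.

```scala
val secondsInDay = 24 * 60 * 60

// Map "seconds to/from midnight" onto the range 0 until secondsInDay.
// Math.floorMod keeps the result non-negative even for negative input.
def secondsAfterMidnight(offset: Int): Int = Math.floorMod(offset, secondsInDay)

println(secondsAfterMidnight(-3600)) // one hour before midnight
println(secondsAfterMidnight(1800))  // half an hour after midnight
```

The trade-off: adding a day to everything keeps 23:00 ordered before 00:30, while wrapping gives a true clock time but makes a bedtime just after midnight look "earlier" than one just before it. Which is right depends on what the model should learn.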

Reducing your “feature space”
Spark’s ChiSqSelector algorithm will work here
We want labels and features to inspect

Using ChiSq selector to get the top features
val selector = new ChiSqSelector()
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("good_sleep")
  .setOutputCol("selected_features")
val model = selector.fit(preparedData)
val topFeatureIndexes = model.selectedFeatures
// Start at index 0 so that all ten selected features are printed
for (i <- topFeatureIndexes.indices) {
  // Get col names based on feature indexes
  println(preparedData.columns(topFeatureIndexes(i)))
}
“Feature selection tries to identify relevant features for use in model construction. It reduces the size of the feature space, which can improve both speed and
statistical learning behavior. ChiSqSelector implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses
the Chi-Squared test of independence to decide which features to choose. It supports three selection methods: numTopFeatures, percentile, fpr:
numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.”
https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html#chisqselector

Transform values into a “features” column and only select columns we identified as influential
Earlier we did...
toModel.createOrReplaceTempView("to_model_table")
val onlyInterestingColumns = sqlContext.sql("SELECT label, " + colNames.toString() +
  " FROM to_model_table")
val theAssembler = new VectorAssembler()
  .setInputCols(onlyInterestingColumns.columns)
  .setOutputCol("features")
val thePreparedData = theAssembler.transform(onlyInterestingColumns)

Top ten influential features (most to least influential)
Feature          Description from Jawbone API docs
s_count          Number of primary sleep entries logged
s_awake_time     Time the user woke up
s_quality        Proprietary formula, don't know
s_asleep_time    Time when the user fell asleep
s_bedtime        Seconds the device is in sleep mode
s_deep           Seconds of main “sound sleep”
s_light          Seconds of “light sleeps” during the sleep period
m_workout_time   Length of logged workouts in seconds
n_light          Seconds of light sleep during the nap
n_sound          Seconds of sound sleep during the nap

And after all that...we can generate predictions!
val Array(trainingSleepData, testSleepData) = thePreparedData.randomSplit(Array(0.7, 0.3))
val sleepModel = new NaiveBayes().fit(trainingSleepData)
val predictions = sleepModel.transform(testSleepData)
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test set accuracy for labelled sleep data = " + accuracy)
Test set accuracy for labelled sleep data = 0.81 ...

Testing it with new data
val somethingNew = sqlContext.createDataFrame(Seq(
  // Good sleep: high workout time, achieved a good amount of deep sleep, went to bed
  // after midnight and woke at almost noon!
  (0, Vectors.dense(0, 1, 42600, 100, 87659, 85436, 16138, 22142, 4073, 0)),
  // Bad sleep, woke up early (5 AM), didn't get much of a deep sleep, didn't workout,
  // bedtime 10.20 PM
  (0, Vectors.dense(0, 0, 18925, 0, 80383, 80083, 6653, 17568, 0, 0))
)).toDF("label","features")
sleepModel.transform(somethingNew).show()

Sensible model created with outcomes we’d expect
Go to bed earlier, exercise more
I could have looked more closely at removing the s_ (sleep) variables so
that only the m_ and diet information remains; an exercise for the reader
Algorithms are producing these outcomes
without domain specific knowledge

Last example: “does weighing more result in a higher heart rate?”
Will get the average of all the heart rates logged on a day when
weight was measured
Lower heart rate day = weight was more?
Higher rate day = weight was less?
Maybe MLlib again? But all that preparation work...
How deeply involved with Spark do we usually
need to get?

More data preparation needed, but there’s a twist
Here I use data from two tables: weights, activities
Times are removed as we only care about dates, so the weights table becomes
+----------+------+
|      Date|weight|
+----------+------+
|2017-04-09| 220.4|
|2017-04-08| 219.9|
|2017-04-07| 221.0|
+----------+------+
only showing top 3 rows
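A minimal sketch of stripping the time from each reading, assuming the raw data has a timestamp string column. The column name `timestamp` and the times of day are hypothetical; only the dates and weights come from the table above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().master("local[*]").appName("dates-only-sketch").getOrCreate()
import spark.implicits._

// Hypothetical raw readings: the times of day are made up for illustration
val rawWeights = Seq(
  ("2017-04-09 07:31:00", 220.4),
  ("2017-04-08 07:45:00", 219.9)
).toDF("timestamp", "weight")

// to_date keeps only the date part, giving a DateType column
val weightsByDate = rawWeights
  .withColumn("Date", to_date(col("timestamp")))
  .select("Date", "weight")

weightsByDate.show()
```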

Include only heart beat readings when we have
weight(s) measured: join on date used
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.3|                  77.0|
|2017-02-09| 215.9|                  97.0|
|2017-02-09| 215.9|                 104.0|
|2017-02-09| 215.9|                  88.0|
+----------+------+----------------------+
...
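The date join can be sketched as follows. This is an illustration under assumptions rather than the code used in the talk: the DataFrame names are made up, and the 2017-02-01 and 2017-02-10 rows are hypothetical additions that show how an inner join drops days missing from either table.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Weights: one row per weigh-in (2017-02-01 is a hypothetical extra row)
val weights = Seq(
  ("2017-02-13", 220.3),
  ("2017-02-09", 215.9),
  ("2017-02-01", 214.0)
).toDF("Date", "weight")

// Heart rates: several readings per day (2017-02-10 is a hypothetical extra row)
val heartRates = Seq(
  ("2017-02-13", 79.0),
  ("2017-02-13", 77.0),
  ("2017-02-09", 97.0),
  ("2017-02-09", 104.0),
  ("2017-02-09", 88.0),
  ("2017-02-10", 90.0)
).toDF("Date", "heart_beats_per_minute")

// Inner join on Date: only days present in both tables survive
val joined = weights.join(heartRates, Seq("Date"))
joined.show()
```

Passing the join column as `Seq("Date")` keeps a single `Date` column in the result instead of one from each side.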

Average the rate and weight readings by day
+----------+------+----------------------+
|      Date|weight|heart_beats_per_minute|
+----------+------+----------------------+
|2017-02-13| 220.3|                  79.0|
|2017-02-13| 220.7|                  77.0|
+----------+------+----------------------+
...
Should become this:
+----------+----------+--------------------------+
|      Date|avg weight|avg_heart_beats_per_minute|
+----------+----------+--------------------------+
|2017-02-13|     220.5|                        78|
+----------+----------+--------------------------+
...
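A minimal sketch of the per-day averaging, using the two 2017-02-13 rows shown above. The DataFrame and alias names are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("daily-average-sketch").getOrCreate()
import spark.implicits._

// The two readings for 2017-02-13 from the table above
val readings = Seq(
  ("2017-02-13", 220.3, 79.0),
  ("2017-02-13", 220.7, 77.0)
).toDF("Date", "weight", "heart_beats_per_minute")

// One output row per day, holding the mean of each measurement
val dailyAverages = readings
  .groupBy("Date")
  .agg(
    avg("weight").as("avg_weight"),
    avg("heart_beats_per_minute").as("avg_heart_beats_per_minute")
  )

dailyAverages.show()
```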

DataFrame now looks like this...
+----------+---------------------------+-----------+
|      Date|avg(heart_beats_per_minute)|avg(weight)|
+----------+---------------------------+-----------+
|2016-04-25|                  85.933...|  196.46...|
|2017-01-06|                 93.8125...|      216.0|
|2016-05-03|                  83.647...|  198.35...|
|2016-07-26|                  84.411...|  192.69...|
+----------+---------------------------+-----------+
Something we can quickly plot!

Bokeh used again, no more analysis required

Used the same functions as earlier (groupBy, formatting dates) and
also a join. Same plotting with different column names. No distinct
correlation identified so moved on
Still lots of questions we could answer with Spark using this data
● Any impact on mpg when the driver weighs much less than before?
● Which fuel provider gives me the best mpg?
● Which visited places have a positive effect on subject’s weight?
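The talk judged the heart rate versus weight relationship from a plot. As a complementary check, Spark can also compute a Pearson correlation coefficient directly. The sketch below is not from the talk: the daily values are hypothetical and the column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("correlation-sketch").getOrCreate()
import spark.implicits._

// Hypothetical daily averages, not the real readings from the talk
val dailySketch = Seq(
  ("2016-04-25", 86.0, 196.5),
  ("2016-05-03", 83.5, 198.5),
  ("2016-07-26", 84.5, 192.5),
  ("2017-01-06", 94.0, 216.0)
).toDF("Date", "avg_heart_rate", "avg_weight")

// Pearson correlation: near +1 or -1 suggests a linear relationship,
// near 0 suggests none
val pearson = dailySketch.stat.corr("avg_heart_rate", "avg_weight")
println("Pearson correlation = " + pearson)
```

A coefficient from a handful of days says very little, so a plot of the full data remains the quicker sanity check.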

● Analytics doesn’t need to be complicated: Spark’s good for the heavy lifting
● Sometimes best to just plot as you go – saves plenty of time
● There are other, harder things to worry about. Writing a distributed machine learning algorithm shouldn’t be one of them!

“Which tools can I use to answer
my questions?”
This question becomes easier

Still got to think about...
Infrastructure when you’re ready to scale beyond your laptop
● Setting up a huge HA cluster: a talk on its own
● Who sets up then maintains the machines? Automate it all?
● How many machines do you need? RAM/CPUs?
● Who ensures all software is up to date (CVEs?)
● Access control lists?
● Hosting costs/providers?
● Reliability, fault tolerance, backup procedures...

And if you want to really show off with Spark
● Use GPUs to train models faster
● DeepLearning4J?
● Writing your own kernels/C/JNI code (or a Java API like CUDA4J/Aparapi?)
● Use RDMA to reduce network transfer times
● Zero copy: RoCE or InfiniBand?
● Tune the JDK, the OS, the hardware
● Continuously evaluate performance: Spark itself, use -Xhealthcenter, your own metrics, various libraries...
● Go tackle something huge – join the alien search
● Combine Spark Streaming with MLlib to gain insights fast
● More informed decision making

Recap – you should now...
● Know more about Spark: what it can and can’t do (new project ideas?)
● Know more about machine learning in Spark
● Know that machine learning’s still hard but in different ways
  Data preparation, handling junk, knowing what to look for
  Getting the data in the first place
  Writing the algorithms to be used in Spark?

Points to take home...
● Built-in Spark functions are aplenty – try and stick to these
● You can easily plot your results by saving to a csv/json and using your existing favourite plotting libraries
● DataFrame (or Datasets) combined with ML = powerful APIs
● Filter your data – decide how to handle nulls!
● Pick and use a suitable ML algorithm
● Plot results
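Two of the points above, handling nulls and saving results for an external plotting library, can be sketched as follows. The readings, the null row and the temporary output directory are all hypothetical. Dropping the row is only one option; `na.fill` is another.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("nulls-and-csv-sketch").getOrCreate()
import spark.implicits._

// Hypothetical readings: None becomes a null weight in the DataFrame
val readingsWithGaps = Seq[(String, Option[Double])](
  ("2017-02-13", Some(220.3)),
  ("2017-02-12", None),
  ("2017-02-09", Some(215.9))
).toDF("Date", "weight")

// One way to handle nulls: drop any row whose weight is missing
val withoutNulls = readingsWithGaps.na.drop(Seq("weight"))

// Save as CSV with a header row so a plotting library can read it later.
// Spark writes a directory of part files, not a single file.
val outputDir = Files.createTempDirectory("spark-csv-sketch").resolve("weights").toString
withoutNulls.write.option("header", "true").csv(outputDir)
println("CSV written to " + outputDir)
```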

Final points to consider...
Where would Spark fit into your systems? As a replacement or a
supplement?
Give it a try with your own data and you might be surprised by
the outcome
It’s free and open source with a very active community!
Contact me directly: aroberts@uk.ibm.com

How did I access the data to process?
● Automatic: log into the Automatic Dashboard https://dashboard.automatic.com/, on the bottom right, click export, choose what data you want to export (e.g. All)
● Fuelly: (obtained Gas Cubby), log into the Fuelly Dashboard http://www.fuelly.com/dashboard, select your vehicle in Your Garage, scroll down to vehicle logs, select Export Fuel-ups or Export Services, select duration of export
● Jawbone: sign into your account at https://jawbone.com/, click on your name on the top right, choose Settings, click on the Accounts tab, scroll down to Download UP Data, choose which year you'd like to download data for

● Withings: log into the Withings Dashboard https://healthmate.withings.com, click Measurement table, click the tab corresponding to the data you want to export, click download. You can go here to download all data instead: https://account.withings.com/export/
● Apple: launch the Health app, navigate to the Health Data tab, select your account in the top right area of your screen, select Export Health Data
● Remember to remove any sensitive personal information before sharing/showing/storing said data elsewhere! I am dealing with “cleansed” datasets with no SPI
