Energy Analytics
with Spark
SRI KRISHNAMURTHY
QUANTUNIVERSITY LLC.
QUANTUNIVERSITY@GMAIL.COM
2015 Copyright QuantUniversity LLC.
- ADVISORY SERVICES
- CUSTOM TRAINING PROGRAMS
- PLATFORM FOR LARGE SCALE SIMULATIONS AND ANALYTICS
- ARCHITECTURE REVIEW, TRAINING AND AUDITS
SOON ANALYTICS CERTIFICATE PROGRAM!
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services customers
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University
Sri Krishnamurthy
Founder and CEO
SPEAKER BIO
Agenda
1. Energy Analytics 101
2. A quick introduction to Apache Spark
3. Fundamentals and setup
4. Energy Analytics use-cases
Files for today’s workshop
http://bit.ly/1F6E91O
What’s missing in this picture ?
Analytics of the 1990s
Actionable analytics enables engagement!
1. What if energy companies can understand
customer’s usage better?
2. What if energy companies can understand
what drives customer energy usage ?
3. What if energy companies engage with
customers by providing actionable
analytics so that customers can monitor
and plan for wide swings in energy usage?
Customer demand isn’t uniform
What changed from Jan 2nd to Jan 3rd ?
Does this customer heavily use energy from 11-9pm ?
Problem
1. Lots of data
2. Utilities can have more than 100K customers.
3. Truly a big data problem with multiple dimensions and dirty data
Use case 1 : Customer Segmentation
Segmenting customers
- How can we segment customers into groups ?
- Typically Clustering algorithms like K-means are used
What is K-means ?
http://shabal.in/visuals/kmeans/2.html
Use-case 2 – Load Forecasting
Given parameters like Temperature, day of week, month, time of day, can we predict load ?
Typically methods like Regression are used
Load = Function of (Temperature, day of week, month, time of day etc)
What is Spark ?
Apache Spark™ is a fast and general engine for large-scale data processing.
Came out of U.C. Berkeley’s AMP Lab
Lightning-fast cluster computing
Why Spark ?
Speed
Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that
supports acyclic data flow and in-memory computing.
Why Spark ?
# `spark` here is a SparkContext (named `sc` in the pyspark shell)
text_file = spark.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
Word count in Spark's Python API
Ease of Use
• Write applications quickly in Java, Scala, Python, or R.
• Spark offers over 80 high-level operators that
make it easy to build parallel apps. And you can
use it interactively from the Scala and Python
shells.
• R support recently added
Why Spark ?
Generality
Combine SQL, streaming, and complex
analytics.
Spark powers a stack of high-level tools
including:
1. Spark Streaming: processing real-time data
streams
2. Spark SQL and DataFrames: support for
structured data and relational queries
3. MLlib: built-in machine learning library
4. GraphX: Spark’s new API for graph
processing
Why Spark?
Runs Everywhere
Spark runs on Hadoop, Mesos,
standalone, or in the cloud. It can access
diverse data sources including HDFS,
Cassandra, HBase, and S3.
You can run Spark using its standalone
cluster mode, on EC2, on Hadoop YARN,
or on Apache Mesos. Access data
in HDFS, Cassandra, HBase, Hive, Tachyon,
and any Hadoop data source.
Key Features of Spark
• handles batch, interactive, and real-time within a single
framework
• native integration with Java, Python, Scala, R
• programming at a higher level of abstraction
• more general: map/reduce is just one set of supported
constructs
Secret Sauce : RDD, Transformation,
Action
How does it work?
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a
fault-tolerant collection of elements that can be operated on in parallel
Transformations create a new dataset from an existing one. All transformations
in Spark are lazy: they do not compute their results right away – instead they
remember the transformations applied to some base dataset
Actions return a value to the driver program after running a computation on the
dataset
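The distinction between lazy transformations and actions can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: `ToyRDD`, `flat_map`, and the helper `tag` are hypothetical names that mirror the shape of the RDD API, and everything runs in a single process.

```python
# Toy model of lazy transformations and eager actions.
# NOT Spark's implementation: ToyRDD and its methods are hypothetical
# names used only to illustrate the idea on a single machine.


class ToyRDD:
    def __init__(self, source, steps=()):
        self._source = source   # base dataset
        self._steps = steps     # recorded transformations, not yet executed

    # Transformations: record the step and return a new ToyRDD.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def flat_map(self, fn):
        return ToyRDD(self._source, self._steps + (("flat_map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._steps + (("filter", fn),))

    # Actions: replay the recorded steps and return a value to the caller.
    def collect(self):
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(x) for x in data]
            elif kind == "flat_map":
                data = [y for x in data for y in fn(x)]
            elif kind == "filter":
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        return len(self.collect())


calls = []  # records each time the mapped function actually runs


def tag(word):
    calls.append(word)
    return (word, 1)


lines = ToyRDD(["low load", "peak load"])
pairs = lines.flat_map(lambda line: line.split()).map(tag)  # nothing runs yet
calls_before_action = len(calls)
result = pairs.collect()  # the action triggers the recorded work
```

In this sketch `calls` stays empty until `collect()` is invoked, which is the behaviour the slide describes: transformations only remember what to do, and an action forces the computation.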
How is Spark different?
Map – Reduce : Hadoop
Problems with this MR model
Difficult to code
Quick Demo
Test_Notebook.ipynb
Spark Setup
1. Get Spark 1.5 from http://spark.apache.org/
Select the latest Spark release (1.5), a prebuilt package for Hadoop 2.6, and download directly.
Windows
TEST_SPARK_ENV = C:\spark-1.5.0-bin-hadoop2.6
SPARK_HOME = %TEST_SPARK_ENV%
Add %TEST_SPARK_ENV%\bin to path
Unix/MAC
export SPARK_HOME=/srv/spark
export PATH=$SPARK_HOME/bin:$PATH
Set up IPython Notebook
Install the latest Anaconda distribution from https://store.continuum.io/cshop/anaconda/
ipython profile create spark1.5.0
Go to:
C:\Users\sri-dell\.ipython\profile_spark1.5.0\startup and create a file:
00-spark1.5.0-setup.py
Add this
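The snippet shown on the original slide is not preserved in this transcript. As a minimal sketch of what such a startup file usually does, the code below puts the PySpark sources on `sys.path`. It assumes that `SPARK_HOME` points at an unpacked Spark distribution and that its `python/lib` directory contains a `py4j-*-src.zip` archive; the helper name `spark_python_paths` is hypothetical.

```python
# Sketch of an IPython startup file such as 00-spark1.5.0-setup.py.
# Assumptions: SPARK_HOME points at an unpacked Spark distribution and
# python/lib inside it holds a py4j-*-src.zip archive.
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and archives PySpark needs on sys.path."""
    paths = [os.path.join(spark_home, "python")]
    # The py4j version differs between Spark releases, so match by pattern.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    paths += sorted(glob.glob(pattern))
    return paths


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    # Insert in reverse so the final sys.path order matches the list order.
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
```

With these entries in place, `import pyspark` should resolve inside the notebook; creating the SparkContext is left to the notebook itself in this sketch.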
Start IPython Notebook (Jupyter)
cd c:\spark_workshop
ipython notebook --profile=spark1.5.0
Example 1
Test_Notebook.ipynb
Appendix
In Windows, you may see this:
See https://issues.apache.org/jira/browse/SPARK-2356
To resolve this:
Get winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and
put it in SPARK_HOME
Add the following environment variables
set HADOOP_HOME=%SPARK_HOME%
set HADOOP_CONF_DIR=%SPARK_HOME%
To reduce logging messages
Copy
%SPARK_HOME%/conf/log4j.properties.template
to
%SPARK_HOME%/conf/log4j.properties
Replace INFO with WARN
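The copy-and-edit step above can also be scripted. The sketch below assumes the template contains a root logger line of the form `log4j.rootCategory=INFO, console` and changes only that line, leaving per-package logger levels untouched; `quiet_log4j` and `write_quiet_log4j` are hypothetical helper names.

```python
# Sketch: build log4j.properties from the template, lowering only the
# root logger from INFO to WARN. Helper names are hypothetical.
import os


def quiet_log4j(template_text):
    """Return the template text with the root logger level set to WARN."""
    out = []
    for line in template_text.splitlines():
        if line.startswith("log4j.rootCategory="):
            line = line.replace("INFO", "WARN", 1)
        out.append(line)
    return "\n".join(out) + "\n"


def write_quiet_log4j(spark_home):
    """Read conf/log4j.properties.template and write conf/log4j.properties."""
    conf_dir = os.path.join(spark_home, "conf")
    with open(os.path.join(conf_dir, "log4j.properties.template")) as src:
        text = quiet_log4j(src.read())
    with open(os.path.join(conf_dir, "log4j.properties"), "w") as dst:
        dst.write(text)
```

Calling `write_quiet_log4j(os.environ["SPARK_HOME"])` once after installation would produce the quieter configuration.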
Key Features used in Spark
1. DataFrames
◦ http://spark.apache.org/docs/latest/sql-programming-guide.html
2. pyspark.ml API
◦ http://spark.apache.org/docs/latest/api/python/index.html
Use case 1 : Customer Segmentation
Segmenting customers
- How can we segment customers into groups ?
- Typically Clustering algorithms like K-means are used
What is K-means ?
http://shabal.in/visuals/kmeans/2.html
K-means
Given a set of observations (x1, x2, …, xn), where each
observation is a d-dimensional real vector, k-means clustering
aims to partition the n observations into k (≤ n)
sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of
squares (WCSS). In other words, its objective is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi is the mean of points in Si.
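To make the within-cluster sum of squares concrete, here is a small single-machine sketch of the standard iterative (Lloyd's) procedure for k-means in plain Python. It is not the workshop notebook and does not use Spark; the six two-dimensional `usage` points are synthetic, and the function names are chosen for illustration only.

```python
# Plain-Python sketch of k-means (Lloyd's algorithm) and the WCSS objective.
# The data is synthetic and the function names are illustrative only.
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(clusters, centers):
    """Within-cluster sum of squares for the given partition."""
    return sum(sq_dist(p, centers[i])
               for i, members in enumerate(clusters) for p in members)


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Synthetic "customers": two features per customer, two obvious groups.
usage = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
         (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
centers, clusters = kmeans(usage, k=2)
total_wcss = wcss(clusters, centers)
```

Each returned center is the mean of its cluster, so `total_wcss` is the quantity the objective above minimizes, evaluated at the partition the procedure settles on.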
Clustering approach
- Data Wrangling and Exploration
- Clustering
- Plotting for Exploration
Demo
Data
http://en.openei.org/datasets/dataset?tags=Residential
http://en.openei.org/datasets/dataset/doe-buildings-performance-database-sample-residential-data
Goal : To segment customers to 3-5 segments based on similarity
K-means - Customer segmentation.ipynb
Use-case 2 – Load Forecasting
Given parameters like Temperature, day of week, month, time of day, can we predict load ?
Typically methods like Regression are used
Load = Function of (Temperature, day of week, month, time of day etc)
Ordinary Least Squares Regression
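For a single predictor, ordinary least squares has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies it to synthetic temperature and load values; the numbers and the name `fit_ols` are illustrative assumptions, not taken from the workshop data.

```python
# Closed-form ordinary least squares for one predictor.
# Temperatures and loads below are synthetic, for illustration only.


def fit_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


temps = [70.0, 75.0, 80.0, 85.0, 90.0]
loads = [2.0 + 0.1 * t for t in temps]  # synthetic: load rises with temperature
intercept, slope = fit_ols(temps, loads)
predicted_at_82 = intercept + slope * 82.0
```

Because the synthetic loads lie exactly on a line, the fit recovers that line up to floating-point error; real meter data would leave residuals.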
Machine Learning Model
Piecewise Linear Model
50 different factors to use
Like temperature, holidays
Weekday/weekend etc
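One simple way to read "piecewise linear" is to fit separate straight lines to load versus temperature on either side of a split temperature, since load often rises in both cold and hot weather. The sketch below does that with synthetic data; the 65-degree split (`BALANCE_POINT`), the data, and the function names are assumptions made for illustration and are not the model used in the workshop.

```python
# Sketch of a piecewise linear fit: one line below a split temperature,
# another at or above it. Split value and data are illustrative assumptions.


def fit_line(xs, ys):
    """Closed-form least squares for one predictor: (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_piecewise(temps, loads, split_temp):
    """Fit separate lines to the observations below and above split_temp."""
    cold = [(t, y) for t, y in zip(temps, loads) if t < split_temp]
    warm = [(t, y) for t, y in zip(temps, loads) if t >= split_temp]
    return fit_line(*zip(*cold)), fit_line(*zip(*warm))


def predict(model, temp, split_temp):
    """Evaluate the segment that covers the given temperature."""
    cold_fit, warm_fit = model
    intercept, slope = cold_fit if temp < split_temp else warm_fit
    return intercept + slope * temp


BALANCE_POINT = 65.0  # hypothetical split between heating and cooling regimes
temps = [35.0, 45.0, 55.0, 65.0, 75.0, 85.0, 95.0]
# Synthetic load: falls with temperature when cold, rises when warm.
loads = [10.0 - 0.1 * t if t < BALANCE_POINT else -9.5 + 0.2 * t
         for t in temps]
model = fit_piecewise(temps, loads, BALANCE_POINT)
```

Additional factors such as holidays or weekday/weekend flags would enter as extra predictors, which needs a multivariate solver rather than the one-predictor formula used here.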
Linear Regression
- Data Wrangling and Exploration
- Machine Learning for forecasting
- Plotting for Exploration
248 unique customers
30 million records
5 minute intervals
35 different subindustries
US, Canada and Australia
Data Sources
Temperature data for more than 100 weather stations corresponding to the sites
EnerNOC dataset
248 unique customers
30 million records
5 minute intervals
35 different subindustries
US, Canada and Australia
Demo
Data
Forecast.csv
Goal : To build a Linear Regression model to enable load forecasts given day, hour, month, and
temperature
Load Forecasting-Dataframe.ipynb
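The day, hour, and month predictors can be derived from each reading's timestamp before fitting the regression. The sketch below shows one way to do this with the standard library; the timestamp format, the feature names, and the `is_weekend` flag are assumptions, since the layout of Forecast.csv is not shown in the slides.

```python
# Sketch: derive calendar features from a reading's timestamp.
# The timestamp format and feature names are assumptions, not the
# actual layout of Forecast.csv.
from datetime import datetime


def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return day-of-week, hour, month, and weekend flag for a timestamp."""
    ts = datetime.strptime(timestamp, fmt)
    return {
        "day_of_week": ts.weekday(),   # Monday is 0, Sunday is 6
        "hour": ts.hour,
        "month": ts.month,
        "is_weekend": ts.weekday() >= 5,
    }


row = time_features("2015-01-03 18:05:00")
```

These fields, together with the temperature at the matching weather station, would form one input row for the regression.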
Other Energy sources
Utilities
http://www.sdge.com/documents/green-button-60-min-meter-interval-sample-data-csv
http://www.sdge.com/customer-service/green-button/additional-information-developers-and-third-parties
References
1. Spark documentation and http://spark.apache.org/
2. Spark presentations primarily by Spark founders and the Databricks team
Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
www.analyticscertificate.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
