Location:
QuantUniversity Meetup
August 8th 2016
Boston MA
Scaling Analytics with Apache Spark
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
Slides and Code will be available at:
http://www.analyticscertificate.com/SparkWorkshop/
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers (Shell, Firstfuel Software etc.)
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in September
(MATLAB version also available)
Quantitative Analytics and Big Data Analytics Onboarding
• Apply at:
www.analyticscertificate.com
• Program starting September 18th
• Module 1:
▫ Sep 18th, 25th, Oct 2nd, 9th
• Module 2:
▫ Oct 16th, 23rd, 30th, Nov 6th
• Module 3:
▫ Nov 13th, 20th, Dec 4th, Dec 11th
• Capstone + Certification Ceremony
▫ Dec 18th
• August
▫ 14-20th : ARPM in New York www.arpm.co
  - QuantUniversity presenting on Model Risk on August 14th
▫ 18-21st : Big-data Bootcamp
http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html
• September
▫ 1st : QuantUniversity Meetup (AnalyticsCertificate program open house)
▫ 11th, 12th : Spark Workshop, Boston
▫ 19th, 20th : Anomaly Detection Workshop, New York
Events of Interest
Agenda
1. A quick introduction to Apache Spark
2. A sample Spark Program
3. Clustering using Apache Spark
4. Regression using Apache Spark
5. Simulation using Apache Spark
Apache Spark : Soaring in Popularity
Ref: Wall Street Journal http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008
What is Spark ?
• Apache Spark™ is a fast and general engine for large-scale data
processing.
• Came out of U.C. Berkeley’s AMP Lab
Lightning-fast cluster computing
Why Spark ?
Speed
Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Spark has an advanced DAG (directed acyclic graph) execution engine that
supports acyclic data flow and in-memory computing.
Why Spark ?
• Word count in Spark's Python API:
text_file = spark.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
Ease of Use
• Write applications quickly in Java, Scala, Python or R.
• Spark offers over 80 high-level operators that
make it easy to build parallel apps. And you can
use it interactively from the Scala and Python
shells.
• R support recently added
Why Spark ?
• Generality
• Combine SQL, streaming, and
complex analytics.
• Spark powers a stack of high-level
tools including:
1. Spark Streaming: processing real-time
data streams
2. Spark SQL and DataFrames: support
for structured data and relational
queries
3. MLlib: built-in machine learning library
4. GraphX: Spark’s new API for graph
processing
Why Spark?
• Runs Everywhere
• Spark runs on Hadoop, Mesos,
standalone, or in the cloud. It can
access diverse data sources
including HDFS, Cassandra,
HBase, and S3.
• You can run Spark using
its standalone cluster mode,
on EC2, on Hadoop YARN, or
on Apache Mesos.
• Access data
in HDFS, Cassandra, HBase,
Hive, Tachyon, and any Hadoop
data source.
Key Features of Spark
• Handles batch, interactive, and real-time within a single
framework
• Native integration with Java, Python, Scala, R
• Programming at a higher level of abstraction
• More general: map/reduce is just one set of supported
constructs
Secret Sauce : RDD, Transformation, Action
How does it work?
• Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a
fault-tolerant collection of elements that can be operated on in parallel.
• Transformations create a new dataset from an existing one. All transformations
in Spark are lazy: they do not compute their results right away – instead they
remember the transformations applied to some base dataset.
• Actions return a value to the driver program after running a computation on
the dataset.
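The split between lazy transformations and eager actions can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's implementation or API: the class name ToyDataset and its methods are hypothetical and only mimic the behaviour described above, where transformations are recorded and nothing runs until an action is called.

```python
import functools


class ToyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # base data (a re-readable iterable such as a list)
        self._steps = steps     # recorded transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: replay the recorded steps and return a value to the caller.
    def _run(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return functools.reduce(fn, self._run())


calls = []  # records every time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


base = ToyDataset([1, 2, 3, 4, 5])
evens_squared = base.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # nothing has been computed at this point
result = evens_squared.collect()   # first action: the pipeline runs now
total = evens_squared.reduce(lambda a, b: a + b)  # second action: the pipeline runs again
```

In this toy model each action replays the whole pipeline from the base data, which is why the mapped function runs once per element per action.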
How is Spark different?
• Map – Reduce : Hadoop
Problems with this MR model
• Difficult to code
Getting started
• http://spark.apache.org/docs/latest/index.html
• http://datascience.ibm.com/
• https://community.cloud.databricks.com
Quick Demo
• Test_Notebook.ipynb
Machine learning with Spark
Use case 1 : Segmenting stocks
• If we have a basket of stocks and their price history, how do we
segment them into different clusters?
• What metrics could we use to measure similarity?
• Can we evaluate the effect of changing the number of clusters ?
• Do the results seem actionable?
K-means
Given a set of observations (x1, x2, …, xn), where each observation is
a d-dimensional real vector, k-means clustering aims to partition
the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize
the within-cluster sum of squares (WCSS). In other words, its objective
is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where μi is the mean of points in Si.
http://shabal.in/visuals/kmeans/2.html
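To make the WCSS objective concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not the code from the demo notebook and not Spark MLlib's implementation. The six 2-D points are made up for illustration, and the farthest-first initialization is an assumption chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def wcss(points, labels, centers):
    """Within-cluster sum of squares: the quantity k-means tries to minimize."""
    return sum(sq_dist(p, centers[c]) for p, c in zip(points, labels))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate an assignment step and a centroid-update step."""
    # Farthest-first initialization (an assumption made here for determinism).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = mean(members)
    return labels, centers


# Made-up 2-D observations forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

labels, centers = kmeans(points, k=2)
wcss_k2 = wcss(points, labels, centers)

labels_k1, centers_k1 = kmeans(points, k=1)
wcss_k1 = wcss(points, labels_k1, centers_k1)  # a single cluster gives a larger WCSS
```

Comparing the WCSS for several values of k, as done here for k=1 and k=2, is one simple way to look at the effect of changing the number of clusters.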
Demo
• Kmeans spark case.ipynb
Use-case 2 – Regression
• Given historical weekly data on AAA bond yields, 10-year
Treasuries, 30-year Treasuries and the federal funds rate, build a
regression model that fits:
• Changes to AAA = function of (Changes to 10year rates, Changes to 30 year rates, Changes to FF rates)
Linear regression
• Linear regression investigates the linear relationship between variables and
predicts one variable from one or more other variables. It can be
formulated as:
$$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$$
where Y and X_i are random variables, β_i are the regression coefficients and β_0 is a
constant.
• In this model, the ordinary least squares estimator is usually used: it chooses the
coefficients that minimize the sum of squared differences between the observed values
of the dependent variable and the values predicted from the independent variables.
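As a rough illustration of ordinary least squares, the sketch below fits the coefficients by solving the normal equations (XᵀX)β = Xᵀy in plain Python. It is not the demo notebook's code. The data are made-up, noise-free numbers standing in for weekly changes in the three rates, and the "true" coefficients are arbitrary values chosen so that the fit can be checked.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(xs, ys):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # prepend a column of ones for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical weekly changes: (10-year rate, 30-year rate, fed funds rate).
xs = [
    (0.10, 0.05, 0.00),
    (-0.05, 0.02, 0.25),
    (0.20, 0.15, 0.00),
    (0.00, -0.10, -0.25),
    (-0.15, -0.05, 0.00),
    (0.05, 0.10, 0.25),
]
true_beta = [0.01, 0.5, 0.3, 0.1]  # arbitrary coefficients used to generate ys
ys = [true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], x)) for x in xs]

beta = ols(xs, ys)  # recovers true_beta because the data contain no noise
```

With real yield data the residuals would not be zero, and the fitted coefficients would only approximate the underlying relationship.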
Ordinary Least Squares Regression
Demo
• Regression.ipynb
Scaling Monte-Carlo simulations
Example:
• Portfolio Growth
• Given:
▫ INVESTMENT_INIT = 100000 # starting amount
▫ INVESTMENT_ANN = 10000 # yearly new investment
▫ TERM = 30 # number of years
▫ MKT_AVG_RETURN = 0.11 # percentage
▫ MKT_STD_DEV = 0.18 # standard deviation
▫ Run 10000 monte-carlo simulation paths and compute the expected
value of the portfolio at the end of 30 years
Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
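A single-machine sketch of the portfolio simulation, using the parameters listed above, might look as follows. Two modelling choices are assumptions rather than something the slides state: annual returns are drawn independently from a normal distribution, and the yearly contribution is added after that year's growth. Seeding each path with its index is also just one way to make paths reproducible and independent.

```python
import random

INVESTMENT_INIT = 100000  # starting amount
INVESTMENT_ANN = 10000    # yearly new investment
TERM = 30                 # number of years
MKT_AVG_RETURN = 0.11     # mean annual return
MKT_STD_DEV = 0.18        # standard deviation of annual return
N_PATHS = 10000           # number of Monte Carlo paths


def simulate_path(seed):
    """Simulate one path and return the portfolio value after TERM years."""
    rng = random.Random(seed)  # one generator per path, seeded by the path index
    value = INVESTMENT_INIT
    for _ in range(TERM):
        annual_return = rng.gauss(MKT_AVG_RETURN, MKT_STD_DEV)
        value = value * (1 + annual_return) + INVESTMENT_ANN
    return value


terminal_values = [simulate_path(seed) for seed in range(N_PATHS)]
expected_value = sum(terminal_values) / N_PATHS

# Closed-form mean under the same assumptions, used as a sanity check:
# E[V_T] = V_0 * g^T + A * (g^T - 1) / (g - 1), with g = 1 + mean return.
growth = (1 + MKT_AVG_RETURN) ** TERM
analytic_mean = INVESTMENT_INIT * growth + INVESTMENT_ANN * (growth - 1) / MKT_AVG_RETURN
```

Because each path depends only on its seed, the paths are independent tasks. On a cluster the same function could be mapped over a parallelized range of seeds, for example `sc.parallelize(range(N_PATHS)).map(simulate_path).mean()`, assuming `sc` is an existing SparkContext.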
• The count-distinct problem is the problem of finding the number of
distinct elements in a data stream with repeated elements.
• HyperLogLog is an algorithm for the count-distinct problem,
approximating the number of distinct elements in a multiset
• Calculating the exact cardinality of a multiset requires an amount of
memory proportional to the cardinality, which is impractical for very
large data sets. Probabilistic cardinality estimators, such as the
HyperLogLog algorithm, use significantly less memory than this, at
the cost of obtaining only an approximation of the cardinality.
Hyperloglog
Ref: https://en.wikipedia.org/wiki/HyperLogLog
Hyperloglog
The basis of the HyperLogLog algorithm is the observation
that the cardinality of a multiset of uniformly distributed
random numbers can be estimated by calculating the
maximum number of leading zeros in the binary
representation of each number in the set. If the maximum
number of leading zeros observed is n, an estimate for the
number of distinct elements in the set is 2^n
Ref: https://en.wikipedia.org/wiki/HyperLogLog
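The leading-zeros observation can be sketched in a few lines of plain Python. This is only the crude single-estimator idea described above, not the full HyperLogLog algorithm, which splits the input across many registers and combines them to reduce variance. The 32-bit width and the use of SHA-256 as the hash are arbitrary assumptions for illustration.

```python
import hashlib


def leading_zeros_32(value):
    """Number of leading zero bits in a 32-bit unsigned integer."""
    return 32 - value.bit_length()


def hash32(item):
    """Map an item to a roughly uniform 32-bit integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def crude_distinct_estimate(items):
    """Estimate the distinct count as 2**n, where n is the largest leading-zero count seen."""
    max_zeros = 0
    for item in items:
        max_zeros = max(max_zeros, leading_zeros_32(hash32(item)))
    return 2 ** max_zeros


unique_items = list(range(1000))
stream_with_repeats = unique_items * 3  # every element appears three times

estimate_unique = crude_distinct_estimate(unique_items)
estimate_repeats = crude_distinct_estimate(stream_with_repeats)
```

Repeated elements hash to the same value, so they cannot change the maximum. The estimate therefore depends only on the set of distinct elements, and the memory used is a single small counter. The estimate is always a power of two and can be far from the true count.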
• Approximate algorithms
▫ approxCountDistinct: returns an estimate of the number of distinct
elements
▫ approxQuantile: returns approximate percentiles of numerical data
Refer:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/4196864626084292/3601578643761083/latest.html
Demo from Databricks’s blog
• As per Databricks’s blog:
“Spark strives at implementing approximate algorithms that are
deterministic (they do not depend on random numbers to work) and that
have proven theoretical error bounds: for each algorithm, the user can
specify a target error bound, and the result is guaranteed to be within
this bound, either exactly (deterministic error bounds) or with very high
confidence (probabilistic error bounds)”
Spark’s implementation
Scaling Analytics with Apache Spark
www.analyticscertificate.com/SparkWorkshop
Q&A
Slides, code and details about the Apache Spark Workshop
at: http://www.analyticscertificate.com/SparkWorkshop/
Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.

More Related Content

What's hot (20)

Ds for finance day 2
Ds for finance day 2
QuantUniversity
 
Deep learning QuantUniversity meetup
Deep learning QuantUniversity meetup
QuantUniversity
 
Nlp and Neural Networks workshop
Nlp and Neural Networks workshop
QuantUniversity
 
An Introduction to Anomaly Detection
An Introduction to Anomaly Detection
Kenneth Graham
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
Manojit Nandi
 
Feature Selection for Document Ranking
Feature Selection for Document Ranking
Andrea Gigli
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
Greg Makowski
 
Musings of kaggler
Musings of kaggler
Kai Xin Thia
 
Biostatistics Workshop: Missing Data
Biostatistics Workshop: Missing Data
HopkinsCFAR
 
Machine Learning Algorithm & Anomaly detection 2021
Machine Learning Algorithm & Anomaly detection 2021
Chakrit Phain
 
Missing Data and data imputation techniques
Missing Data and data imputation techniques
Omar F. Althuwaynee
 
Explainable AI Workshop
Explainable AI Workshop
QuantUniversity
 
Machine Learning for Dummies
Machine Learning for Dummies
Venkata Reddy Konasani
 
Imputation Techniques For Market Research Datasets With Missing Values
Imputation Techniques For Market Research Datasets With Missing Values
Salford Systems
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actions
Hesen Peng
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
José Ramón Ríos Viqueira
 
udacity-dandsyllabus
udacity-dandsyllabus
Bora Yüret
 
Andrew Bossy. Data Imputation Using Reverse ML
Andrew Bossy. Data Imputation Using Reverse ML
Lviv Startup Club
 
Statistical Approaches to Missing Data
Statistical Approaches to Missing Data
DataCards
 
Missing Data and Causes
Missing Data and Causes
akanni azeez olamide
 
Deep learning QuantUniversity meetup
Deep learning QuantUniversity meetup
QuantUniversity
 
Nlp and Neural Networks workshop
Nlp and Neural Networks workshop
QuantUniversity
 
An Introduction to Anomaly Detection
An Introduction to Anomaly Detection
Kenneth Graham
 
Anomaly Detection for Real-World Systems
Anomaly Detection for Real-World Systems
Manojit Nandi
 
Feature Selection for Document Ranking
Feature Selection for Document Ranking
Andrea Gigli
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
Greg Makowski
 
Musings of kaggler
Musings of kaggler
Kai Xin Thia
 
Biostatistics Workshop: Missing Data
Biostatistics Workshop: Missing Data
HopkinsCFAR
 
Machine Learning Algorithm & Anomaly detection 2021
Machine Learning Algorithm & Anomaly detection 2021
Chakrit Phain
 
Missing Data and data imputation techniques
Missing Data and data imputation techniques
Omar F. Althuwaynee
 
Imputation Techniques For Market Research Datasets With Missing Values
Imputation Techniques For Market Research Datasets With Missing Values
Salford Systems
 
Linear regression on 1 terabytes of data? Some crazy observations and actions
Linear regression on 1 terabytes of data? Some crazy observations and actions
Hesen Peng
 
Aggregating Multiple Dimensions for Computing Document Relevance
Aggregating Multiple Dimensions for Computing Document Relevance
José Ramón Ríos Viqueira
 
udacity-dandsyllabus
udacity-dandsyllabus
Bora Yüret
 
Andrew Bossy. Data Imputation Using Reverse ML
Andrew Bossy. Data Imputation Using Reverse ML
Lviv Startup Club
 
Statistical Approaches to Missing Data
Statistical Approaches to Missing Data
DataCards
 

Viewers also liked (20)

Deep learning Tutorial - Part II
Deep learning Tutorial - Part II
QuantUniversity
 
Deep learning - Part I
Deep learning - Part I
QuantUniversity
 
Deep learning and Apache Spark
Deep learning and Apache Spark
QuantUniversity
 
Model Risk Management : Best Practices
Model Risk Management : Best Practices
QuantUniversity
 
Missing data handling
Missing data handling
QuantUniversity
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
QuantUniversity
 
FitchLearning QuantUniversity Model Risk Presentation
FitchLearning QuantUniversity Model Risk Presentation
QuantUniversity
 
PythonQuants conference - QuantUniversity presentation - Stress Testing in th...
PythonQuants conference - QuantUniversity presentation - Stress Testing in th...
QuantUniversity
 
Stormwater analytics with MongoDB and Pentaho
Stormwater analytics with MongoDB and Pentaho
Dave Callaghan
 
Model Risk Management: Using an infinitely scalable stress testing platform f...
Model Risk Management: Using an infinitely scalable stress testing platform f...
QuantUniversity
 
Guest talk- Roof Classification
Guest talk- Roof Classification
QuantUniversity
 
Big data, Analytics and Beyond
Big data, Analytics and Beyond
QuantUniversity
 
A Framework Driven Approach to Model Risk Management (www.dataanalyticsfinanc...
A Framework Driven Approach to Model Risk Management (www.dataanalyticsfinanc...
QuantUniversity
 
Anomaly detection
Anomaly detection
QuantUniversity
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
Sri Ambati
 
Scaling spark
Scaling spark
Alex Rovner
 
Debugging & Tuning in Spark
Debugging & Tuning in Spark
Shiao-An Yuan
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Chris Fregly
 
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Sri Ambati
 
Deep learning Tutorial - Part II
Deep learning Tutorial - Part II
QuantUniversity
 
Deep learning and Apache Spark
Deep learning and Apache Spark
QuantUniversity
 
Model Risk Management : Best Practices
Model Risk Management : Best Practices
QuantUniversity
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
QuantUniversity
 
FitchLearning QuantUniversity Model Risk Presentation
FitchLearning QuantUniversity Model Risk Presentation
QuantUniversity
 
PythonQuants conference - QuantUniversity presentation - Stress Testing in th...
PythonQuants conference - QuantUniversity presentation - Stress Testing in th...
QuantUniversity
 
Stormwater analytics with MongoDB and Pentaho
Stormwater analytics with MongoDB and Pentaho
Dave Callaghan
 
Model Risk Management: Using an infinitely scalable stress testing platform f...
Model Risk Management: Using an infinitely scalable stress testing platform f...
QuantUniversity
 
Guest talk- Roof Classification
Guest talk- Roof Classification
QuantUniversity
 
Big data, Analytics and Beyond
Big data, Analytics and Beyond
QuantUniversity
 
A Framework Driven Approach to Model Risk Management (www.dataanalyticsfinanc...
A Framework Driven Approach to Model Risk Management (www.dataanalyticsfinanc...
QuantUniversity
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
Sri Ambati
 
Debugging & Tuning in Spark
Debugging & Tuning in Spark
Shiao-An Yuan
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Chris Fregly
 
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Sri Ambati
 
Ad

Similar to Scaling Analytics with Apache Spark (20)

Technical_Report_on_ML_Library
Technical_Report_on_ML_Library
Saurabh Chauhan
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
Spark Summit
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Spark ml streaming
Spark ml streaming
Adam Doyle
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Sri Ambati
 
Graph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Microservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
04 open source_tools
04 open source_tools
Marco Quartulli
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Training Neural Networks
Training Neural Networks
Databricks
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
Sri Ambati
 
Big Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_Library
Saurabh Chauhan
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
Spark Summit
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Spark ml streaming
Spark ml streaming
Adam Doyle
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Sri Ambati
 
Graph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Microservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Training Neural Networks
Training Neural Networks
Databricks
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
Sri Ambati
 
Big Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Ad

More from QuantUniversity (20)

AI in Finance and Retirement Systems: Insights from the EBRI-Milken Institute...
AI in Finance and Retirement Systems: Insights from the EBRI-Milken Institute...
QuantUniversity
 
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitig...
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitig...
QuantUniversity
 
EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !
QuantUniversity
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
QuantUniversity
 
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
QuantUniversity
 
Qu for India - QuantUniversity FundRaiser
Qu for India - QuantUniversity FundRaiser
QuantUniversity
 
Ml master class for CFA Dallas
Ml master class for CFA Dallas
QuantUniversity
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0
QuantUniversity
 
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
QuantUniversity
 
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
QuantUniversity
 
Seeing what a gan cannot generate: paper review
Seeing what a gan cannot generate: paper review
QuantUniversity
 
AI Explainability and Model Risk Management
AI Explainability and Model Risk Management
QuantUniversity
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0
QuantUniversity
 
Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021
QuantUniversity
 
Bayesian Portfolio Allocation
Bayesian Portfolio Allocation
QuantUniversity
 
The API Jungle
The API Jungle
QuantUniversity
 
Constructing Private Asset Benchmarks
Constructing Private Asset Benchmarks
QuantUniversity
 
Machine Learning Interpretability
Machine Learning Interpretability
QuantUniversity
 
Responsible AI in Action
Responsible AI in Action
QuantUniversity
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in Finance
QuantUniversity
 
AI in Finance and Retirement Systems: Insights from the EBRI-Milken Institute...
AI in Finance and Retirement Systems: Insights from the EBRI-Milken Institute...
QuantUniversity
 
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitig...
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitig...
QuantUniversity
 
EU Artificial Intelligence Act 2024 passed !
EU Artificial Intelligence Act 2024 passed !
QuantUniversity
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
QuantUniversity
 
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
PYTHON AND DATA SCIENCE FOR INVESTMENT PROFESSIONALS
QuantUniversity
 
Qu for India - QuantUniversity FundRaiser
Qu for India - QuantUniversity FundRaiser
QuantUniversity
 
Ml master class for CFA Dallas
Ml master class for CFA Dallas
QuantUniversity
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0
QuantUniversity
 
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
Towards Fairer Datasets: Filtering and Balancing the Distribution of the Peop...
QuantUniversity
 
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
Machine Learning: Considerations for Fairly and Transparently Expanding Acces...
QuantUniversity
 
Seeing what a gan cannot generate: paper review
Seeing what a gan cannot generate: paper review
QuantUniversity
 
AI Explainability and Model Risk Management
AI Explainability and Model Risk Management
QuantUniversity
 
Algorithmic auditing 1.0
Algorithmic auditing 1.0
QuantUniversity
 
Machine Learning in Finance: 10 Things You Need to Know in 2021
Machine Learning in Finance: 10 Things You Need to Know in 2021
QuantUniversity
 
Bayesian Portfolio Allocation
Bayesian Portfolio Allocation
QuantUniversity
 
Constructing Private Asset Benchmarks
Constructing Private Asset Benchmarks
QuantUniversity
 
Machine Learning Interpretability
Machine Learning Interpretability
QuantUniversity
 
Responsible AI in Action
Responsible AI in Action
QuantUniversity
 
Qu speaker series 14: Synthetic Data Generation in Finance
Qu speaker series 14: Synthetic Data Generation in Finance
QuantUniversity
 

Recently uploaded (20)

Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Institut de l'Elevage - Idele
 
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
Data-Driven-Operational--Excellence.pptx
Data-Driven-Operational--Excellence.pptx
NiwanthaThilanjanaGa
 
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool...
apidays
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays
 
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
KLIP2Data voor de herinrichting van R4 West en Oost
KLIP2Data voor de herinrichting van R4 West en Oost
jacoba18
 
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
QUALITATIVE EXPLANATORY VARIABLES REGRESSION MODELS
Ameya Patekar
 
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays New York 2025 - Life is But a (Data) Stream by Sandon Jacobs (Confluent)
apidays
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...
apidays
 
Data-Driven-Operational--Excellence.pptx
Scaling Analytics with Apache Spark

  • 1. Location: QuantUniversity Meetup August 8th 2016 Boston MA Scaling Analytics with Apache Spark 2016 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP www.QuantUniversity.com sri@quantuniversity.com
  • 2. Slides and code will be available at: http://www.analyticscertificate.com/SparkWorkshop/
  • 3. - Analytics Advisory services - Custom training programs - Architecture assessments, advice and audits
  • 4. • Founder of QuantUniversity LLC. and www.analyticscertificate.com • Advisory and Consultancy for Financial Analytics • Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers (Shell, Firstfuel Software etc.) • Regular Columnist for the Wilmott Magazine • Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley • Chartered Financial Analyst and Certified Analytics Professional • Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston • Sri Krishnamurthy, Founder and CEO
  • 5. Quantitative Analytics and Big Data Analytics Onboarding • Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R • Launching the Analytics Certificate Program in September
  • 6. (MATLAB version also available)
  • 7. Quantitative Analytics and Big Data Analytics Onboarding • Apply at: www.analyticscertificate.com • Program starting September 18th • Module 1: ▫ Sep 18th, 25th, Oct 2nd, 9th • Module 2: ▫ Oct 16th, 23rd, 30th, Nov 6th • Module 3: ▫ Nov 13th, 20th, Dec 4th, Dec 11th • Capstone + Certification Ceremony ▫ Dec 18th
  • 8. Events of Interest • August ▫ 14-20th: ARPM in New York www.arpm.co - QuantUniversity presenting on Model Risk on August 14th ▫ 18-21st: Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html • September ▫ 1st: QuantUniversity Meetup (AnalyticsCertificate program open house) ▫ 11th, 12th: Spark Workshop, Boston ▫ 19th, 20th: Anomaly Detection Workshop, New York
  • 10. Agenda 1. A quick introduction to Apache Spark 2. A sample Spark Program 3. Clustering using Apache Spark 4. Regression using Apache Spark 5. Simulation using Apache Spark
  • 11. Apache Spark: Soaring in Popularity Ref: Wall Street Journal http://www.wsj.com/articles/newer-software-aims-to-crunch-hadoops-numbers-1434326008
  • 12. What is Spark ? • Apache Spark™ is a fast and general engine for large-scale data processing. • Came out of U.C. Berkeley’s AMP Lab Lightning-fast cluster computing
  • 13. Why Spark? • Speed • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing.
  • 14. Why Spark? • Ease of Use • Write applications quickly in Java, Scala, Python or R. • Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells. • R support recently added • Word count in Spark's Python API: text_file = spark.textFile("hdfs://...") text_file.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
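The three steps of the word count on this slide (flatMap, map, reduceByKey) can be traced locally without a cluster. The following is a minimal plain-Python sketch of the same logic, not Spark code: the function name `word_count` and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Plain-Python analogue of flatMap -> map -> reduceByKey (not the Spark API)."""
    # flatMap step: split every line into words and flatten into one stream
    words = chain.from_iterable(line.split() for line in lines)
    # map step: pair each word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: add up the counts that share the same key (word)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical input standing in for the lines of a text file
sample_lines = ["spark makes big data simple", "big data big insights"]
print(word_count(sample_lines))
```

In Spark the same pipeline runs partition by partition across the cluster; here everything happens in a single process, which is only useful for checking the logic.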
  • 15. Why Spark ? • Generality • Combine SQL, streaming, and complex analytics. • Spark powers a stack of high-level tools including: 1. Spark Streaming: processing real-time data streams 2. Spark SQL and DataFrames: support for structured data and relational queries 3. MLlib: built-in machine learning library 4. GraphX: Spark’s new API for graph processing
  • 16. Why Spark? • Runs Everywhere • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. • You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. • Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
  • 17. Key Features of Spark • Handles batch, interactive, and real-time within a single framework • Native integration with Java, Python, Scala, R • Programming at a higher level of abstraction • More general: map/reduce is just one set of supported constructs
  • 18. Secret Sauce : RDD, Transformation, Action
  • 19. How does it work? • Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel. • Transformations create a new dataset from an existing one. All transformations in Spark are lazy: they do not compute their results right away – instead they remember the transformations applied to some base dataset. • Actions return a value to the driver program after running a computation on the dataset.
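To make the distinction between lazy transformations and actions concrete, here is a toy sketch in plain Python. `LazyDataset` is a hypothetical class written for this illustration; it is not Spark's RDD and has no distribution or fault tolerance. It only shows that transformations record what to do, while an action triggers the work.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # base dataset (any re-iterable, e.g. a list)
        self._ops = tuple(ops)   # recorded transformations, in order

    # Transformations: return a new dataset and compute nothing yet
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _run(self):
        # Replay the recorded transformations over the base dataset
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the computation and return a value to the caller
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


even_squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))              # 0: the transformations have not run yet
print(even_squares.collect())  # [4, 16]: the action triggers the computation
```

Because each dataset keeps the list of transformations applied to the base data, a result can always be recomputed from the source, which is the idea behind the "remember the transformations" wording on the slide.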
  • 20. How is Spark different? • Map – Reduce : Hadoop
  • 21. Problems with this MR model • Difficult to code
  • 22. Getting started • http://spark.apache.org/docs/latest/index.html • http://datascience.ibm.com/ • https://community.cloud.databricks.com
  • 27. Use case 1 : Segmenting stocks • If we have a basket of stocks and their price history, how do we segment them into different clusters? • What metrics could we use to measure similarity? • Can we evaluate the effect of changing the number of clusters ? • Do the results seem actionable?
  • 28. K-means • Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into $k \le n$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the points in $S_i$. • http://shabal.in/visuals/kmeans/2.html
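The within-cluster sum of squares objective can be explored on a small example before moving to the Spark demo. The sketch below uses Lloyd's algorithm, the usual heuristic for k-means, in plain Python; it is not the Spark MLlib implementation. The function names and the two-feature points (imagine a made-up mean return and volatility per stock) are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Recompute each center as the mean of its members (keep it if empty)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(points, centers)


def wcss(centers, clusters):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, center)
               for center, members in zip(centers, clusters)
               for p in members)


# Hypothetical two-feature observations forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(centers, wcss(centers, clusters))
```

Running `kmeans` for several values of k and comparing `wcss` is one simple way to look at the effect of changing the number of clusters raised in use case 1.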
  • 29. Demo • Kmeans spark case.ipynb
  • 30. Use case 2: Regression • Given historical weekly data on AAA bond yields, 10-year Treasuries, 30-year Treasuries and Federal funds rates, build a regression model of the form: • Changes to AAA = function of (changes to 10-year rates, changes to 30-year rates, changes to FF rates)
  • 31. Linear regression • Linear regression investigates the linear relationship between variables and predicts one variable from one or more other variables. It can be formulated as $Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$, where $Y$ and $X_i$ are random variables, $\beta_i$ are the regression coefficients and $\beta_0$ is a constant. • In this model, the ordinary least squares estimator is usually used: it chooses the coefficients that minimize the sum of squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
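As a small illustration of ordinary least squares with several predictors, the sketch below solves the normal equations (X^T X) beta = X^T y in plain Python. It is not Spark's regression API, and the data is synthetic: the three columns only stand in for changes in 10-year, 30-year and Fed funds rates, and the "true" coefficients are invented so the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic predictors (hypothetical weekly changes in three rates)
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (2, 1, 3)]
true_beta = [0.5, 2.0, -1.0, 0.25]  # invented coefficients, for checking only
y = [true_beta[0] + sum(b * x for b, x in zip(true_beta[1:], r)) for r in rows]

print(fit_ols(rows, y))
```

Because the synthetic y has no noise, the fitted coefficients should match the invented ones up to floating-point error; with real rate data the residuals would be nonzero and worth inspecting.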
  • 35. Example: Portfolio Growth • Given: ▫ INVESTMENT_INIT = 100000 # starting amount ▫ INVESTMENT_ANN = 10000 # yearly new investment ▫ TERM = 30 # number of years ▫ MKT_AVG_RETURN = 0.11 # average annual return (11%, as a fraction) ▫ MKT_STD_DEV = 0.18 # standard deviation of annual return • Run 10,000 Monte Carlo simulation paths and compute the expected value of the portfolio at the end of 30 years • Ref: https://cloud.google.com/solutions/monte-carlo-methods-with-hadoop-spark
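A single-machine version of this simulation fits in a few lines of plain Python. The slide does not spell out the yearly update rule, so the sketch assumes one: each year the portfolio grows by a normally distributed return and the annual investment is then added. The function names `grow` and `simulate` and the seed are likewise choices made for this illustration.

```python
import random
import statistics

INVESTMENT_INIT = 100000   # starting amount
INVESTMENT_ANN = 10000     # yearly new investment
TERM = 30                  # number of years
MKT_AVG_RETURN = 0.11      # average annual return, as a fraction
MKT_STD_DEV = 0.18         # standard deviation of annual return


def grow(rng):
    """Simulate one path; the update rule below is an assumed model."""
    value = INVESTMENT_INIT
    for _ in range(TERM):
        annual_return = rng.normalvariate(MKT_AVG_RETURN, MKT_STD_DEV)
        value += value * annual_return + INVESTMENT_ANN
    return value


def simulate(paths=10000, seed=42):
    """Run independent paths and return the list of final portfolio values."""
    rng = random.Random(seed)
    return [grow(rng) for _ in range(paths)]


results = simulate()
print(f"mean final value:   {statistics.mean(results):,.0f}")
print(f"median final value: {statistics.median(results):,.0f}")
```

Each path is independent of the others, which is what makes this kind of workload a natural fit for distributing across a cluster: the paths can be computed in parallel and only the final values need to be combined.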
  • 36. HyperLogLog • The count-distinct problem is the problem of finding the number of distinct elements in a data stream with repeated elements. • HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. • Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. Ref: https://en.wikipedia.org/wiki/HyperLogLog
  • 37. HyperLogLog • The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is $2^n$. Ref: https://en.wikipedia.org/wiki/HyperLogLog
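The leading-zeros idea can be tried out with a compact HyperLogLog sketch in plain Python. A single maximum is a very noisy estimator, so the algorithm splits hashed values into many registers and combines them with a harmonic mean; the code below follows that standard construction. The class name, the choice of SHA-1 as the hash, and the register count (p = 12) are assumptions for illustration, and this is not Spark's implementation.

```python
import hashlib
import math


def hash64(item):
    """Map an item to a 64-bit integer that behaves like a uniform random number."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item):
        x = hash64(item)
        index = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = number of leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the registers
print(round(hll.estimate()))
```

The sketch stores only 4096 small integers regardless of how many items are added, which is the memory saving the slides describe; the price is a relative error on the order of a few percent.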
  • 38. Approximate algorithms (demo from Databricks’s blog) ▫ approxCountDistinct: returns an estimate of the number of distinct elements ▫ approxQuantile: returns approximate percentiles of numerical data Refer: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/4196864626084292/3601578643761083/latest.html
  • 39. Spark’s implementation • As per Databricks’s blog: “Spark strives at implementing approximate algorithms that are deterministic (they do not depend on random numbers to work) and that have proven theoretical error bounds: for each algorithm, the user can specify a target error bound, and the result is guaranteed to be within this bound, either exactly (deterministic error bounds) or with very high confidence (probabilistic error bounds)”
  • 42. Q&A Slides, code and details about the Apache Spark Workshop at: http://www.analyticscertificate.com/SparkWorkshop/
  • 43. Thank you! Members & Sponsors! Sri Krishnamurthy, CFA, CAP Founder and CEO QuantUniversity LLC. Contact: srikrishnamurthy www.QuantUniversity.com Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.