© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete
Pivotal Data Scientist
Structure Data 2016
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Joint work with Pivotal Data Science
Agenda
•  Introduction
•  Open Source Data Science Toolkit
•  Real world applications
–  Predictive maintenance of automobiles
–  Predicting insurance claims
–  Predicting customer churn
•  Data science deep-dive with Jupyter notebooks
–  Text analytics on MPP (github.com/vatsan)
–  Image processing on MPP (github.com/gautamsm)
Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs)
Our Goals:
•  Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
•  Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
[Figure labels: Data Science, Data Engineering, App Dev]
Pivotal Data Science Knowledge Development
Use Case: Preventive Maintenance for Connected Vehicles
•  Customer vehicles transmit Diagnostic Trouble Codes (DTCs) and vehicle status data to the Pivotal analytics environment
•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
•  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: timeline of DTC observations (families B, P, C, U) preceding three repair jobs with Job Type: Transmission; Job Type: Transmission, Engine; and Job Type: Body. For each job the question is the same: can the DTCs observed beforehand predict this Job Type?]
Data Parallelism
•  One or more jobs can occur on the same day, so this is a multi-label problem
•  One-vs-rest classifiers are built in parallel
[Figure: One-vs-Rest Classification with binary (0/1) labels for Class 1, Class 2, Class 3: Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N]
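The one-vs-rest setup can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the toy records, the use of DTC families as binary features, and the hand-written logistic regression are all assumptions made for the example, and the classifiers are fitted one after another rather than on separate segments.

```python
import math

# Hypothetical toy data: DTC families observed before a repair, and the job
# types performed that day (possibly more than one, hence multi-label).
DTC_FAMILIES = ["B", "C", "P", "U"]
records = [
    ({"B"}, {"Body"}),
    ({"B", "P"}, {"Body", "Engine"}),
    ({"P"}, {"Engine"}),
    ({"P", "U"}, {"Engine", "Transmission"}),
    ({"U"}, {"Transmission"}),
    ({"C", "U"}, {"Transmission"}),
    ({"B", "C"}, {"Body"}),
    ({"B", "U"}, {"Body", "Transmission"}),
]

def encode(dtcs):
    """Binary indicator vector: 1.0 if the DTC family was observed."""
    return [1.0 if fam in dtcs else 0.0 for fam in DTC_FAMILIES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for binary logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

X = [encode(dtcs) for dtcs, _ in records]
classes = sorted({job for _, jobs in records for job in jobs})

# One-vs-rest: one independent binary problem per job type. Because the
# problems do not share state, each could be fitted on a different segment.
models = {}
for cls in classes:
    y = [1.0 if cls in jobs else 0.0 for _, jobs in records]
    models[cls] = fit_logistic(X, y)

probs = {cls: predict_proba(models[cls], encode({"B", "P"})) for cls in classes}
```

Since each binary model is trained without reference to the others, the loop over `classes` is the part that an MPP database can spread across segments.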
Model Scoring Pipeline
[Figure labels: Ingest; DTC: B / DTC: B, P, C / DTC: U; Model Caching (GPDB/HAWQ); Real time scoring; Prob >= Threshold for each of Body, Axle, Engine; Sink; web or mobile app dashboard]
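The "Prob >= Threshold" step of the scoring pipeline amounts to a per-class comparison. A minimal sketch, assuming the per-class probabilities have already been produced by cached models; the function name, the probability values and the thresholds are hypothetical:

```python
def score_labels(probabilities, thresholds, default_threshold=0.5):
    """Return the job types whose probability meets its own threshold."""
    return sorted(
        label
        for label, p in probabilities.items()
        if p >= thresholds.get(label, default_threshold)
    )

# Hypothetical per-class probabilities for one incoming DTC sequence.
probabilities = {"Body": 0.81, "Axle": 0.12, "Engine": 0.47}
# Thresholds can differ per class, e.g. to trade precision against recall.
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.4}

flagged = score_labels(probabilities, thresholds)
```

Keeping a separate threshold per job type lets each one-vs-rest model be tuned on its own.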
MPP Architectural Overview
•  Think of it as multiple PostgreSQL servers: a Master and Segments/Workers
•  Rows are distributed across segments by a particular field (or randomly)
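Distribution by a field can be pictured as hashing the distribution key and taking the result modulo the number of segments. The sketch below illustrates that idea only: the CRC32 hash is an assumption chosen for the example and is not the hash function Greenplum or HAWQ actually use, and the column names are hypothetical.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment id with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return dict(segments)

# Hypothetical rows; vehicle_id plays the role of the distribution key.
rows = [
    {"vehicle_id": "V1", "dtc": "B"},
    {"vehicle_id": "V2", "dtc": "P"},
    {"vehicle_id": "V1", "dtc": "U"},
    {"vehicle_id": "V3", "dtc": "C"},
]
placement = distribute(rows, "vehicle_id", num_segments=4)
```

Rows that share a key always land on the same segment, which is why per-vehicle computations can run without moving data between segments.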
IT TAKES MORE THAN
ONE TOOL
Open Source Data Science Toolkit
[Figure: logo grid grouped into Key Languages, Key Tools (Modeling Tools, Visualization Tools) and Platform. Recoverable labels: MLlib, PL/X, Pivotal Big Data Suite, GemFire]
Scalable, In-Database Machine Learning
•  Open Source: https://github.com/madlib/madlib
•  Works on Greenplum DB, Apache HAWQ and PostgreSQL
•  In active development by Pivotal
•  MADlib is now an Apache Software Foundation incubator project (Apache MADlib, incubating)
Functions
Supervised Learning
Regression Models
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Marginal Effects
•  Multinomial Regression
•  Ordinal Regression
•  Robust Variance, Clustered Variance
•  Support Vector Machines
Tree Methods
•  Decision Tree
•  Random Forest
Other Methods
•  Conditional Random Field
•  Naïve Bayes
Unsupervised Learning
•  Association Rules (Apriori)
•  Clustering (K-means)
•  Topic Modeling (LDA)
Statistics
Descriptive
•  Cardinality Estimators
•  Correlation
•  Summary
Inferential
•  Hypothesis Tests
Other Statistics
•  Probability Functions
Other Modules
•  Conjugate Gradient
•  Linear Solvers
•  PMML Export
•  Random Sampling
•  Term Frequency for Text
Time Series
•  ARIMA
Aug 2015
Data Types and Transformations
•  Array Operations
•  Dimensionality Reduction (PCA)
•  Encoding Categorical Variables
•  Matrix Operations
•  Matrix Factorization (SVD, Low Rank)
•  Norms and Distance Functions
•  Sparse Vectors
Model Evaluation
•  Cross Validation
Predictive Analytics Library
@MADlib_analytic
Use Case: Predicting insurance claim amounts using structured and unstructured data
•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
Text analytics on MPP
•  Unstructured data in the form of claim comments and claim descriptions (text)
•  Use a bag-of-words approach (unigrams, bigrams)
•  Weight terms with tf-idf so that terms distinctive to a claim count for more than terms common to all claims
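A bag-of-words representation with tf-idf weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's code: whitespace tokenization, the unsmoothed `log(N / df)` form of idf, and the sample claim comments are all choices made for the example.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Lower-cased unigrams and bigrams from whitespace tokens."""
    tokens = text.lower().split()
    terms = []
    for n in range(1, n_max + 1):
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms

def tfidf(documents):
    """Per-document dict of term -> tf * idf, with idf = log(N / df)."""
    term_counts = [Counter(ngrams(doc)) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(documents)
    weights = []
    for counts in term_counts:
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical claim comments.
docs = [
    "water damage in kitchen",
    "water damage in basement",
    "water leak in garage",
]
weights = tfidf(docs)
```

With this idf form, a term that appears in every document (here "water") gets weight zero, while a term unique to one claim (here "kitchen") gets the largest weight in that claim.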
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
Use Case: Churn prediction
•  Build a churn model to predict which customers are most likely to churn
•  Provide insights into the key factors responsible for churn, so that the business can intervene before a customer churns
Usage Time Series Data
•  Aggregate weekly usage by user
•  Compute descriptive statistics
•  Extract features based on business expertise
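The three steps above can be sketched with the Python standard library. The event layout `(user, day, amount)`, the chosen statistics, and the `last_over_first` ratio (a stand-in for a business-driven feature) are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

def weekly_usage(events):
    """Sum usage per (user, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user, day, amount in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(user, iso_year, iso_week)] += amount
    return totals

def usage_features(events):
    """Descriptive statistics over each user's weekly usage series."""
    per_user = defaultdict(list)
    # Sorting the keys keeps each user's weeks in chronological order.
    for (user, _year, _week), total in sorted(weekly_usage(events).items()):
        per_user[user].append(total)
    return {
        user: {
            "weeks": len(series),
            "mean": mean(series),
            "std": pstdev(series),
            # Hypothetical hand-crafted feature: is usage shrinking?
            "last_over_first": series[-1] / series[0] if series[0] else None,
        }
        for user, series in per_user.items()
    }

# Hypothetical usage events.
events = [
    ("u1", date(2016, 1, 4), 10.0),
    ("u1", date(2016, 1, 6), 5.0),
    ("u1", date(2016, 1, 12), 3.0),
    ("u2", date(2016, 1, 5), 7.0),
]
features = usage_features(events)
```

In the database the same aggregation would typically be a `GROUP BY` on user and week; the Python version is only meant to make the feature definitions concrete.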
Open Source Analytics Ecosystem
•  Companies benefit from algorithmic breadth and scalability for building and socializing data science models
•  Best-of-breed in-memory and in-database tools for an MPP platform
[Figure: logos grouped under Algorithms and Visualization, including MLlib and PL/X]
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: SQL arrives at the Master Host (with a Standby Master); the Interconnect links it to several Segment Hosts, each running multiple Segments]
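The idea of an embarrassingly parallel task can be imitated outside the database: the same function runs independently on each segment's rows, and the results are simply concatenated. In the sketch below a thread pool stands in for the segments; the data and the `per_segment` function are hypothetical, and nothing here reflects how the database schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor

def per_segment(rows):
    """Stand-in for a UDF body: runs on one segment's rows only."""
    return [(key, sum(values) / len(values)) for key, values in rows]

# Hypothetical data, already split across three segments.
segments = [
    [("car_1", [1.0, 2.0, 3.0])],
    [("car_2", [10.0, 20.0]), ("car_3", [4.0])],
    [("car_4", [0.5, 1.5])],
]

with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    partial_results = list(pool.map(per_segment, segments))

# The coordinating step only concatenates; no segment needs another's data.
results = dict(row for part in partial_results for row in part)
```

The property that matters is that `per_segment` never looks outside its own input, which is what makes the work parallelizable without coordination.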
User Defined Functions (UDFs) in PL/Python
•  Procedural languages need to be installed on each database in which they are used.
•  The syntax is that of a normal Python function whose definition line is replaced by a SQL wrapper. Equivalently, it is a SQL user-defined function with Python inside.
    CREATE FUNCTION seasonality (x float[])    -- SQL wrapper
      RETURNS float[]
    AS $$
      # Normal Python
      import statsmodels.api as sm
      s = sm.tsa.seasonal_decompose(x).seasonal
      return s
    $$ LANGUAGE plpythonu;                     -- SQL wrapper
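The UDF body delegates the work to statsmodels. To make concrete what a "seasonal component" is, here is a minimal plain-Python sketch of classical additive decomposition with a centered moving average. It is a stand-in written for illustration, not statsmodels' implementation: the explicit `period` argument and the simplified edge handling (edge points are left out of the averages) are assumptions.

```python
def seasonal_component(x, period):
    """Additive seasonal component via a centered moving average."""
    n = len(x)
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods")
    half = period // 2
    if period % 2 == 0:
        # 2 x period moving average: half weight on the two end points.
        weights = [0.5] + [1.0] * (period - 1) + [0.5]
    else:
        weights = [1.0] * period
    # Estimate the trend where a full window fits, then subtract it.
    detrended = {}
    for t in range(half, n - half):
        window = x[t - half:t + half + 1]
        trend = sum(w * v for w, v in zip(weights, window)) / period
        detrended[t] = x[t] - trend
    # Average the detrended values at each position in the cycle.
    phase_means = []
    for phase in range(period):
        vals = [d for t, d in detrended.items() if t % period == phase]
        phase_means.append(sum(vals) / len(vals))
    # Center so the seasonal effects sum to zero over one period.
    center = sum(phase_means) / period
    return [phase_means[t % period] - center for t in range(n)]

# Hypothetical series: linear trend plus a repeating 4-step pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [t + pattern[t % 4] for t in range(16)]
seasonal = seasonal_component(series, period=4)
```

On this synthetic series the moving average recovers the linear trend exactly, so the returned component reproduces the repeating pattern. Real usage data would only approximate it.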
Usage Time Series Data with PL/X
•  Easily call open source libraries (for machine learning, signal processing...) from your UDF
•  Runs at scale through data parallelism
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function
Pivotal Data Science Blogs
1.  Scaling native (C++) apps on Pivotal MPP
2.  Predicting commodity futures through Tweets
3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4.  Using data science to predict TV viewer behavior
5.  Twitter NLP: Scaling part-of-speech tagging
6.  Distributed deep learning on MPP and Hadoop
7.  Multi-variate time series forecasting
8.  Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal
Thank You!
A NEW PLATFORM FOR A NEW ERA
