BUILT FOR THE SPEED OF BUSINESS
Massively Parallel Processing
with Procedural Python
How do we use the PyData stack in data science
engagements at Pivotal?
Ian Huston, @ianhuston
Data Scientist, Pivotal

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

Some Links for this talk
•  Simple code examples:
https://github.com/ihuston/plpython_examples
•  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
•  More info (written for PL/R but applies to PL/Python):
http://gopivotal.github.io/gp-r/
•  Traffic Disruption demo (if we have time):
http://ds-demo-transport.cfapps.io

About Pivotal
•  Data-Driven Application Development
•  Pivotal Data Science Labs
•  Cloud Application Platform
•  Data & Analytics Platform
•  Virtualization
•  Cloud Storage

What do our customers look like?
•  Large enterprises with lots of data collected
–  Work with 10s of TBs to PBs of data, structured & unstructured

•  Not able to get what they want out of their data
–  Old legacy systems with high cost and no flexibility
–  Response times are too slow for interactive data analysis
–  Can only deal with small samples of data locally

•  They want to transform into data-driven enterprises

Open Source is Pivotal

Pivotal’s Open Source Contributions
Lots more interesting small projects:

•  PyMADlib – Python wrapper for MADlib
https://github.com/gopivotal/pymadlib

•  PivotalR – R wrapper for MADlib
http://github.com/madlib-internal/PivotalR

•  Part-of-speech tagger for Twitter via SQL
http://vatsan.github.io/gp-ark-tweet-nlp/

•  Pandas via psql (interactive PostgreSQL terminal)
https://github.com/vatsan/pandas_via_psql

Typical Engagement Tech Setup
•  Platform:
–  Greenplum Analytics Database (GPDB)
–  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)

•  Open Source Options (http://gopivotal.com):
–  Greenplum Community Edition
–  Pivotal HD Community Edition (HAWQ not included)
–  MADlib in-database machine learning library (http://madlib.net)

•  Where Python fits in:
–  PL/Python running in-database, with nltk, scikit-learn etc.
–  IPython for exploratory analysis
–  Pandas, Matplotlib etc.

PIVOTAL DATA SCIENCE TOOLKIT
A large and varied tool box!

1  Find Data
•  Platforms: Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS

2  Write Code
•  Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime
•  Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R

3  Run Code
•  Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, Putty, Winscp

4  Write Code for Big Data
•  In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
•  Hadoop: HAWQ, Pig, Hive, Java

5  Implement Algorithms
•  Libraries: MADlib
•  Java: Mahout
•  R: (Too many to list!)
•  Text: OpenNLP, NLTK, GPText
•  C++: opencv
•  Python: NumPy, SciPy, scikit-learn, Pandas
•  Programs: Alpine Miner, Rstudio, MATLAB, SAS, Stata

6  Show Results
•  Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel

7  Collaborate
•  Sharing Tools: Chorus, Confluence, Socialcast, Github, Google Drive & Hangouts

PL/Python

MPP Architectural Overview
Think of it as multiple PostgreSQL servers: a Master and a set of Workers.

Data Parallelism
•  Little or no effort is required to break up the problem into a
number of parallel tasks, and there is no dependency (or
communication) between those parallel tasks.
•  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  map() function in Python
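As a rough illustration of the map() example, the sketch below applies the same independent per-item function once serially and once through a thread pool. The `measure` function and the student data are made up for this example; the point is only that no task needs anything from any other task, so the results agree however the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor


def measure(student):
    """Independent per-item task: uses nothing from any other student."""
    name, height_cm = student
    return name, height_cm / 100.0


# Hypothetical data, one tuple per student.
students = [("Ann", 162), ("Bob", 175), ("Cat", 158)]

# Serial version using the built-in map().
serial = list(map(measure, students))

# Same function, same inputs, spread over three worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(measure, students))
```

Both `map` variants return results in input order, so the two lists can be compared directly.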

User-Defined Functions (UDFs)
•  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
•  Simple UDFs are SQL queries with calling arguments and return types.

Definition:

    CREATE FUNCTION times2(INT)
    RETURNS INT
    AS $$
        SELECT 2 * $1
    $$ LANGUAGE sql;

Execution:

    SELECT times2(1);
     times2
    --------
          2
    (1 row)

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/PostgreSQL functions in R, Python,
Java, Perl, pgsql or C
•  The interpreter/VM of the language ‘X’ is installed on each node of the
Greenplum Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on Greenplum’s MPP architecture

[Diagram: SQL arrives at the Master Host (backed by a Standby Master); the Interconnect links it to the Segment Hosts, each holding multiple Segments.]

Intro to PL/Python
•  Procedural languages need to be installed on each database where they are used.
•  The name in SQL is plpythonu; the ‘u’ means untrusted, so you need to be a superuser to install it.
•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper.
Alternatively, it is like a SQL User-Defined Function with Python inside.

    CREATE FUNCTION pymax (a integer, b integer)   -- SQL wrapper
      RETURNS integer
    AS $$
      if a > b:                                     # normal Python
        return a
      return b
    $$ LANGUAGE plpythonu;                          -- SQL wrapper
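Because the function body is ordinary Python, one way to try it out before loading it into the database is to wrap the same lines in a plain `def`. This is only a local sketch: the `def` line is added here for testing and is not part of PL/Python, where the SQL wrapper supplies the argument names.

```python
def pymax(a, b):
    # Same body as the PL/Python function; only the def line is new.
    if a > b:
        return a
    return b


larger = pymax(3, 7)
same = pymax(5, 5)
```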
  

Examples

Returning Results
•  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
•  Composite types can be returned by creating a composite type in the database:

    CREATE TYPE named_value AS (
      name   text,
      value  integer
    );

•  Then you can return a list, tuple or dict (not a set) whose structure matches that of the composite type:

    CREATE FUNCTION make_pair (name text, value integer)
      RETURNS named_value
    AS $$
      return [ name, value ]
      # or alternatively, as tuple: return ( name, value )
      # or as dict: return { "name": name, "value": value }
      # or as an object with attributes .name and .value
    $$ LANGUAGE plpythonu;

•  For functions which return multiple rows, prefix “setof” before the return type.
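The four return shapes listed in the comments can be compared side by side in plain Python. The sketch below is an illustration only: `to_row` and the `NamedValue` class are hypothetical helpers that imitate how a result could be matched to the `(name, value)` columns by position, by key, or by attribute. It is not the database's actual conversion code.

```python
class NamedValue(object):
    """Hypothetical object with .name and .value attributes."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def to_row(result, columns=("name", "value")):
    """Illustrative stand-in for mapping a return value onto the columns."""
    if isinstance(result, (list, tuple)):
        return tuple(result)                             # matched by position
    if isinstance(result, dict):
        return tuple(result[c] for c in columns)         # matched by key
    return tuple(getattr(result, c) for c in columns)    # matched by attribute


rows = [
    to_row(["a", 1]),
    to_row(("a", 1)),
    to_row({"name": "a", "value": 1}),
    to_row(NamedValue("a", 1)),
]
```

All four shapes end up as the same `(name, value)` pair, which is why they are interchangeable as return values.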

Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:

Sequence:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value
    AS $$
      return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
    $$ LANGUAGE plpythonu;

Generator:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value AS $$
      for i in range(3):
          yield (name, i)
    $$ LANGUAGE plpythonu;
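The difference between the two forms can be seen in plain Python. In this sketch the function names `make_pair_sequence` and `make_pair_generator` are made up so that both variants can coexist; the bodies follow the two definitions above. The sequence builds every row before returning, while the generator hands rows over one at a time as the caller iterates.

```python
import types


def make_pair_sequence(name):
    # Sequence form: all rows exist in memory when the function returns.
    return ([name, 1], [name, 2], [name, 3])


def make_pair_generator(name):
    # Generator form: each row is produced only when it is requested.
    for i in range(3):
        yield (name, i)


sequence_rows = [tuple(row) for row in make_pair_sequence("a")]
lazy = make_pair_generator("a")
is_lazy = isinstance(lazy, types.GeneratorType)
generator_rows = list(lazy)
```

For large result sets the generator form avoids holding every row in memory at once.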
  

Accessing Packages
•  On Greenplum DB: to be available, packages must be installed on the
individual segment nodes.
–  Can use the “parallel ssh” tool gpssh to conda/pip install
–  Currently Greenplum DB ships with Python 2.6 (!)

•  Then just import as usual inside the function:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value
    AS $$
      import numpy as np
      return ((name, i) for i in np.arange(3))
    $$ LANGUAGE plpythonu;

Benefits of PL/Python
•  Easy to bring your code to the data.
•  When SQL falls short, leverage your Python (or R/Java/C)
experience quickly.
•  Apply Python across terabytes of data with minimal
overhead or additional requirements.
•  Results are already in the database system, ready for further
analysis or storage.

MADlib

Going Beyond Data Parallelism
•  Data-parallel computation via PL/Python libraries only allows
us to run ‘n’ models in parallel.
•  This works great when we are building one model for each
value of the GROUP BY column, but we need parallelized
algorithms to build a single model on all the
available data.
•  For this, we use MADlib, an open source library of parallel
in-database machine learning algorithms.
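The "one model per group" case can be pictured with a small plain-Python sketch. Everything here is made up for illustration: the rows, the group keys, and `fit_mean`, which stands in for any model-fitting routine. Each group is fitted independently, which is what makes this pattern data parallel; nothing in it helps with fitting one model across all rows.

```python
from collections import defaultdict


def fit_mean(values):
    """Stand-in 'model': the mean of one group's values."""
    return sum(values) / float(len(values))


# Hypothetical (group key, value) rows.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]

# Equivalent of GROUP BY: collect each key's values.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# One independent fit per group; these could run on different workers.
models = dict((key, fit_mean(values)) for key, values in groups.items())
```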

MADlib: The Origin
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills

•  First mention of MAD analytics was at VLDB 2009:
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein

•  Open Source!
https://github.com/madlib/madlib

•  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala

•  Active development by Pivotal
-  Latest Release: v1.4 (Nov 2013)

•  Downloads and Docs:
http://madlib.net/

MADlib Executes Algorithms In-Place
MADlib Advantages
•  No data movement
•  Use MPP architecture’s full compute power
•  Use MPP architecture’s entire memory to process data sets

[Diagram: the MADlib User sends SQL to the Master Processor, which passes SQL on to the Segment Processors; MADlib (M) runs inside the database processes.]

MADlib In-Database Functions

Predictive Modeling Library

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Linear Systems
•  Sparse and Dense Solvers

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)

Correlation
Summary

Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

Architecture
[Layered diagram, top to bottom, with the implementation language of each layer in parentheses:]
•  User Interface (SQL, generated from specification)
•  “Driver” Functions: outer loops of iterative algorithms, optimizer invocations (Python with templated SQL)
•  High-level Abstraction Layer: iteration controller, ... (Python)
•  RDBMS Built-in Functions
•  Functions for Inner Loops: for streaming algorithms (C++)
•  Low-level Abstraction Layer: matrix operations, C++ to RDBMS type bridge, … (C++)
•  RDBMS Query Processing (Greenplum, PostgreSQL, …)

How does it work? A Linear Regression Example
•  Finding linear dependencies between variables
–  y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

The y column is the vector of dependent variables y; the x1 and x2 columns make up the design matrix X.

Reminder: Linear-Regression Model
• 
•  If residuals are i.i.d. Gaussians with standard deviation σ:
–  max likelihood ⇔ min sum of squared residuals

$$f(y \mid x) \propto \exp\left(-\frac{1}{2\sigma^2} \cdot (y - x^T c)^2\right)$$

•  First-order conditions for the following quadratic objective (in c)
yield the minimizer
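A hedged reconstruction of the objective and minimizer this bullet refers to, using the standard least-squares forms that follow from the Gaussian density above. The symbols $n$, $x_i$, $y_i$ and $\hat{c}$ are assumptions introduced here; $X$ and $y$ are the design matrix and the vector of dependent variables from the example, and $X^T X$ is assumed to be invertible.

```latex
\begin{aligned}
\text{objective:}\quad
  & \sum_{i=1}^{n} (y_i - x_i^T c)^2 \;=\; \lVert y - X c \rVert^2 \\
\text{first-order condition:}\quad
  & -2\, X^T (y - X c) = 0
    \;\Longleftrightarrow\; X^T X\, c = X^T y \\
\text{minimizer:}\quad
  & \hat{c} = (X^T X)^{-1} X^T y
\end{aligned}
```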

Linear Regression: Streaming Algorithm
•  How to compute $(X^T X)^{-1} X^T y$ with a single table scan?

[Diagram: $X^T$ multiplied by $X$ gives $X^T X$, and $X^T$ multiplied by $y$ gives $X^T y$.]
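One way to picture the single-scan idea is to keep two running sums, $X^T X$ and $X^T y$, updating both as each row goes by, and to solve the small linear system only once at the end. The sketch below is plain Python under stated assumptions: the helper names (`new_state`, `accumulate`, `solve`, `fit`) are made up for illustration, the test data is synthetic (generated from y = 1 + 2·x1 + 3·x2), and none of this is MADlib's actual implementation.

```python
def new_state(k):
    """Empty accumulators: X^T X as a k x k matrix and X^T y as a k-vector."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def accumulate(state, x, y):
    """Fold one row into the running sums: X^T X += x x^T, X^T y += x * y."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def fit(rows):
    """Single pass over (y, x1, x2) rows, then one small solve."""
    state = new_state(3)
    for y, x1, x2 in rows:
        accumulate(state, [1.0, x1, x2], y)  # leading 1.0 is the intercept
    xtx, xty = state
    return solve(xtx, xty)


# Synthetic rows with known coefficients c0=1, c1=2, c2=3.
points = [(0.0, 0.3), (0.69, 0.6), (1.1, 0.9),
          (1.39, 1.2), (1.61, 1.5), (1.79, 1.8)]
rows = [(1.0 + 2.0 * x1 + 3.0 * x2, x1, x2) for x1, x2 in points]
coef = fit(rows)
```

The accumulators have a fixed size that depends only on the number of coefficients, not on the number of rows, which is what makes a streaming scan possible.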

Linear Regression: Parallel Computation
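$$X^T y = \begin{pmatrix} X_1^T & X_2^T \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \sum_i X_i^T y_i$$

•  Segment 1 computes $X_1^T y_1$
•  Segment 2 computes $X_2^T y_2$
•  Master adds the partial results to obtain $X^T y$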
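The decomposition can be checked numerically with a short plain-Python sketch. The rows are the six sample rows from the earlier query output; the split into two "segments" and the helper names `partial_sums` and `merge` are assumptions made for illustration. Each segment scans only its own rows, and the master step is an element-wise addition.

```python
def partial_sums(rows):
    """One segment's scan: (X^T X, X^T y) computed from its own rows only."""
    k = 3
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x1, x2 in rows:
        x = [1.0, x1, x2]  # leading 1.0 is the intercept column
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: element-wise sum of two segments' partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


rows = [(10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
        (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]
segment_1, segment_2 = rows[:3], rows[3:]

merged = merge(partial_sums(segment_1), partial_sums(segment_2))
full = partial_sums(rows)  # reference: one scan over all rows
```

Because addition is associative, the same merge step extends to any number of segments; only floating-point rounding can differ between the merged and single-scan results.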

Demos
•  We built demos to showcase our technology pipeline, using
Python technology.
•  Two use cases:
–  Topic and Sentiment Analysis of Tweets
–  London Road Traffic Disruption prediction

Topic and Sentiment Analysis Pipeline

[Pipeline diagram:]
•  Tweet Stream
•  Stored on HDFS
•  Loaded as external tables into GPDB (gpfdist)
•  Parallel parsing of JSON and extraction of fields using PL/Python
•  Topic Analysis through MADlib pLDA
•  Sentiment Analysis through custom PL/Python functions
•  D3.js
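The JSON parsing step could look roughly like the plain-Python sketch below, which is the kind of body that might sit inside a PL/Python function applied to each row. It is an assumption-laden illustration: the function name `extract_fields` and the choice of the `id_str`, `created_at` and `text` fields are made up here and are not taken from the demo's actual code.

```python
import json


def extract_fields(raw_tweet):
    """Return (id, created_at, text) from one JSON-encoded tweet.

    Returns None when the input is not a JSON object, so that one
    malformed record does not abort the whole scan.
    """
    try:
        tweet = json.loads(raw_tweet)
    except ValueError:
        return None
    if not isinstance(tweet, dict):
        return None
    return (tweet.get("id_str"), tweet.get("created_at"), tweet.get("text"))


# Hypothetical input records.
sample = ('{"id_str": "1", '
          '"created_at": "Mon Feb 24 10:00:00 +0000 2014", '
          '"text": "hello"}')
fields = extract_fields(sample)
bad = extract_fields("not json")
```

Since each tweet is parsed independently of every other tweet, this step is data parallel in the sense described earlier.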

Transport Disruption Prediction Pipeline

[Pipeline diagram:]
•  Transport for London Traffic Disruption feed
•  Pivotal Greenplum Database: Deduplication, Feature Creation, Modelling & Machine Learning
•  d3.js & NVD3: interactive SVG figures

Get in touch
Feel free to contact me about PL/Python, or more generally
about Data Science and opportunities available.

@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net

Massively Parallel Processing with Procedural Python (PyData London 2014)

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. Massively Parallel Processing with Procedural Python How do we use the PyData stack in data science engagements at Pivotal? Ian Huston, @ianhuston Data Scientist, Pivotal @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 2
  • 3. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (written for PL/R but applies to PL/Python): http://gopivotal.github.io/gp-r/ Ÿ  Traffic Disruption demo (if we have time) http://ds-demo-transport.cfapps.io @ianhuston © Copyright 2014 Pivotal. All rights reserved. 3
  • 4. About Pivotal Data-Driven Application Development Pivotal Data Science Labs Cloud Application Platform Data & Analytics Platform Virtualization Cloud Storage @ianhuston © Copyright 2014 Pivotal. All rights reserved. 4
  • 5. What do our customers look like? Ÿ  Large enterprises with lots of data collected –  Work with 10s of TBs to PBs of data, structured & unstructured Ÿ  Not able to get what they want out of their data –  Old Legacy systems with high cost and no flexibility –  Response times are too slow for interactive data analysis –  Can only deal with small samples of data locally Ÿ  They want to transform into data driven enterprises @ianhuston © Copyright 2014 Pivotal. All rights reserved. 5
  • 6. Open Source is Pivotal @ianhuston © Copyright 2014 Pivotal. All rights reserved. 6
  • 7. Pivotal’s Open Source Contributions Lots more interesting small projects: •  PyMADlib – Python Wrapper for MADlib https://github.com/gopivotal/pymadlib •  PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR •  Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/ •  Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql @ianhuston © Copyright 2014 Pivotal. All rights reserved. 7
  • 8. Typical Engagement Tech Setup Ÿ  Platform: –  Greenplum Analytics Database (GPDB) –  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop) Ÿ  Open Source Options (http://gopivotal.com): –  Greenplum Community Edition –  Pivotal HD Community Edition (HAWQ not included) –  MADlib in-database machine learning library (http://madlib.net) Ÿ  Where Python fits in: –  PL/Python running in-database, with nltk, scikit-learn etc –  IPython for exploratory analysis –  Pandas, Matplotlib etc. @ianhuston © Copyright 2014 Pivotal. All rights reserved. 8
  • 9. PIVOTAL DATA SCIENCE TOOLKIT 1 Find Data Platforms •  Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS 2 3 Run Code Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  Putty •  Winscp Write Code Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R 4 Write Code for Big Data In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL 5 Hadoop •  HAWQ •  Pig •  Hive •  Java 6 Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau Implement Algorithms Libraries •  MADlib Java •  Mahout R •  (Too many to list!) Text •  OpenNLP •  NLTK •  GPText C++ •  opencv Show Results Python •  NumPy •  SciPy •  scikit-learn •  Pandas Programs •  Alpine Miner •  Rstudio •  MATLAB •  SAS •  Stata •  GraphViz •  Gephi •  R (ggplot2, lattice, shiny) •  Excel 7 Collaborate Sharing Tools •  Chorus •  Confluence •  Socialcast •  Github •  Google Drive & Hangouts A large and varied tool box! @ianhuston © Copyright 2014 Pivotal. All rights reserved. 9
  • 10. PL/Python @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 10
  • 11. MPP Architectural Overview Think of it as multiple PostGreSQL servers Master Workers @ianhuston © Copyright 2014 Pivotal. All rights reserved. 11
  • 12. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  map() function in Python @ianhuston © Copyright 2014 Pivotal. All rights reserved. 12
  • 13. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: Execution: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 13
  • 14. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.} •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages SQL Master Host Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture Standby Master Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment … @ianhuston © Copyright 2014 Pivotal. All rights reserved. 14
  • 15. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install. Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. SQL wrapper Normal Python SQL wrapper CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 15
  • 16. Examples @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 16
  • 17. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type @ianhuston © Copyright 2014 Pivotal. All rights reserved. 17
  • 18. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: Sequence Generator CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value   AS  $$      return  ([  name,  1  ],  [  name,  2  ],  [  name,  3])     $$  LANGUAGE  plpythonu;   CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value    AS  $$      for  i  in  range(3):              yield  (name,  i)     $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 18
  • 19. Accessing Packages Ÿ  On Greenplum DB: To be available packages must be installed on the individual segment nodes. –  Can use “parallel ssh” tool gpssh to conda/pip install –  Currently Greenplum DB ships with Python 2.6 (!) Ÿ  Then just import as usual inside function:     CREATE  FUNCTION  make_pair  (name  text)      RETURNS  named_value   AS  $$      import  numpy  as  np      return  ((name,i)  for  i  in  np.arange(3))   $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 19
  • 20. Benefits of PL/Python Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python (or R/Java/C) experience quickly. Ÿ  Apply Python across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage. @ianhuston © Copyright 2014 Pivotal. All rights reserved. 20
  • 21. MADlib
  • 22. Going Beyond Data Parallelism
      • Data-parallel computation via PL/Python only allows us to run ‘n’ models in parallel.
      • This works great when we are building one model for each value of the GROUP BY column, but we need parallelized algorithms to build a single model on all the available data.
      • For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
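To make the "one model per group" pattern concrete, here is a small plain-Python sketch. The function names `fit_line` and `fit_per_group`, the group keys and the data are all made up for illustration; in the database the grouping would come from a GROUP BY and the fit would live inside a PL/Python function.

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = float(len(points))
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_group(rows):
    """Fit one independent model per group key, as a GROUP BY would."""
    groups = defaultdict(list)
    for key, x, y in rows:
        groups[key].append((x, y))
    # Each fit needs only its own group's rows, so the fits are independent
    # and could run on different segments at the same time.
    return dict((key, fit_line(points)) for key, points in groups.items())


# Made-up rows: (group key, x, y).
rows = [
    ("north", 0.0, 1.0), ("north", 1.0, 3.0), ("north", 2.0, 5.0),
    ("south", 0.0, 4.0), ("south", 1.0, 3.0), ("south", 2.0, 2.0),
]
models = fit_per_group(rows)
```

The limitation the slide points at is visible here: no group ever sees another group's rows, so this pattern cannot produce a single model fitted to all the data.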
  • 23. MADlib: The Origin
      • UrbanDictionary: mad (adj.): an adjective used to enhance a noun.
          1- dude, you got skills.
          2- dude, you got mad skills
      • First mention of MAD analytics was at VLDB 2009:
          MAD Skills: New Analysis Practices for Big Data
          J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca)
          http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
      • Open Source! https://github.com/madlib/madlib
      • Works on Greenplum DB, PostgreSQL and also HAWQ & Impala
      • Active development by Pivotal
          – Latest Release: v1.4 (Nov 2013)
      • Downloads and Docs: http://madlib.net/
      • MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein
  • 24. MADlib Executes Algorithms In-Place
      [Diagram: MADlib User, SQL, Master Processor, Segment Processors each running MADlib (M)]
      MADlib Advantages:
          • No data movement
          • Use MPP architecture’s full compute power
          • Use MPP architecture’s entire memory to process data sets
  • 25. MADlib In-Database Functions
      Predictive Modeling Library
          Generalized Linear Models
              • Linear Regression
              • Logistic Regression
              • Multinomial Logistic Regression
              • Cox Proportional Hazards Regression
              • Elastic Net Regularization
              • Sandwich Estimators (Huber-White, clustered, marginal effects)
          Matrix Factorization
              • Singular Value Decomposition (SVD)
              • Low-Rank
          Machine Learning Algorithms
              • Principal Component Analysis (PCA)
              • Association Rules (Affinity Analysis, Market Basket)
              • Topic Modeling (Parallel LDA)
              • Decision Trees
              • Ensemble Learners (Random Forests)
              • Support Vector Machines
              • Conditional Random Field (CRF)
              • Clustering (K-means)
              • Cross Validation
          Linear Systems
              • Sparse and Dense Solvers
      Descriptive Statistics
          Sketch-based Estimators
              • CountMin (Cormode-Muthukrishnan)
              • FM (Flajolet-Martin)
              • MFV (Most Frequent Values)
          Correlation
          Summary
      Support Modules
          Array Operations
          Sparse Vectors
          Random Sampling
          Probability Functions
  • 26. Architecture
      Layers, top to bottom:
          • User Interface
          • “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
          • High-level Abstraction Layer (iteration controller, ...)
          • RDBMS Built-in Functions
          • Functions for Inner Loops (for streaming algorithms)
          • Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
          • RDBMS Query Processing (Greenplum, PostgreSQL, …)
      Implementation languages shown alongside the layers: SQL (generated from specification), Python with templated SQL, C++
  • 27. How does it work? A Linear Regression Example
      • Finding linear dependencies between variables
          – y ≈ c0 + c1 · x1 + c2 · x2 ?
      # select y, x1, x2 from unm limit 6;
          y   |  x1  | x2
       -------+------+-----
        10.14 |    0 | 0.3
        11.93 | 0.69 | 0.6
        13.57 |  1.1 | 0.9
        14.17 | 1.39 | 1.2
        15.25 | 1.61 | 1.5
        16.15 | 1.79 | 1.8
      The y column is the vector of dependent variables y; the x1 and x2 columns make up the design matrix X.
  • 28. Reminder: Linear-Regression Model
      • If residuals are i.i.d. Gaussians with standard deviation σ:
          $f(y \mid x) \propto \exp\left( -\frac{1}{2\sigma^2} \, (y - x^T c)^2 \right)$
          – max likelihood ⇔ min sum of squared residuals
      • First-order conditions for the following quadratic objective (in c) yield the minimizer:
          $\sum_i (y_i - x_i^T c)^2 \quad\Rightarrow\quad \hat{c} = (X^T X)^{-1} X^T y$
  • 29. Linear Regression: Streaming Algorithm
      • How to compute the minimizer with a single table scan?
          $\hat{c} = (X^T X)^{-1} X^T y$
          – $X^T X = \sum_i x_i x_i^T$ and $X^T y = \sum_i x_i y_i$ are sums over rows, so both can be accumulated in one pass over the table.
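A minimal pure-Python sketch of the single-scan idea, using the six rows from the example slide. It is not MADlib's implementation: the names `solve` and `fit_streaming` are invented for this sketch, and a small Gaussian elimination stands in for a proper linear-algebra library. Each row is read once and only the 3×3 matrix $X^T X$ and the length-3 vector $X^T y$ are kept in memory.

```python
def solve(a, b):
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - known) / m[i][i]
    return c


def fit_streaming(rows):
    """Fit y ~ c0 + c1*x1 + c2*x2 while reading each row exactly once."""
    d = 3
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for y, x1, x2 in rows:
        x = (1.0, x1, x2)  # the leading 1.0 carries the intercept c0
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    # Only the small accumulated sums are needed to solve for c.
    return solve(xtx, xty)


# The six rows shown on the example slide: (y, x1, x2).
rows = [
    (10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
coefficients = fit_streaming(rows)  # [c0, c1, c2]
```

The memory needed depends on the number of columns, not the number of rows, which is what makes the approach suitable for very large tables.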
  • 30. Linear Regression: Parallel Computation
      $X^T y = \begin{pmatrix} X_1^T & X_2^T \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \sum_i X_i^T y_i$
      • Segment 1 computes $X_1^T y_1$; Segment 2 computes $X_2^T y_2$.
      • The master adds the partial results to obtain $X^T y$.
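The same decomposition can be checked in a few lines of plain Python. This is only a sketch under simple assumptions: two lists stand in for the two segments, and `partial_sums` and `merge` are hypothetical names for the per-segment step and the master's combining step.

```python
def partial_sums(segment_rows):
    """What one segment computes: its own X_i^T X_i and X_i^T y_i."""
    d = 3
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for y, x1, x2 in segment_rows:
        x = (1.0, x1, x2)  # the leading 1.0 carries the intercept
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """What the master does: add two segments' partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


# The six example rows (y, x1, x2), split across two pretend segments.
rows = [
    (10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
segment_1, segment_2 = rows[:3], rows[3:]

combined = merge(partial_sums(segment_1), partial_sums(segment_2))
single_scan = partial_sums(rows)  # same sums computed without splitting
```

Up to floating-point rounding, `combined` and `single_scan` agree, so the master only ever handles the small per-segment sums rather than the rows themselves.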
  • 31. Demos
      • We built demos to showcase our technology pipeline, using Python technology.
      • Two use cases:
          – Topic and Sentiment Analysis of Tweets
          – London Road Traffic Disruption prediction
  • 32. Topic and Sentiment Analysis Pipeline
      • Tweet Stream
      • Stored on HDFS
      • Loaded as external tables into GPDB (gpfdist)
      • Parallel parsing of JSON and extraction of fields using PL/Python
      • Topic Analysis through MADlib pLDA
      • Sentiment Analysis through custom PL/Python functions
      • D3.js
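The JSON parsing stage could look roughly like the function body below. This is a sketch, not the code used in the demo: `extract_fields` is a hypothetical name, and the keys `id_str`, `created_at` and `text` are assumptions about the feed's layout rather than something stated on the slides. Only the standard-library `json` module is used.

```python
import json


def extract_fields(raw_tweet):
    """Parse one JSON-encoded tweet and return (id, created_at, text).

    The keys used here are assumptions about the feed's layout.
    """
    try:
        tweet = json.loads(raw_tweet)
    except ValueError:
        # In this sketch, malformed input becomes an all-None row.
        return (None, None, None)
    if not isinstance(tweet, dict):
        return (None, None, None)
    return (tweet.get("id_str"), tweet.get("created_at"), tweet.get("text"))


sample = '{"id_str": "1", "created_at": "2014-02-22", "text": "hello"}'
fields = extract_fields(sample)
```

Wrapped in a PL/Python function returning a composite type, a body like this would be applied to every row of the external table, with each segment parsing its own share of the tweets.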
  • 33. Transport Disruption Prediction Pipeline
      • Transport for London Traffic Disruption feed
      • Pivotal Greenplum Database: Deduplication, Feature Creation, Modelling & Machine Learning
      • d3.js & NVD3: Interactive SVG figures
  • 34. Get in touch
      Feel free to contact me about PL/Python, or more generally about Data Science and opportunities available.
      @ianhuston
      ihuston @ gopivotal.com
      http://www.ianhuston.net
  • 35. BUILT FOR THE SPEED OF BUSINESS