Massively Parallel Processing with Procedural Python (PyData London 2014)

BUILT FOR THE SPEED OF BUSINESS

Massively Parallel Processing
with Procedural Python
How do we use the PyData stack in data science
engagements at Pivotal?
Ian Huston, @ianhuston
Data Scientist, Pivotal

© Copyright 2014 Pivotal. All rights reserved.

Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (written for PL/R but applies to PL/Python):
http://gopivotal.github.io/gp-r/
Ÿ  Traffic Disruption demo (if we have time)
http://ds-demo-transport.cfapps.io

About Pivotal
•  Data-Driven Application Development
•  Pivotal Data Science Labs
•  Cloud Application Platform
•  Data & Analytics Platform
•  Virtualization
•  Cloud Storage


What do our customers look like?
Ÿ  Large enterprises with lots of data collected
–  Work with 10s of TBs to PBs of data, structured & unstructured

Ÿ  Not able to get what they want out of their data
–  Old Legacy systems with high cost and no flexibility
–  Response times are too slow for interactive data analysis
–  Can only deal with small samples of data locally

Ÿ  They want to transform into data driven enterprises


Open Source is Pivotal


Pivotal’s Open Source Contributions
Lots more interesting small projects:

•  PyMADlib – Python Wrapper for MADlib
https://github.com/gopivotal/pymadlib

•  PivotalR – R wrapper for MADlib
http://github.com/madlib-internal/PivotalR

•  Part-of-speech tagger for Twitter via SQL
http://vatsan.github.io/gp-ark-tweet-nlp/

•  Pandas via psql
(interactive PostgreSQL terminal)
https://github.com/vatsan/pandas_via_psql


Typical Engagement Tech Setup
Ÿ  Platform:

–  Greenplum Analytics Database (GPDB)
–  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)

Ÿ  Open Source Options (http://gopivotal.com):
–  Greenplum Community Edition
–  Pivotal HD Community Edition (HAWQ not included)
–  MADlib in-database machine learning library (http://madlib.net)

Ÿ  Where Python fits in:
–  PL/Python running in-database, with nltk, scikit-learn etc
–  IPython for exploratory analysis
–  Pandas, Matplotlib etc.

PIVOTAL DATA SCIENCE TOOLKIT

1  Find Data
–  Platforms: Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS

2  Write Code
–  Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime
–  Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R

3  Run Code
–  Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, Putty, Winscp

4  Write Code for Big Data
–  In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
–  Hadoop: HAWQ, Pig, Hive, Java

5  Implement Algorithms
–  Libraries: MADlib
–  Java: Mahout
–  R: (Too many to list!)
–  Text: OpenNLP, NLTK, GPText
–  C++: opencv
–  Python: NumPy, SciPy, scikit-learn, Pandas
–  Programs: Alpine Miner, Rstudio, MATLAB, SAS, Stata

6  Show Results
–  Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel

7  Collaborate
–  Sharing Tools: Chorus, Confluence, Socialcast, Github, Google Drive & Hangouts

A large and varied tool box!


PL/Python


MPP Architectural Overview
Think of it as multiple PostgreSQL servers:
–  Master
–  Workers

Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  map() function in Python
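
The map() example can be sketched in plain Python. This is a minimal illustration rather than code from the talk: the student data and the measure_height function are made up, and a thread pool stands in for the independent workers.

```python
from concurrent.futures import ThreadPoolExecutor

def measure_height(student):
    """Independent per-student task: it needs no data from any other student."""
    name, height_cm = student
    return name, height_cm / 100.0  # convert centimetres to metres

students = [("Ann", 152), ("Bob", 148), ("Cy", 160)]

# Sequential version with the built-in map().
sequential = list(map(measure_height, students))

# The same function mapped across a pool of workers.
# pool.map() returns results in input order, so the two lists match.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(measure_height, students))

print(parallel)
```

Because no task reads another task's input or output, the work can be split across any number of workers without changing the result.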


User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.

Definition:

Execution:

CREATE
FUNCTION
times2(INT)

RETURNS
INT

AS
$$

SELECT
2
*
$1

$$
LANGUAGE
sql;

SELECT
times2(1);

times2

-‐-‐-‐-‐-‐-‐-‐-‐

2

(1
row)


PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
•  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on Greenplum’s MPP architecture

[Diagram: SQL enters at the Master Host (with a Standby Master); an Interconnect links it to several Segment Hosts, each running multiple Segments]


Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install.
Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper.
Alternatively like a SQL User Defined Function with Python inside.

SQL wrapper
Normal Python
SQL wrapper

CREATE
FUNCTION
pymax
(a
integer,
b
integer)

RETURNS
integer

AS
$$

if
a
>
b:

return
a

return
b

$$
LANGUAGE
plpythonu;
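
Because the body between the $$ markers is ordinary Python, one way to check the logic before creating the function in the database is to run the same body under a normal def line. This is only a sketch; testing outside the database like this is a suggestion, not something the talk prescribes.

```python
def pymax(a, b):
    # Same body as the PL/Python pymax function, under a plain Python def line.
    if a > b:
        return a
    return b

print(pymax(3, 7))
print(pymax(7, 3))
```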


Examples


Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:

CREATE
TYPE
named_value
AS
(

name

text,

value

integer

);

Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE
FUNCTION
make_pair
(name
text,
value
integer)

RETURNS
named_value

AS
$$

return
[
name,
value
]

#
or
alternatively,
as
tuple:
return
(
name,
value
)

#
or
as
dict:
return
{
"name":
name,
"value":
value
}

#
or
as
an
object
with
attributes
.name
and
.value

$$
LANGUAGE
plpythonu;

Ÿ  For functions which return multiple rows, prefix “setof” before the return type


Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:

Sequence:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;

Generator:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    for i in range(3):
        yield (name, i)
$$ LANGUAGE plpythonu;
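
The difference between the sequence and generator forms is easy to see with the same two bodies run as plain Python. The function names here are hypothetical, chosen only to keep the two variants apart.

```python
def make_pair_sequence(name):
    # Builds every row up front and returns them together.
    return ([name, 1], [name, 2], [name, 3])

def make_pair_generator(name):
    # Produces one row at a time, only when the caller asks for the next one.
    for i in range(3):
        yield (name, i)

print(list(make_pair_sequence("x")))
print(list(make_pair_generator("x")))
```

Note that the two slide examples are not identical in output: the sequence version numbers its rows 1 to 3, while range(3) in the generator version yields 0 to 2. The generator form avoids holding all rows in memory at once, which matters more as the number of returned rows grows.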


Accessing Packages
Ÿ  On Greenplum DB: To be available packages must be installed on the
individual segment nodes.
–  Can use “parallel ssh” tool gpssh to conda/pip install
–  Currently Greenplum DB ships with Python 2.6 (!)

Ÿ  Then just import as usual inside function:

CREATE
FUNCTION
make_pair
(name
text)

RETURNS
named_value

AS
$$

import
numpy
as
np

return
((name,i)
for
i
in
np.arange(3))

$$
LANGUAGE
plpythonu;
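
Since a package only helps if it is importable on the interpreter that actually runs the function, a small check like the following can be handy. This is a sketch with a hypothetical helper name, package_status; it is shown as plain Python and would still need a PL/Python wrapper to be run on the segments.

```python
def package_status(package_name):
    """Return (name, importable, version) for the interpreter running this code."""
    try:
        module = __import__(package_name)
    except ImportError:
        return (package_name, False, None)
    # Not every package exposes __version__, so fall back to None.
    return (package_name, True, getattr(module, "__version__", None))

print(package_status("json"))
print(package_status("no_such_package_hopefully"))
```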


Benefits of PL/Python
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python (or R/Java/C)
experience quickly.
Ÿ  Apply Python across terabytes of data with minimal
overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.

MADlib


Going Beyond Data Parallelism
Ÿ  Data Parallel computation via PL/Python libraries only allow
us to run ‘n’ models in parallel.
Ÿ  This works great when we are building one model for each
value of the group by column, but we need parallelized
algorithms to be able to build a single model on all the
available data
Ÿ  For this, we use MADlib – an open source library of parallel
in-database machine learning algorithms.
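
The "one model per group" pattern can be sketched in plain Python. The data, the group keys and the fit_slope helper are all made up for illustration; the model is a least-squares slope through the origin, chosen only because it fits in two lines.

```python
from collections import defaultdict

def fit_slope(points):
    """Least-squares slope through the origin for one group's (x, y) points."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, y in points)
    return sxy / sxx

# Rows of (group key, x, y), as a table with a group-by column might hold them.
rows = [("a", 1.0, 2.0), ("a", 2.0, 4.0), ("b", 1.0, 3.0), ("b", 2.0, 6.0)]

groups = defaultdict(list)
for key, x, y in rows:
    groups[key].append((x, y))

# One independent model per group: this is the data-parallel part.
models = {key: fit_slope(points) for key, points in groups.items()}
print(models)
```

Each group's fit needs only that group's rows, so the groups can be processed on different workers. A single model over all rows has no such split, which is why it needs an algorithm designed to combine partial results.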

MADlib: The Origin
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills

•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

•  Open Source!
https://github.com/madlib/madlib

•  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala

•  Active development by Pivotal
-  Latest Release: v1.4 (Nov 2013)

•  Downloads and Docs:
http://madlib.net/

•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein


MADlib Executes Algorithms In-Place
MADlib Advantages
•  No Data Movement
•  Use MPP architecture’s full compute power
•  Use MPP architecture’s entire memory to process data sets

[Diagram: a MADlib User sends SQL to the Master Processor, which sends SQL to the Segment Processors; the processors are each marked “M”]


MADlib In-Database Functions

Predictive Modeling Library

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Linear Systems
•  Sparse and Dense Solvers

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)

Correlation
Summary

Support Modules

Array Operations
Sparse Vectors
Random Sampling
Probability Functions


Architecture
Layers, from top to bottom:
•  User Interface
•  “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
•  High-level Abstraction Layer (iteration controller, ...)
•  RDBMS Built-in Functions
•  Functions for Inner Loops (for streaming algorithms)
•  Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
•  RDBMS Query Processing (Greenplum, PostgreSQL, …)

Implementation labels shown beside the layers: SQL, generated from specification; Python with templated SQL; Python; C++


How does it work ? : A Linear Regression Example
Ÿ  Finding linear dependencies between variables
–  y ≈ c0 + c1 · x1 + c2 · x2 ?
# select y, x1, x2

Vector of dependent
variables y

y
| x1 | x2
-------+------+----10.14 |
0 | 0.3
11.93 | 0.69 | 0.6
13.57 | 1.1 | 0.9
14.17 | 1.39 | 1.2
15.25 | 1.61 | 1.5
16.15 | 1.79 | 1.8

from unm limit 6;

Design Matrix X


Reminder: Linear-Regression Model
• 
•  If residuals are i.i.d. Gaussian with standard deviation σ:
–  max likelihood ⇔ min sum of squared residuals

\[ f(y \mid x) \propto \exp\left( -\frac{1}{2\sigma^2} \cdot (y - x^T c)^2 \right) \]

•  First-order conditions for the following quadratic objective (in c)
yield the minimizer
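
For reference, the standard least-squares objective and its first-order condition can be written out as follows. This is the textbook derivation, not a transcription of the slide: it assumes X stacks the rows x_i^T and y stacks the responses y_i, matching the symbols in the likelihood above.

```latex
\begin{align*}
\min_{c} \; & \sum_{i} (y_i - x_i^T c)^2 = (y - Xc)^T (y - Xc) \\
\text{gradient in } c \text{ set to zero:} \quad & -2\, X^T (y - Xc) = 0 \\
\Rightarrow \quad & X^T X \, c = X^T y \\
\Rightarrow \quad & c = (X^T X)^{-1} X^T y
\end{align*}
```

The last step assumes X^T X is invertible, which holds when the columns of X are linearly independent.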


Linear Regression: Streaming Algorithm
•  How to compute with a single table scan?
\[ c = (X^T X)^{-1} X^T y \]

–  X^T X: product of X^T and X
–  X^T y: product of X^T and y
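
The single-scan idea can be sketched for the simplest case, a line y ≈ c0 + c1 · x, where X^T X is only 2×2 and can be inverted by hand. This is an illustrative sketch, not MADlib's implementation; the function name and the sample data are made up.

```python
def fit_line_single_pass(rows):
    """Accumulate the entries of X^T X and X^T y in one scan, then solve for c.

    Each row is (y, x). The design-matrix row is (1, x), so
    X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy].
    """
    n = sx = sxx = sy = sxy = 0.0
    for y, x in rows:            # the single table scan
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    det = n * sxx - sx * sx      # determinant of the 2x2 matrix X^T X
    c0 = (sxx * sy - sx * sxy) / det
    c1 = (n * sxy - sx * sy) / det
    return c0, c1

rows = [(1.0, 0.0), (3.0, 1.0), (5.0, 2.0), (7.0, 3.0)]  # lies exactly on y = 1 + 2x
c0, c1 = fit_line_single_pass(rows)
print(c0, c1)
```

Only five running totals are kept, however many rows are scanned, so the memory needed depends on the number of coefficients rather than on the size of the table.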


Linear Regression: Parallel Computation
\[ X^T y = \begin{pmatrix} X_1^T & X_2^T \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \sum_i X_i^T y_i \]

–  Segment 1: X_1^T y_1
–  Segment 2: X_2^T y_2
–  Master: X^T y
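
The same decomposition can be checked numerically with the six rows from the linear regression example earlier in the deck. This is a sketch under stated assumptions: the split into two "segments" of three rows each is arbitrary, and partial_xty is a hypothetical helper name.

```python
def partial_xty(rows):
    """X_i^T y_i for one chunk of rows; each row is (y, x1, x2)."""
    acc = [0.0, 0.0]
    for y, x1, x2 in rows:
        acc[0] += x1 * y
        acc[1] += x2 * y
    return acc

table = [
    (10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
segment_1, segment_2 = table[:3], table[3:]  # hypothetical split across two segments

# Each segment computes its partial product from its own rows only...
partials = [partial_xty(segment_1), partial_xty(segment_2)]
# ...and the master just adds the small partial vectors together.
xty = [sum(column) for column in zip(*partials)]

print(xty)
print(partial_xty(table))  # same values, up to floating-point rounding
```

The master receives one short vector per segment rather than any rows, so the amount of data moved does not grow with the table.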


Demos
Ÿ  We built demos to showcase our technology pipeline, using
Python technology.
Ÿ  Two use cases:
–  Topic and Sentiment Analysis of Tweets
–  London Road Traffic Disruption prediction


Topic and Sentiment Analysis Pipeline

Pipeline stages:
•  Tweet Stream
•  Stored on HDFS
•  Loaded as external tables into GPDB (gpfdist)
•  Parallel Parsing of JSON and extraction of fields using PL/Python
•  Topic Analysis through MADlib pLDA
•  Sentiment Analysis through custom PL/Python functions
•  D3.js
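
The JSON parsing stage could look roughly like the function below when written as plain Python. This is a guess at the shape of such a function, not the demo's actual code: extract_tweet_fields is a hypothetical name, and the choice of the id_str, created_at and text fields is an assumption about which fields the demo extracts.

```python
import json

def extract_tweet_fields(raw_json):
    """Parse one tweet and return (id, created_at, text), or None if unusable."""
    try:
        tweet = json.loads(raw_json)
    except ValueError:           # malformed JSON
        return None
    if not isinstance(tweet, dict):
        return None
    return (tweet.get("id_str"), tweet.get("created_at"), tweet.get("text"))

sample = (
    '{"id_str": "1", "created_at": "Sat Feb 22 10:00:00 +0000 2014", '
    '"text": "Stuck on the A40 again"}'
)
print(extract_tweet_fields(sample))
print(extract_tweet_fields("not json"))
```

Each tweet is parsed without reference to any other, so this step is data parallel in the sense described earlier in the talk.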


Transport Disruption Prediction Pipeline

Pipeline stages:
•  Transport for London Traffic Disruption feed
•  Pivotal Greenplum Database
–  Deduplication
–  Feature Creation
–  Modelling & Machine Learning
•  d3.js & NVD3
•  Interactive SVG figures


Get in touch
Feel free to contact me about PL/Python, or more generally
about Data Science and opportunities available.

@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net
