Parallel Linear Regression in Iterative Reduce and YARN
Josh Patterson
Email: josh@floe.tv
Twitter: @jpatanooga
Github: https://github.com/jpatanooga

Past
  Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
  Grad work in meta-heuristics, ant algorithms
  Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
  Cloudera: Principal Solution Architect
Today
  Independent Consultant
Sections

1. Modern Data Analytics
2. Parallel Linear Regression
3. Performance and Results
Parallel Linear Regression in Iterative Reduce and YARN
The World as Optimization
 Data tells us about our model/engine/product
   We take this data and evolve our product towards a
   state of minimal market error
 WSJ Special Section, Monday March 11, 2013
   Zynga changing games based off player behavior
   UPS cut fuel consumption by 8.4MM gallons
   Ford used sentiment analysis to look at how new car
   features would be received
The Modern Data Landscape
 Apps are coming but they need
   Platforms
   Components
   Workflows
 Lots of investment in Hadoop in this space
   Lots of ETL pipelines
   Lots of descriptive Statistics
   Growing interest in Machine Learning
Hadoop as The Linux of Data

 Hadoop has won the cycle
   Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]

 “Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage”
   --- Doug Cutting
Today’s Hadoop ML Pipeline
 Data cleansing / ETL performed with Hive or Pig
 Data processed in place
    Mahout
    R
    Custom MapReduce algorithm
  Or processed externally
    SAS
    SPSS
    KXEN
    Weka
As Focus Shifts to Applications

 Data rates have been climbing fast
   Speed at Scale becomes the new Killer App
 Companies will want to leverage the Big Data
 infrastructure they’ve already been working with
   Hadoop
   HDFS as main storage system
 A drive to validate big data investments with results
   Emergence of applications which create “data
   products”
Patterson’s Law

“As the percent of your total data held
in a storage system approaches 100%
the amount of in-system processing
and analytics also approaches 100%”
Tools Will Move onto Hadoop

 Already seeing this with Vendors
  Who hasn’t announced a SQL engine on Hadoop
  lately?
 Trend will continue with machine learning tools
  Mahout was the beginning
  More are following
  But what about parallel iterative algorithms?
Distributed Systems Are Hard
 Lots of moving parts
   Especially as these applications become more complicated
 Machine learning can be a non-trivial operation
   We need great building blocks that work well together
 I agree with Jimmy Lin [3]: “keep it simple”
   “make sure costs don’t outweigh benefits”
 Minimize “Yet Another Tool To Learn” (YATTL) as much as
 we can!
To Summarize
 Data moving into Hadoop everywhere
   Patterson’s Law
   Focus on Hadoop, build around the next-gen “Linux of data”
 Need simple components to build next-gen data apps
   They should work cleanly with the cluster the Fortune 500 already has: Hadoop
   Should also be easy to integrate into Hadoop and the Hadoop-tool ecosystem
   Minimize YATTL
Parallel Linear Regression in Iterative Reduce and YARN
Linear Regression
 In linear regression, data is modeled using linear predictor functions
   Unknown model parameters are estimated from the data
 We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model

   Y = (c0 * x0) + (c1 * x1) + … + (cN * xN), where x0 = 1 (intercept term)
Machine Learning and Optimization

 Algorithms

 (Convergent) Iterative Methods

   Newton’s Method
   Quasi-Newton
   Gradient Descent
 Heuristics

   AntNet
   PSO
   Genetic Algorithms
Stochastic Gradient Descent

    Hypothesis about data

    Cost function

    Update function




Andrew Ng’s Tutorial:
https://class.coursera.org/ml/lecture/preview_view/11
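The three pieces named above can be written out explicitly. A standard formulation for the squared-error setup used later in the deck (notation is assumed, not taken from the slides):

```latex
% Hypothesis: linear predictor over N features, with x_0 = 1 for the intercept
h_\theta(x) = \theta^\top x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_N x_N

% Cost: mean squared error over m training examples
J(\theta) = \frac{1}{2m} \sum_{j=1}^{m} \bigl( h_\theta(x^{(j)}) - y^{(j)} \bigr)^2

% SGD update: after seeing one example (x^{(j)}, y^{(j)}),
% step against its gradient with learning rate \alpha
\theta_i \leftarrow \theta_i - \alpha \, \bigl( h_\theta(x^{(j)}) - y^{(j)} \bigr) \, x_i^{(j)}
```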
Stochastic Gradient Descent

Training
  Simple gradient descent procedure
  Loss function needs to be convex (with exceptions)
Linear Regression
  Loss function: squared error of prediction
  Prediction: linear combination of coefficients and input variables

[Diagram: Training Data → SGD → Model]
Mahout’s SGD
 Currently Single Process
  Multi-threaded parallel, but not cluster parallel
  Runs locally, not deployed to the cluster
  Tied to logistic regression implementation
Current Limitations
Sequential algorithms on a single node only go so far

The “Data Deluge”
  Presents algorithmic challenges when combined with large data sets
  Need to design algorithms that are able to perform in a distributed fashion
MapReduce only fits certain types of algorithms
Distributed Learning Strategies

 McDonald, 2010
   Distributed Training Strategies for the Structured
   Perceptron
 Langford, 2007
   Vowpal Wabbit
 Jeff Dean’s Work on Parallel SGD
   DownPour SGD
   Sandblaster
MapReduce vs. Parallel Iterative

[Diagram: MapReduce — Input → Map, Map, Map → Reduce, Reduce → Output. Parallel iterative — the same set of Processors advances together through Superstep 1, Superstep 2, …]
YARN

Yet Another Resource Negotiator
  Framework for scheduling distributed applications
  Allows any type of parallel application to run natively on Hadoop
  MRv2 is now a distributed application

[Diagram: Clients submit jobs to the Resource Manager; Node Managers host App Masters and Containers; arrows show job submission, node status, resource requests, and MapReduce status.]
IterativeReduce
 Designed specifically for parallel iterative
 algorithms on Hadoop
   Implemented directly on top of YARN
 Intrinsic Parallelism
   Easier to focus on problem
   Not focusing on the distributed application part
IterativeReduce API
 ComputableMaster
   Setup()
   Compute()
   Complete()
 ComputableWorker
   Setup()
   Compute()

[Diagram: at each superstep, a row of Workers feeds a Master]
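A minimal sketch of the call pattern this API implies, assuming in-memory workers. The nested interfaces, class name, and signatures here are illustrative only; the real IterativeReduce types carry state, the Setup()/Complete() lifecycle, and YARN plumbing that is omitted:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of one IterativeReduce superstep: each worker
// derives a partial result from the current global vector, then the
// master merges the partials into the next global vector.
public class SuperstepSketch {
    interface ComputableWorker { double[] compute(double[] global); }
    interface ComputableMaster { double[] compute(List<double[]> partials); }

    static double[] superstep(List<ComputableWorker> workers,
                              ComputableMaster master, double[] global) {
        List<double[]> partials = workers.stream()
                .map(w -> w.compute(global))
                .collect(Collectors.toList());
        return master.compute(partials);
    }

    public static void main(String[] args) {
        // Toy worker: add 1 to the (single) global parameter.
        ComputableWorker bump = g -> new double[]{g[0] + 1.0};
        // Toy master: average the workers' partial vectors.
        ComputableMaster avg = ps -> new double[]{
                ps.stream().mapToDouble(p -> p[0]).average().orElse(0.0)};
        double[] next = superstep(Arrays.asList(bump, bump), avg,
                new double[]{0.0});
        System.out.println(Arrays.toString(next)); // [1.0]
    }
}
```

The driver loop would simply call superstep() until convergence, which mirrors the "superstep" structure of the MapReduce-vs-parallel-iterative comparison earlier in the deck.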
SGD Master
 Collects all parameter vectors at each pass /
 superstep
 Produces new global parameter vector
  By averaging workers’ vectors
 Sends update to all workers
  Workers replace local parameter vector with new
  global parameter vector
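The master's merge step described above is plain parameter averaging: an element-wise mean of the workers' vectors (which implicitly assumes roughly equal-sized splits). A hedged sketch, with illustrative names rather than Metronome's actual ones:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the SGD master's merge: the new global parameter vector
// is the element-wise mean of the workers' local parameter vectors.
public class ParameterAverager {
    static double[] average(List<double[]> vectors) {
        int dim = vectors.get(0).length;
        double[] global = new double[dim];
        for (double[] v : vectors)
            for (int i = 0; i < dim; i++)
                global[i] += v[i];
        for (int i = 0; i < dim; i++)
            global[i] /= vectors.size();
        return global;
    }

    public static void main(String[] args) {
        double[] g = average(Arrays.asList(
                new double[]{1.0, 3.0}, new double[]{3.0, 5.0}));
        System.out.println(Arrays.toString(g)); // [2.0, 4.0]
    }
}
```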
SGD Worker
Each worker is given a split of the total dataset
  Similar to a map task
Performs a local SGD pass
Local parameter vector sent to master at each superstep
Stays active/resident between iterations
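A worker's local pass, sketched for the squared-error linear regression case: one SGD update per example in the split, w := w - alpha * (w·x - y) * x. Class and method names are illustrative, not Metronome's internals:

```java
// Hedged sketch of one worker-local SGD pass over a data split,
// using the gradient of the squared-error loss 1/2 (w·x - y)^2.
public class LocalSgdPass {
    static double[] pass(double[] w, double[][] xs, double[] ys, double alpha) {
        double[] local = w.clone();
        for (int n = 0; n < xs.length; n++) {
            // prediction: linear combination of weights and inputs
            double pred = 0.0;
            for (int i = 0; i < local.length; i++)
                pred += local[i] * xs[n][i];
            double err = pred - ys[n];
            // step against the per-example gradient
            for (int i = 0; i < local.length; i++)
                local[i] -= alpha * err * xs[n][i];
        }
        return local;
    }

    public static void main(String[] args) {
        // Data drawn from y = 2x: repeated passes pull the weight toward 2.
        double[][] xs = {{1.0}, {2.0}, {3.0}};
        double[] ys = {2.0, 4.0, 6.0};
        double[] w = {0.0};
        for (int epoch = 0; epoch < 200; epoch++)
            w = pass(w, xs, ys, 0.05);
        System.out.println(w[0]); // close to 2.0
    }
}
```

In the parallel scheme, each worker runs this pass on its own split, ships the resulting local vector to the master, and then replaces it with the averaged global vector before the next superstep.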
SGD: Serial vs Parallel
[Diagram: Serial — Training Data → Model. Parallel — Training Data partitioned into Split 1 … Split N; Worker 1 … Worker N each produce a Partial Model; the Master combines them into the Global Model.]
Parallel Linear Regression with IterativeReduce


  Based directly on work we did with Knitting Boar
    Parallel logistic regression
  Scales linearly with input size
  Can produce a linear regression model from large amounts of data
  Packaged in a new suite of parallel iterative algorithms
  called Metronome
    100% Java, ASF 2.0 Licensed, on github
Unit Testing and IRUnit
 Simulates the IterativeReduce parallel framework
   Uses the same app.properties file that YARN
   applications do
 Examples
   https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
   https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Parallel Linear Regression in Iterative Reduce and YARN
Running the Job via YARN
 Build with Maven

 Copy Jar to host with cluster access

 Copy dataset to HDFS

 Run job
  yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
Results

[Chart: Linear Regression — Parallel vs Serial. Y-axis: total processing time (0–200); X-axis: megabytes processed total (64–320); series: Parallel Runs, Serial Runs.]
Lessons Learned
 Linear scale continues to be achieved with
 parameter averaging variations
 Tuning is critical
   Need to be good at selecting a learning rate
 YARN still experimental, has caveats
   Container allocation is still slow
   Metronome continues to be experimental
Special Thanks
 Michael Katzenellenbollen

 Dr. James Scott
  University of Texas at Austin
 Dr. Jason Baldridge
  University of Texas at Austin
Future Directions
 More testing, stability
 Cache vectors in memory for speed
 Metronome
   Take on properties of LibLinear
     Pluggable optimization, general linear models
   YARN-centric first class Hadoop citizen
   Focus on being a complement to Mahout
   K-means, PageRank implementations
Github
 IterativeReduce
  https://github.com/emsixteeen/IterativeReduce
 Metronome
  https://github.com/jpatanooga/Metronome
 Knitting Boar
  https://github.com/jpatanooga/KnittingBoar
References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf

Editor's Notes

  • #9: Reference some thoughts on attribution pipelines
  • #16: Talk about how you normally would use the Normal equation, notes from Andrew Ng
  • #18–#19: “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners with no loss in model accuracy.
  • #20: The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  • #21: At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes 6 hours to read the contents of a single hard drive.
  • #22: Bottou similar to Xu2010 in the 2010 paper
  • #23: Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  • #25: Performance still largely dependent on implementation of algo
  • #29: POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout’s SGD is well known, so we used that as a base point.