Ensemble
Algorithms
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Module 6:
Ensemble approaches
Module Checklist:
❏ Ensemble approaches
❏ Bootstrap
❏ Bagging
❏ Random forest
❏ Boosting
What we’ve done:
Exploratory Analysis
Linear Regression (our first model)
Decision Trees (our second model)
Up next:
Even more models!
Where are we?
[Pipeline diagram: Question → Exploratory Analysis → Modeling Phase → Validation Phase, annotated "Building intuition" and "Expanding our toolkit"]
Source: Udacity - Model Building and Validation
Recap: In this example, we
predict Sam’s weekend activity
using decision rules trained on
historical weekend behavior.
[Decision tree diagram: root "How much $$ do I have?" branches into "Raining?" and "Girlfriend?", whose Y/N branches lead to the leaves Concert!, Movie!, Walk in the park!, and Clubbing!]
Decision
Tree Task
Defining f(x)
The decision tree f(x) predicts the
value of a target variable by
learning simple decision rules inferred
from the data features.
Our most important
predictive feature is Sam’s
budget. How do we know
this? Because it is the
root node.
Source: Friedman, Hastie, and Tibshirani. The elements of statistical learning. Vol. 1. Springer, Berlin: Springer series in statistics, 2001.
Decision
Tree Task
Recap: You are Sam’s weekend planner. What
should he do this weekend?
Let’s get this
weekend
started!
You’ve run a decision tree for
Sam, and now you’ve got a model.
But does it work well?
As always, we use our test data to
check our model before we tell
him what to do with his
weekend.
Performance
Ability to
generalize to
unseen data
Our goal in evaluating
performance is to find
a sweet spot between
overfitting and
underfitting.
Recall our discussion of over and underfitting in previous modules:
[Diagram: the underfit-to-overfit spectrum, with the sweet spot in between]
Our most important goal is to build a
model that will generalize well to
unseen data.
Performance
How do we
measure
underfitting/
overfitting?
Figuring out whether you are overfitting or
underfitting involves knowing how to
compare your training results to your test
results.
Training R² vs. Test R² | Condition
high > low | Overfitting
high ~ high | Sweet spot
low ~ low | Underfitting
low < high | Never happens
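As a rough sketch, this table can be read as a Python helper (our addition; the cutoffs for "high" and for "~" below are illustrative assumptions, not values from the course):

def diagnose_fit(train_r2, test_r2, high=0.7, tol=0.1):
    """Rough diagnosis from train/test R^2, following the table above.

    `high` and `tol` are illustrative cutoffs, not course-defined values.
    """
    if train_r2 - test_r2 > tol:
        return "overfitting: train R^2 >> test R^2"
    if test_r2 - train_r2 > tol:
        return "never happens in practice: check your evaluation setup"
    if train_r2 >= high:
        return "sweet spot: both high"
    return "underfitting: both low"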
[Decision tree trained on the 83 training weekends: root "How much $$ do I have?" (n=83) splits at $50 into "Raining?" (n=50) and "Partner?" (n=33), with Y/N leaves Concert!, Movie!, Walk in the park!, and Clubbing! (leaf counts n=19, n=31, n=30, n=3)]
Let’s see how this works in practice.
Firstly, we train our model f(x) using
training data.
[Data: n=104; train: 83 / test: 21]
Performance
Model
evaluation
1. Split data into train/test
2. Run model on train data
3. Test model on test data
Model error = Train True Y - Train Y*
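Here is a minimal sketch of these three steps, assuming Python with scikit-learn (the slides do not name a library); the data are synthetic stand-ins for Sam's 104 weekends:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for Sam's 104 weekends: 3 features
# (budget, raining, partner) and 4 possible activities.
rng = np.random.default_rng(0)
X = rng.random((104, 3))
y = rng.integers(0, 4, size=104)

# 1. Split data into train/test (83 train / 21 test, as on the slide)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=21, random_state=0)

# 2. Run model on train data
f = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 3. Test model on test data
# Model error compares Train True Y to Train Y*;
# generalization error compares Test True Y to Test Y*.
model_error = 1 - accuracy_score(y_train, f.predict(X_train))
generalization_error = 1 - accuracy_score(y_test, f.predict(X_test))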
[Same decision tree as above]
1. Split data into train/test
2. Run model on train data
3. Test model on test data
Remember generalization error?
Review Module 4!
Now, we use the f(x) developed on the
training data to score unseen test
data.
[Data: n=104; train: 83 / test: 21]
Performance
Model
evaluation
[Diagram: test inputs (x's) flow through f(x) to produce Test Y*, which is compared against Test True Y]
Generalization error = Test True Y - Test Y*
The holdout set method is great - it lets us test our model on
unseen data, the most important test of any model.
However, one potential problem arises:
What if our test dataset, even though it was picked randomly, is
unrepresentative of the data?
E.g. We managed to pick the 21 weekends in Sam’s dataset where he had just broken
up with his girlfriend, or failed a test, or fought with his friend, and ended up staying
home. Then our test set would say that our model is awful and didn’t predict Y*
accurately.
There are some shortcomings
associated with the holdout method as
a way to do model evaluation.
Performance
Model
evaluation
[Data: n=104; train: 83 / test: 21]
We can do better...
A powerful way to overcome any issues with
a biased single holdout is to run the model
many times.
If a single run of your model is one expert opining on the
data, an ensemble approach gathers a crowd of experts.
Source: Fortmann-Roe, Accurately Measuring Model Prediction Error.
http://scott.fortmann-roe.com/docs/MeasuringError.html
[Illustration: our model alone vs. our model plus model friends]
Performance
Model
evaluation
Ensemble approaches
1. Bootstrapping
2. Bagging
3. Random forests
4. Boosting
Central concept: teamwork!
Ensemble Models: model cheat sheet

Bootstrapping
● Method of repeated sampling with replacement

Bagging
● Bootstrap aggregation: taking the average of the predicted Y*s from bootstrapped samples
● Random forest is a bagging method
● We are able to calculate out-of-bag error instead of using a test/train split

Boosting
● Iterative: each tree learns from the tree that was run last
● The algorithm weights each training example by how incorrectly it was classified
Bootstrapping, bagging, random forests
and boosting all leverage a crowd of
experts.
Bootstrapping Bagging Random Forest Boosting
Bootstrapping is a resampling
method that takes random
samples with replacement
from the whole dataset.
[Diagram: a single holdout split of n=104 into train: 83 / test: 21]
Instead of only using one holdout, we
repeatedly construct different holdouts
from the dataset.
Bootstrapping
Example of a single holdout split.
Bootstrapping repeats this many,
many times. We set the number
of holdouts as a
hyperparameter.
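A minimal sketch of bootstrapping itself, using NumPy (our construction; B, the number of resamples, is the hyperparameter mentioned above):

import numpy as np

rng = np.random.default_rng(0)
n, B = 104, 5  # 104 weekends; B resamples is a hyperparameter we set

for b in range(B):
    # Sample n row indices *with replacement*: some weekends repeat,
    # others are left out entirely.
    in_bag = rng.choice(n, size=n, replace=True)
    # The left-out rows can act as that resample's holdout set.
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)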
Bagging is an implementation of
bootstrapping: it involves training a model
on each random sample drawn by
bootstrapping and averaging the predictions.
Bootstrapping Bagging Random Forest Boosting
We train multiple models on
random subsets of the dataset
and average the predictions.
By averaging the predictions, the
influence of any unrepresentative
training set is reduced.
[Diagram: five Y* predictions averaged into a single Y*]
Bagging improves upon a single holdout by taking the
average predicted Y* of bootstrapped random samples.
Bagging
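A minimal bagging sketch (our construction, assuming Python with scikit-learn and synthetic data): train one tree per bootstrap sample, then average the predicted Y*s.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((104, 3))   # synthetic stand-in for Sam's data
y = rng.random(104)

B = 25  # number of bootstrap samples (a hyperparameter)
all_preds = []
for b in range(B):
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    all_preds.append(tree.predict(X))

# The ensemble's Y* is the average of the B individual predictions.
y_star = np.mean(all_preds, axis=0)

scikit-learn also packages this pattern directly as BaggingRegressor / BaggingClassifier.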
Which do you think does a
better job of estimating
true Y?
[Diagram: bagging vs. a normal holdout. Bagging draws five training sets (train: 83 each) from n=104 and averages their five Y*s into one Y*; the normal holdout draws a single training set (train: 83) and produces a single Y*]
In Sam’s case, we still have the potential problem
of an unrepresentative training dataset. However,
now that we’re taking different training sets and
averaging them, the chance of an
unrepresentative training set over-influencing
the Y* is reduced.
[Diagram: five bootstrapped training sets (train: 83 each), their Y*s averaged into a single Y*]
Bagging almost always
outperforms a single holdout.
Bagging
Out-of-Bag Score
Another amazing benefit of using bagging
algorithms is the out-of-bag score.
The out-of-bag score is the error rate of
observations not used in each decision tree.
Source:
https://www.quora.com/What-is-the-out-of-bag-error-in-Random-Forests
Bagging Out-of-bag score
[Diagram: five bootstrap samples of n=104, each with train: 83; the 21 left-out observations (104 - 83 = 21) serve as each sample's out-of-bag test set]
Out-of-Bag Score
The out-of-bag score is the error rate of
observations not used in each decision tree.
Why it matters:
There is empirical evidence to show that the
out-of-bag estimate is as accurate as using a test
set of the same size as the training set. Therefore,
using the out-of-bag error estimate removes the need
for a set-aside test set.
Source: Breiman, 1996
Bagging Out-of-bag score
[Same out-of-bag diagram as above]
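A sketch of reading the out-of-bag score directly, assuming scikit-learn (whose bagging ensembles accept an oob_score=True flag; the data are synthetic):

import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.random((104, 3))          # synthetic stand-in for Sam's data
y = rng.integers(0, 2, size=104)

bagger = BaggingClassifier(
    n_estimators=50,
    oob_score=True,   # score each tree on its own out-of-bag rows
    random_state=0,
).fit(X, y)
print(bagger.oob_score_)  # OOB accuracy: no set-aside test set needed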
Out-of-Bag Score
Out-of-bag score can be calculated for any
bootstrap aggregation method, including:
- Random forest
- Bagging
- Boosting
Bagging Out-of-bag score
[Same out-of-bag diagram as above]
Is bagging perfect? What are some
potential tradeoffs?
One key trade-off is that
training and assessing the
performance of every additional
holdout costs us computational
power and time.
The computational cost is
driven by the data sample size
and number of holdouts.
There are a few key limitations to
bagging.
[Diagram: five Y* predictions averaged into a single Y*]
Bagging
Subsets of the same data may split on the same features and result in very
similar predictions.
[Decision tree: root "How much $$ do I have?" (>= $50 / < $50) splits into "Raining?" and "Girlfriend?", with Y/N leaves Concert!, Movie!, Walk in the park!, and Clubbing!]
A key limitation of bagging is that it
may yield correlated (or very similar)
trees.
Bagging
[Two more trees identical to the one above: all three split on the same features]
Many identical trees become an
echo chamber of overfitted trees
that repeatedly yields a similar Y*
value and the same important
features. This gives
us false confidence in our results.
[Diagram: five near-identical training sets (train: 83 each, from n=104) yield five near-identical Y*s, averaged into a single Y*]
We can do better...
You’re doing
“great”!
Budget is the
best feature,
believe me
Correlated trees may give us false
confidence since they repeatedly yield the
same features.
Bagging
Random forest improves on bagging’s
tendency to result in correlated trees.
Bootstrapping Bagging Random Forest Boosting
Random forest improves upon bagging by
only considering a random subset of
features.
Source: https://dimensionless.in/introduction-to-random-forest/
Random forest is an
implementation of bagging. It
improves on bagging by
de-correlating trees.
At every split, it only considers
a random subset of the features.
[Diagram: from the feature set a, b, c, d, e, f, each tree announces its own random subset ("I'm going to grow a tree using a, b, c!", "... a, e, d!", "... d, e, f!", "... b, c, d!"); the four trees' Y*s are averaged into a single Y*]
Random Forest
Random forest counteracts
overfit models
Here, we are still using random subsets of the
data, but in addition to randomly selecting a
number of observations, we also randomly
select a subset of the features at each split.
Random forest helps solve the problem of
overfitting.
Note that we can still calculate an accuracy score using OOB
Performance
Improving on
bagging
[Same feature-subset diagram as above]
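A sketch of the feature-subsetting idea with scikit-learn's RandomForestClassifier (our assumption for the implementation; six synthetic features stand in for a through f):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((104, 6))          # six features standing in for a..f
y = rng.integers(0, 4, size=104)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features=3,   # each split considers only a random 3 of the 6 features
    oob_score=True,   # we can still read an OOB accuracy, as noted above
    random_state=0,
).fit(X, y)
print(forest.oob_score_)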
Finally, boosting is a procedure that
iteratively learns by combining many weak
classifiers to produce a powerful
committee.
Bootstrapping Bagging Random Forest Boosting
IMPORTANT NOTE: Boosting is one of the most powerful learning ideas
introduced in the last 20 years. It sounds similar to but is fundamentally
different from bagging and other committee-based approaches.
Boosting also creates subsets of training
data using bootstrap, but each tree learns
from the previous trees: that is, each tree
is not random.
Source: Carnegie Mellon University,
http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Slides/decisiontrees-boosting.pdf
Our model gets “brighter”
Unlike random forest, each tree is not
random in boosting
Boosting
How does the model learn?
Boosting uses many weak classifiers to
make a single strong classifier. A weak
classifier is one whose error
rate is only slightly better than random
guessing.
Boosting sequentially applies weak
classification algorithms to repeatedly
modified versions of the data.
How is the data modified?
Our Model gets “brighter”
Boosting
Combining weak classifiers = one strong
classifier
Each prediction is combined through a
weighted majority vote to produce the final
prediction.
For each iteration, the algorithm gives
higher weight to observations that were
classified incorrectly. This forces the
algorithm to concentrate on training
observations that were misclassified in
previous iterations.
Our Model gets “brighter”
Boosting
Let’s go through each step of the algorithm
Boosting forces the model to focus on
hard-to-classify observations
1. Use the whole data set to train a model to
produce Y*
2. Evaluate performance (true Y - Y*)
3. Create training set #2 including
observations that were incorrectly
classified
4. Repeat steps 2-3
Results in low model error, but there is risk of
overfitting
Source:
https://www.analyticsvidhya.com/blog/2015/09/questions-ensemble-modeling/
Our Model gets “brighter”
Boosting
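A minimal sketch of this loop using AdaBoost, our assumption for the specific algorithm since the slides describe the reweighting scheme without naming one (scikit-learn >= 1.2 spells the weak-learner argument estimator; the data are synthetic):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((104, 3))
y = rng.integers(0, 2, size=104)

booster = AdaBoostClassifier(
    # Depth-1 trees ("stumps") are the classic weak classifier.
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,   # number of boosting iterations
    random_state=0,
).fit(X, y)

# Each iteration reweights misclassified observations, and the final
# prediction is a weighted majority vote of the weak classifiers.
y_star = booster.predict(X)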
We’ve covered a lot! By now, you have an arsenal
of supervised learning algorithms to apply in
many situations.
In the next module, we will look at unsupervised
algorithms and what they can tell us.
Bootstrapping Bagging Random Forest Boosting
End of theory
Module Checklist:
✓ Ensemble approaches
✓ Bootstrap
✓ Bagging
✓ Random forest
✓ Boosting
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Congrats! You finished
module 6
Find out more about
Delta’s machine
learning for good
mission here.