BrightTalk
5/20/2014
Building Random Forest at Scale
Michal Malohlava
@mmalohlava
@hexadata
Who am I?
Background
•PhD in CS from Charles University in Prague, 2012
•1 year PostDoc at Purdue University experimenting with algorithms for large-scale computation
•1 year at 0xdata helping to develop the H2O engine for big-data computation

Experience with domain-specific languages, distributed systems, software engineering, and big data.
Overview
1. A little bit of theory
2. Random Forest observations
3. Scaling & distribution of Random Forest
4. Q&A
Tree Planting
What is a model for this data?
•Training sample of points covering area [0,3] x [0,3]
•Two possible colors of points
[Figure: the training points scattered on the [0,3] x [0,3] grid (axes X and Y)]
What is a model for this data?
The model should be able to predict the color of a new point.
[Figure: the same grid with a new, uncolored point - what is the color of this point?]
Decision tree
[Figure: decision tree with splits such as x<0.8, y<0.8, x<2, x<1.7, y<2, y<2.3, and the corresponding partition of the [0,3] x [0,3] grid]
How to grow a decision tree?
Split the rows in a given node into two sets with respect to an impurity measure
•The smaller the impurity, the more skewed the class distribution
•Compare the impurity of the parent with the impurity of its children (see the sketch below)
A. Possible impurity measures
• Gini, entropy, RSS
B. Respect the type of feature - nominal, ordinal, continuous
[Figure: partially grown tree with the node currently being split marked "???"]
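To make the parent-vs-children comparison concrete, here is a minimal sketch of Gini impurity and an exhaustive best-split search over a single continuous feature on a small in-memory dataset. This is not H2O's implementation; the names GiniSplit, gini and bestSplit are illustrative only.

// Sketch: Gini impurity and a best-split search on one continuous feature.
public class GiniSplit {

  // Gini impurity of a node given its per-class counts.
  static double gini(int[] counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double g = 1.0;
    for (int c : counts) {
      double p = (double) c / total;
      g -= p * p;
    }
    return g;
  }

  // Threshold on `feature` maximizing the impurity decrease
  // (parent impurity minus weighted impurity of the two children).
  static double bestSplit(double[] feature, int[] label, int numClasses) {
    double bestGain = 0.0, bestThreshold = Double.NaN;
    int[] parent = new int[numClasses];
    for (int l : label) parent[l]++;
    double parentGini = gini(parent);

    // Try the midpoint between every pair of distinct sorted values.
    double[] sorted = feature.clone();
    java.util.Arrays.sort(sorted);
    for (int i = 1; i < sorted.length; i++) {
      if (sorted[i] == sorted[i - 1]) continue;
      double threshold = (sorted[i] + sorted[i - 1]) / 2.0;
      int[] left = new int[numClasses], right = new int[numClasses];
      for (int r = 0; r < feature.length; r++) {
        if (feature[r] < threshold) left[label[r]]++; else right[label[r]]++;
      }
      int nLeft = 0, nRight = 0;
      for (int c : left)  nLeft  += c;
      for (int c : right) nRight += c;
      double childGini = (nLeft * gini(left) + nRight * gini(right)) / feature.length;
      double gain = parentGini - childGini;
      if (gain > bestGain) { bestGain = gain; bestThreshold = threshold; }
    }
    return bestThreshold;
  }
}

The same search is simply repeated for each candidate feature, and the split with the largest impurity decrease wins.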
When to stop growing the tree?
1. Build the full tree
or
2. Apply a stopping criterion - a limit on:
•Tree depth, or
•Minimum number of points in a leaf
[Figure: a shallower tree produced by the stopping criterion]
How to assign a leaf value?
•If the leaf contains only one point, its color becomes the leaf value
•Else the majority color is picked, or the color distribution is stored (see the sketch below)
[Figure: tree with the leaf being assigned a value marked "???"]
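As an illustration of the second case, a leaf can keep the per-class distribution of the points that reached it and report the majority class. The Leaf class below is a hypothetical sketch, not H2O's data structure.

// Sketch: a leaf storing the class distribution of the points it contains.
public class Leaf {
  final double[] distribution;   // per-class fraction of training points in this leaf

  Leaf(int[] classCounts) {
    int total = 0;
    for (int c : classCounts) total += c;
    distribution = new double[classCounts.length];
    for (int i = 0; i < classCounts.length; i++)
      distribution[i] = total == 0 ? 0.0 : (double) classCounts[i] / total;
  }

  // Majority vote: the class with the largest share.
  int predict() {
    int best = 0;
    for (int i = 1; i < distribution.length; i++)
      if (distribution[i] > distribution[best]) best = i;
    return best;
  }
}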
Decision tree
The tree covers the whole area with rectangles, each predicting a point color.
[Figure: the decision tree and the resulting rectangular partition of the [0,3] x [0,3] grid]
Decision tree scoring
The model can predict a point color based on its coordinates.
[Figure: a new point routed down the decision tree to obtain its predicted color on the grid]
Overfitting
The tree perfectly represents the training data (0% training error), but it has also learned the noise!
[Figure: the partitioned grid with the mislabeled "noise" points highlighted]
Overfitting
And hence it poorly predicts a new point!
The expected color was RED, since the points follow a "chess board" pattern.
Handle overfitting
Pre-pruning via a stopping criterion

Post-pruning: decreases the complexity of the model and helps with model generalization

Randomize tree building and combine the trees together
"The model should have low training error but also low generalization error!"
Random Forest idea
Breiman, L. (2001). Random forests. Machine Learning, 5–32.
http://link.springer.com/article/10.1023/A:1010933404324
Randomize #1: Bagging
Prepare a bootstrap sample for each tree by sampling with replacement.
[Figure: three bootstrap samples of points drawn from the training grid]
Randomize #1: Bagging
[Figure: the different decision trees grown on different bootstrap samples]
Randomize #1: Bagging
Each tree sees only a sample of the training data and captures only a part of the information.

Build multiple weak trees which vote together to give the resulting prediction (see the sketch below)
•voting is based on a majority vote, or a weighted average
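A minimal sketch of both halves of this idea, assuming a simple in-memory dataset (the names Bagging, bootstrapSample and vote are illustrative, not H2O's API):

import java.util.Random;

// Sketch: bootstrap sampling with replacement plus majority voting.
public class Bagging {

  // Row indices of one bootstrap sample: n draws with replacement from n rows.
  static int[] bootstrapSample(int numRows, Random rng) {
    int[] sample = new int[numRows];
    for (int i = 0; i < numRows; i++) sample[i] = rng.nextInt(numRows);
    return sample;
  }

  // Majority vote over the class predicted by each tree for one data point.
  static int vote(int[] perTreePredictions, int numClasses) {
    int[] counts = new int[numClasses];
    for (int p : perTreePredictions) counts[p]++;
    int best = 0;
    for (int c = 1; c < numClasses; c++)
      if (counts[c] > counts[best]) best = c;
    return best;
  }
}

Rows that never appear in a tree's bootstrap sample are exactly its out-of-bag rows, which are used for validation later.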
Randomize #2: Feature selection
Randomized split selection
[Figure: tree node whose split is being chosen, marked "???"]
•Randomly select a subset of features of size sqrt(#features)
•Select the best split using only that subset (see the sketch below)
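A sketch of the feature-subsampling step (the names FeatureSubset and randomSubset are hypothetical); the best-split search from the earlier sketch would then be run only over the returned feature indices.

import java.util.Random;

// Sketch: pick ~sqrt(#features) features at random for one split decision.
public class FeatureSubset {

  static int[] randomSubset(int numFeatures, Random rng) {
    int k = Math.max(1, (int) Math.sqrt(numFeatures));
    int[] all = new int[numFeatures];
    for (int i = 0; i < numFeatures; i++) all[i] = i;
    // Partial Fisher-Yates shuffle: the first k entries are the chosen subset.
    for (int i = 0; i < k; i++) {
      int j = i + rng.nextInt(numFeatures - i);
      int tmp = all[i]; all[i] = all[j]; all[j] = tmp;
    }
    return java.util.Arrays.copyOf(all, k);
  }
}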
Out-of-bag points and validation
Each tree is built over a sample of the training points.

The remaining points are called "out-of-bag" (OOB).
[Figure: training grid with in-bag and out-of-bag points distinguished]
These points are used for validation and give a good approximation of the generalization error, almost identical to N-fold cross-validation. (A sketch of the OOB error estimate follows.)
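A sketch of how the OOB error estimate can be computed, assuming each tree remembers which rows were in its bootstrap sample (the OobError class and its arrays are hypothetical):

// Sketch: each row is scored only by the trees that did NOT see it during
// training, and the resulting votes are compared against the true labels.
public class OobError {

  // inBag[t][r]     : true if row r was in the bootstrap sample of tree t
  // treeVotes[t][r] : class predicted by tree t for row r
  static double oobError(boolean[][] inBag, int[][] treeVotes, int[] labels, int numClasses) {
    int wrong = 0, scored = 0;
    for (int r = 0; r < labels.length; r++) {
      int[] counts = new int[numClasses];
      for (int t = 0; t < inBag.length; t++)
        if (!inBag[t][r]) counts[treeVotes[t][r]]++;   // only OOB trees vote
      int best = 0, total = 0;
      for (int c = 0; c < numClasses; c++) {
        total += counts[c];
        if (counts[c] > counts[best]) best = c;
      }
      if (total == 0) continue;                        // row was in-bag for every tree
      scored++;
      if (best != labels[r]) wrong++;
    }
    return scored == 0 ? Double.NaN : (double) wrong / scored;
  }
}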
Advantages of Random Forest
Independent trees which can be built in parallel

The model does not overfit easily

Produces reasonable accuracy

Brings more features to analyze data - variable importance, proximities, missing-value imputation

Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
A Few Observations
Covtype dataset
Sampling rate impact
Number of split features
Variable importance
Building Forests with H2O
H2O platform
Challenges
Parallelize and distribute the Random Forest algorithm
•Keep the computation with the data
•Minimize data transfers
Preserve Random Forest properties
•Split nodes in an efficient way
•Sample and keep track of OOB samples
Handle large trees
Implementation #1
Build independent trees on each machine's local data
•RVotes approach
•Each node builds a subset of the forest
Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and
accurate approach. The Journal of Machine Learning Research, 5, p421–451.
Implementation #1
Fast - trees are independent and can be built in parallel
Data have to fit into memory

Possible accuracy decrease since each node sees only a subset of the data (a sketch of this scheme follows)
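A conceptual sketch of this scheme, assuming a hypothetical Tree/TreeBuilder interface (this is not H2O's API): every node grows its share of the forest on its local data shard, and only the trees, not the data, are exchanged afterwards.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: each node builds part of the forest on its own data shard;
// the global forest is the concatenation of the per-node forests.
public class LocalForests {

  interface Tree { int predict(double[] row); }
  interface TreeBuilder { Tree build(double[][] localRows, int[] localLabels, Random rng); }

  static List<Tree> buildLocalForest(double[][] localRows, int[] localLabels,
                                     int treesPerNode, TreeBuilder builder, long seed) {
    List<Tree> forest = new ArrayList<>();
    Random rng = new Random(seed);
    for (int t = 0; t < treesPerNode; t++)
      forest.add(builder.build(localRows, localLabels, rng));
    return forest;
  }
}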
Implementation #2
Build a distributed tree over all data
[Figure: the dataset points distributed across the cluster]
Implementation #2
Each data point has an assigned tree node
•stored in a temporary vector
[Figure: dataset points, each labeled with its current tree node]
A pass over the data points means visiting tree nodes.
Implementation #2
Each data point has an in/out-of-bag flag
•stored in a temporary vector
[Figure: dataset points with a parallel vector of in/out-of-bag (I/O) flags]
Trick for on-the-fly scoring: the position of out-of-bag rows inside the tree is tracked as well.
Implementation #2
The tree is built layer by layer
•Each node prepares a histogram for splitting
[Figure: active tree layer above the dataset points and their in/out-of-bag flags]
Implementation #2
The tree is built layer by layer
•Histograms are reduced and a new layer is prepared (see the sketch below)
[Figure: active tree layer above the dataset points and their in/out-of-bag flags]
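A sketch of the per-layer histogram idea under simplifying assumptions (fixed equal-width binning, a single feature and tree node shown; all class and method names are hypothetical). Each machine computes local histograms for the active layer, a reduce step sums them across machines, and the best split for each node is then read off the merged counts.

// Sketch: local histogram accumulation and the reduce (merge) step.
public class LayerHistograms {

  // counts[bin][class] for one (tree node, feature) pair over local rows.
  static int[][] localHistogram(double[] featureColumn, int[] labels, boolean[] rowInNode,
                                double min, double max, int numBins, int numClasses) {
    int[][] counts = new int[numBins][numClasses];
    double width = (max - min) / numBins;
    for (int r = 0; r < featureColumn.length; r++) {
      if (!rowInNode[r]) continue;                     // row belongs to another tree node
      int bin = (int) ((featureColumn[r] - min) / width);
      if (bin < 0) bin = 0;
      if (bin >= numBins) bin = numBins - 1;
      counts[bin][labels[r]]++;
    }
    return counts;
  }

  // Reduce step: element-wise sum of the histograms from two machines.
  static void merge(int[][] into, int[][] other) {
    for (int b = 0; b < into.length; b++)
      for (int c = 0; c < into[b].length; c++)
        into[b][c] += other[b][c];
  }
}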
Implementation #2
Exact solution - no decrease in accuracy

Elegant solution merging tree building and OOB scoring

More data transfers are needed to exchange histograms
Can produce huge trees (since tree size depends on the data)
Tree representation
Internally stored in a compressed format as a byte array
•But it can be pretty huge (>10MB)

Externally it can be stored as code:
class Tree_1 {
  static final float predict(double[] data) {
    float pred = ((float) data[3 /* petal_wid */] < 0.8200002f ? 0.0f
        : ((float) data[2 /* petal_len */] < 4.835f
            ? ((float) data[3 /* petal_wid */] < 1.6600002f ? 1.0f : 0.0f)
            : ((float) data[2 /* petal_len */] < 5.14475f
                ? ((float) data[3 /* petal_wid */] < 1.7440002f
                    ? ((float) data[1 /* sepal_wid */] < 2.3600001f ? 0.0f : 1.0f)
                    : 0.0f)
                : 0.0f)));
    return pred;
  }
}
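Scoring a single row against such a generated class might look like the following (a hypothetical usage example; the feature order is inferred from the comments embedded in the generated code):

// Hypothetical usage of the generated Tree_1 class above.
public class ScoreOneRow {
  public static void main(String[] args) {
    double[] row = {5.1, 3.5, 1.4, 0.2};   // sepal_len, sepal_wid, petal_len, petal_wid
    float cls = Tree_1.predict(row);        // 0.0f -> first class
    System.out.println("Predicted class index: " + cls);
  }
}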
Lesson learned
Preserving deterministic computation is crucial!

Trees need to be sent around the cloud for validation, which can be expensive!

Tracking out-of-bag points can be tricky!

Clever data binning is a key trick to decrease memory consumption
Time for questions
Thank you!
Learn more about H2O at
0xdata.com
or
git clone https://github.com/0xdata/h2o
Thank you!
Follow us at @hexadata
References
•0xdata, H2O: https://github.com/0xdata/h2o/
•Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine Learning, Vol. 36. Kluwer, p85–103.
•Breiman, L. (2001). Random forests. Machine Learning, 5–32. Retrieved from http://link.springer.com/article/10.1023/A:1010933404324
•Breiman, L., Cutler, A. Random Forest. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
•Chawla, N., & Hall, L. (2004). Learning ensembles from bites: A scalable and accurate approach. The Journal of Machine Learning Research, 5, p421–451.
•Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Retrieved from http://link.springer.com/content/pdf/10.1007/978-0-387-84858-7.pdf
