Random Forest Algorithm for Regression
Abhay Kumar • September 17, 2018
Data Science and Artificial Intelligence
This entry is part 7 of 17 in the series Machine Learning Algorithms
Introduction to Random Forest Algorithm:
The goal of this blog post is to equip beginners with the basics of the Random Forest algorithm so that they can easily build their first model.
Ensemble methods are supervised learning models which combine the predictions of multiple smaller models to
improve predictive power and generalization.
The smaller models that combine to make the ensemble model are referred to as base models. Ensemble methods
often result in considerably higher performance than any of the individual base models.
Two popular families of ensemble methods
BAGGING
Several estimators are built independently on subsets of the data and their predictions are averaged. The combined estimator is typically better than any single base estimator.
Bagging can reduce variance with little to no effect on bias.
ex: Random Forests
BOOSTING
Base estimators are built sequentially. Each subsequent estimator focuses on the weaknesses of the previous
estimators. In essence, several weak models “team up” to produce a powerful ensemble model. 
Boosting can reduce bias without incurring higher variance.
ex: Gradient Boosted Trees, AdaBoost
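To make the boosting idea concrete, here is a tiny illustrative sketch (not part of the original post) using scikit-learn's GradientBoostingRegressor on synthetic data:

# Illustrative only: boosting fits trees sequentially, each new tree
# correcting the residual errors of the ensemble built so far.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

boosted = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=0)
boosted.fit(X_demo, y_demo)
print(boosted.score(X_demo, y_demo))  # training R^2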
Bagging
The ensemble method we will be using today is called bagging, which is short for bootstrap aggregating.
Bagging builds multiple base models on training data resampled with replacement (bootstrap samples). We train k base models on k different samples of the training data. Using random subsets of the data to train the base models promotes diversity among them.
We can use the BaggingRegressor class to form an ensemble of regressors. One such bagging algorithm is the random forest regressor. A random forest regressor is a meta-estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
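As a quick illustration (a sketch, not the post's own code), BaggingRegressor can wrap a plain decision tree; the base estimator is passed as the first positional argument, which keeps the snippet compatible across scikit-learn versions:

# Minimal bagging sketch: 10 decision trees, each fit on a bootstrap
# resample of the training data; their predictions are averaged.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

bagger = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10,
                          bootstrap=True, random_state=0)  # bootstrap=True is the default
bagger.fit(X_demo, y_demo)
print(bagger.predict(X_demo[:3]))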
A random forest regressor uses a splitting criterion to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as a feature selection criterion, and "mae" for the mean absolute error.
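For example, the criterion is set at construction time; note that the strings below match scikit-learn releases from around the time of this post, while newer releases renamed them to "squared_error" and "absolute_error":

from sklearn.ensemble import RandomForestRegressor

rf_mse = RandomForestRegressor(criterion='mse', random_state=0)  # variance reduction
rf_mae = RandomForestRegressor(criterion='mae', random_state=0)  # mean absolute error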
Problem Statement:
To predict the median prices of homes located in the Boston area given other attributes of the house.
Data details
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.   
Tools used:
Pandas
NumPy
Matplotlib
scikit-learn
Import necessary libraries
Import the necessary modules from specific libraries.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
Load the data set
Use scikit-learn's datasets module to load the Boston house-prices data. Check a few records of the dataset.
# #############################################################################
# Load data
boston = datasets.load_boston()
print(boston.data.shape, boston.target.shape)
print(boston.feature_names)
(506, 13) (506,)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
data = pd.DataFrame(boston.data,columns=boston.feature_names)
data = pd.concat([data,pd.Series(boston.target,name='MEDV')],axis=1)
data.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
Select the predictor and target variables
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
Train/test split:
x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(
    X, y, test_size=0.10, random_state=42, shuffle=True)
Training/model fitting:
Fit the model to the training data.
n_estimators = 100

# Fit the regression model to the training data
model = RandomForestRegressor(random_state=0, n_estimators=n_estimators)
model.fit(x_training_set, y_training_set)
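As an optional check (not part of the original post), the fitted forest exposes the standard feature_importances_ attribute, which ranks the predictors by how much they reduce the splitting criterion across all trees:

importances = pd.Series(model.feature_importances_, index=boston.feature_names)
print(importances.sort_values(ascending=False))  # RM and LSTAT typically rank highest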
Model parameters study:
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
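As a quick sanity check (toy numbers, not from the post), computing the formula directly agrees with scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
print(1 - u / v)                           # 0.9775
print(r2_score(y_true, y_pred))            # same value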
from sklearn.metrics import mean_squared_error, r2_score
model_score = model.score(x_training_set,y_training_set)
# Have a look at R^2 to get an idea of the fit;
# explained variance score: 1 is perfect prediction
print('Coefficient of determination R^2 of the prediction:', model_score)
y_predicted = model.predict(x_test_set)
# The mean squared error
print("Mean squared error: %.2f"% mean_squared_error(y_test_set, y_predicted))
# Explained variance score: 1 is perfect prediction
print('Test Variance score: %.2f' % r2_score(y_test_set, y_predicted))
Coefficient of determination R^2 of the prediction : 0.982022598521334
Mean squared error: 7.73
Test Variance score: 0.88
Accuracy report with test data:
Let's visualize the goodness of fit by plotting the predictions against the actual test values, with a diagonal reference line.
# So let's run the model against the test data
fig, ax = plt.subplots()
ax.scatter(y_test_set, y_predicted, edgecolors=(0, 0, 0))
ax.plot([y_test_set.min(), y_test_set.max()], [y_test_set.min(), y_test_set.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()
Conclusion:
We can see that our R^2 score and MSE are both very good, which means we have found a well-fitting model to predict the median price of a house. The metrics could be improved further with some preprocessing and hyperparameter tuning before fitting, as sketched below.
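One possible direction (a sketch, not the author's code) is to wrap scaling and the forest in a pipeline and tune the main hyperparameters with cross-validation; the parameter grid below is only an example:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scale', StandardScaler()),  # trees don't need scaling, but it keeps the pipeline reusable
    ('rf', RandomForestRegressor(random_state=0)),
])
param_grid = {'rf__n_estimators': [100, 300],
              'rf__max_depth': [None, 10, 20]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2')
search.fit(x_training_set, y_training_set)
print(search.best_params_, search.best_score_)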