Machine Learning with Python
Machine Learning Algorithms - RANDOM FOREST
Prof. Shibdas Dutta,
Associate Professor,
DCG DATA CORE SYSTEMS INDIA PVT LTD
Kolkata
Company Confidential: Data-Core Systems, Inc. | datacoresystems.com
Machine Learning Algorithms – Classification Algo- RANDOM FOREST
Introduction - RANDOM FOREST
As the name suggests, a random forest is a “forest” of trees, i.e., decision trees.
A random forest is a tree-based machine learning algorithm that builds multiple decision trees,
each from a random sample of the rows and a random subset of the features.
The random forest then combines the outputs of the individual decision trees to generate
the final output.
Each decision tree is built greedily: at each step it selects the best split point from
the dataset.
We can use random forest for classification as well as regression problems.
If the total number of columns in the training dataset is denoted by p, a common rule of thumb for the number of columns considered at each split is:
sqrt(p) columns for classification
p/3 columns for regression
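As a rough illustration, the sketch below computes the sqrt(p) and p/3 subset sizes for a given p. The helper name `features_per_split` is hypothetical and not part of any library; in scikit-learn the corresponding setting is the `max_features` parameter, whose default may differ from this rule depending on the version.

```python
import math


def features_per_split(p, task):
    """Rule-of-thumb number of columns per split (hypothetical helper)."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError("task must be 'classification' or 'regression'")


print(features_per_split(16, "classification"))  # 4
print(features_per_split(16, "regression"))      # 5
```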
WHEN TO USE RANDOM FOREST ?
When we care more about accuracy than interpretability
When we want better accuracy on unseen validation data
HOW TO USE RANDOM FOREST ?
Select random samples from a given dataset.
Construct a decision tree from each sample and obtain its output.
Perform a vote over the predicted results.
The most voted prediction is selected as the final prediction result (for regression, the predictions are averaged instead).
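The steps above can be sketched by hand as follows. This is a minimal illustration on synthetic data rather than the slides' own code: the dataset, the number of trees, and all variable names are assumptions, and in practice `RandomForestClassifier` does this work internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data stands in for a real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a two-class vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: draw a random sample of the rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: build one decision tree per sample;
    # max_features="sqrt" adds the random choice of columns at each split
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: collect every tree's prediction for every test row
votes = np.stack([tree.predict(X_test) for tree in trees])
# Step 4: the most voted class wins (labels are 0/1, so a mean
# above 0.5 means class 1 has the majority)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("accuracy of the voted prediction:", (forest_pred == y_test).mean())
```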
The following diagram illustrates how it works:
[Figure: Random Forest. The training set is divided into training samples 1 to n, one decision tree is built per sample, and the trees' outputs for the test set are combined by voting into the final prediction.]
STOCK PREDICTION USING RANDOM FOREST-EXAMPLE
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
data = pd.read_csv('data.csv')
data.head()
Here, we use a dataset (available below) that contains seven columns: date, open, high, low, close,
volume, and name of the company.
In this case, Google is the only company in the data.
Open is the price at which the stock first trades when the exchange opens for the period.
Low is the lowest price the stock reaches during the period.
High is the highest price the stock reaches during the period.
Close is the price of the last trade before the exchange closes for the period.
Volume is the number of shares traded during the period.
# Turn 'YYYY-MM-DD' date strings into integers such as 20170103 so the model can use them
data['Date'] = data['Date'].str.replace('-', '', regex=False).astype(int)
Using the above dataset, we now try to predict the ‘Close’ value from the other attributes. Let’s split the data into
training and test datasets.
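# Features: every numeric column except 'Close' (the company name is text and identical in every row, so it is left out)
X_1 = data.drop('Close', axis=1).select_dtypes(include='number')
# Label: the closing price we want to predict
Y_1 = data['Close']
# Using scikit-learn to split data into training and testing sets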
from sklearn.model_selection import train_test_split
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, Y_1, test_size=0.33, random_state=42)
Now, let’s instantiate the model and train it on the training dataset:
rfg = RandomForestRegressor(n_estimators= 10, random_state=42)
rfg.fit(X_train_1,y_train_1)
pd.concat([pd.Series(rfg.predict(X_test_1)), y_test_1.reset_index(
drop=True)], axis=1)
Let’s rank the features by calculating their numerical feature importances:
# Saving feature names for later use
feature_list = list(X_1.columns)
print(feature_list)
# Get numerical feature importances
importances = list(rfg.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out each feature and its importance
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
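The same ranking can be sketched more compactly with a pandas Series. The example below is self-contained and runs on synthetic data with hypothetical column names `f0` to `f4`; it is an illustration only, not the slides' stock dataset. It also shows that impurity-based importances are normalized to sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names f0..f4 (assumption)
X_arr, y = make_regression(
    n_samples=200, n_features=5, n_informative=2, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Pair each importance with its column name, most important first
ranked = pd.Series(model.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)

print(ranked.round(2))
print("sum of importances:", ranked.sum())
```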
rfg.score(X_test_1, y_test_1)
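For a regressor, score() returns the coefficient of determination (R²) rather than a classification accuracy; here it is about 0.99. Because the rows were split at random, each test row has neighbouring dates in the training set, and Open, High and Low are all close to Close, so a high score is expected. We then display the predicted values next to the original values.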
pd.concat([pd.Series(rfg.predict(X_test_1)), y_test_1.reset_index(drop=True)], axis=1)
[Output: table of predicted and original Close values, side by side]
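To make clear what `score()` measures for a regressor, here is a minimal sketch that recomputes R² by hand and compares it with `score()`. It uses synthetic data from `make_regression` as a stand-in (an assumption), so the numbers it prints are not the stock results from the slides.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the stock dataset (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("score():", model.score(X_test, y_test))
print("manual R^2:", r2_manual)
# An error in the target's own units is often easier to interpret
print("mean absolute error:", mean_absolute_error(y_test, pred))
```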
ADVANTAGES OF RANDOM FOREST
It reduces overfitting because the prediction is a majority vote (or an average) over many trees.
Random forest can be used for classification as well as regression.
It works well on a wide range of datasets.
Random forest usually gives better accuracy on unseen data than a single decision tree, and it can remain robust when some data is missing.
Data normalization isn’t required, because the trees split on threshold rules that do not depend on the scale of a feature.
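The point about normalization can be checked with a small experiment: fit the same forest on raw features and on rescaled features, then compare the predictions. The sketch below uses synthetic data (an assumption) and rescales by powers of two so that the floating-point arithmetic stays exact.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Integer-valued synthetic features (assumption: stand-in data)
X = rng.integers(0, 100, size=(200, 3)).astype(float)
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=200)
X_new = rng.integers(0, 100, size=(20, 3)).astype(float)

# Rescale each column by a different positive factor
# (powers of two keep the arithmetic exact)
scale = np.array([1.0, 1024.0, 0.25])

raw = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)
scaled = RandomForestRegressor(n_estimators=10, random_state=42)
scaled.fit(X * scale, y)

# Same random_state, same row order, so only the feature scale differs
same = np.allclose(raw.predict(X_new), scaled.predict(X_new * scale))
print("identical predictions after rescaling:", same)
```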
DISADVANTAGES
Random forest requires much more computational power and memory than a single decision tree, because it builds many trees.
Because it is an ensemble of decision trees, it is harder to interpret than a single tree; feature importances give only an aggregate view of
the significance of each variable.
Random forests become less intuitive as the collection of decision trees grows.
Bagging alone tends to produce trees that are correlated with each other. Because the same greedy algorithm is used to create
each decision tree, the trees are likely to use the same or very similar split points and to give similar predictions,
which limits the variance reduction originally sought. Random feature selection reduces this correlation but does not remove it.
Thank You