1. Machine Learning with Python
Machine Learning Algorithms - Decision Tree
Prof. Shibdas Dutta,
Associate Professor,
DCG Data-Core Systems India Pvt Ltd
Kolkata
2. Machine Learning Algorithms – Classification Algorithm: Decision Tree
Introduction - Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied
across many areas. Decision trees can be constructed by an algorithmic approach that
splits the dataset in different ways based on different conditions.
Decision trees are among the most powerful algorithms that fall under the category of
supervised algorithms.
They can be used for both classification and regression tasks.
The two main entities of a tree are decision nodes, where the data is split, and leaves,
where we get the outcome.
An example of a binary tree for predicting whether a person is fit or unfit, given
information such as age, eating habits and exercise habits, is shown below:
[Figure: example binary decision tree for the fit/unfit prediction]
4. In the above decision tree, the questions are decision nodes and the final outcomes are leaves.
We have the following two types of decision trees:
Classification decision trees: In this kind of decision tree, the decision variable is
categorical. The decision tree above is an example of a classification decision tree.
Regression decision trees: In this kind of decision tree, the decision variable is
continuous.
5. Implementing Decision Tree Algorithm
Gini Index
It is the name of the cost function that is used to evaluate binary splits in the dataset and
works with a categorical target variable such as “Success” or “Failure”.
The lower the value of the Gini index, the higher the homogeneity: a perfect Gini index value
is 0 (a pure node) and the worst is 0.5 (for a 2-class problem). The Gini index for a split can
be calculated with the help of the following steps:
First, calculate the Gini index for each sub-node using the formula Gini = 1 - (p^2 + q^2),
where p^2 + q^2 is the sum of the squared probabilities of success and failure.
Next, calculate the Gini index for the split as the weighted Gini score of each node of that split.
The Classification and Regression Tree (CART) algorithm uses the Gini method to generate
binary splits.
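As an aside not in the original slides, the steps above can be sketched as a plain Python
function; the name gini_index and the row layout (class label as the last value in each row)
are assumptions for this illustration:

def gini_index(groups, classes):
    # Weighted Gini index for a candidate binary split.
    # groups: list of two lists of rows; the class label is the last value in each row.
    # classes: list of all class label values, e.g. [0, 1].
    n_instances = sum(len(group) for group in groups)
    weighted_gini = 0.0
    for group in groups:
        size = len(group)
        if size == 0:  # avoid division by zero for an empty group
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p  # p^2 + q^2 from the slide
        weighted_gini += (1.0 - score) * (size / n_instances)  # 1 - (p^2 + q^2), weighted
    return weighted_gini

# A perfectly mixed split scores 0.5; a pure split scores 0.0:
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))  # 0.5
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))  # 0.0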
6. Split Creation
A split basically consists of an attribute in the dataset and a value for that attribute. We can
create a split in the dataset with the help of the following three parts:
Part 1: Calculating the Gini score: We have just discussed this part in the previous section.
Part 2: Splitting a dataset: This may be defined as separating a dataset into two lists of rows,
given the index of an attribute and a split value for that attribute. After getting the two
groups - right and left - from the dataset, we can calculate the value of the split using the
Gini score calculated in the first part. The split value decides in which group a record will reside.
Part 3: Evaluating all splits: The next part, after finding the Gini score and splitting the
dataset, is the evaluation of all splits. For this purpose, we first check every value associated
with each attribute as a candidate split. Then we find the best possible split by evaluating the
cost of each split. The best split will be used as a node in the decision tree (see the sketch below).
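Continuing the illustrative sketch above (again, not from the original slides), Parts 2 and 3 can
be written as two small functions; the names test_split and get_best_split are assumptions:

def test_split(index, value, dataset):
    # Part 2: separate rows into left/right groups on attribute `index` at `value`.
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def get_best_split(dataset):
    # Part 3: try every value of every attribute and keep the lowest-Gini split.
    class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = None, None, float("inf"), None
    for index in range(len(dataset[0]) - 1):  # every attribute except the label
        for row in dataset:  # every observed value is a candidate split point
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
    return {'index': best_index, 'value': best_value, 'groups': best_groups}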
7. Building a Tree
As we know, a tree has a root node and terminal nodes. After creating the root node, we can
build the tree in the following two parts:
Part 1: Terminal node creation
While creating the terminal nodes of a decision tree, one important point is deciding when to
stop growing the tree, i.e. when to stop creating further nodes. This can be done using two
criteria, namely maximum tree depth and minimum node records, as follows:
Maximum tree depth: As the name suggests, this is the maximum number of levels in the tree
below the root node. We must stop adding nodes once the tree has reached this maximum depth.
Minimum node records: This may be defined as the minimum number of training patterns that
a given node is responsible for. We must stop splitting once a node contains this minimum
number of records or fewer.
A terminal node is used to make the final prediction (see the sketch below).
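As a continuation of the sketch (an assumption, not code from the slides), a terminal node can
simply predict the most common class among the rows it is responsible for:

def to_terminal(group):
    # Create a terminal node: predict the most common class label in the group.
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)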
8. Part 2: Recursive Splitting
Now that we understand when to create terminal nodes, we can start building our tree.
Recursive splitting is a method for building the tree. In this method, once a node is created, we
can create its child nodes (nodes added to an existing node) recursively on each group of data
generated by splitting the dataset, by calling the same function again and again.
Prediction
After building a decision tree, we need to make predictions with it. Basically, prediction
involves navigating the decision tree with a specific row of data.
We can make predictions with the help of a recursive function, as above: the same prediction
routine is called again with the left or the right child node (see the sketch below).
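A minimal sketch tying the pieces together, reusing the helpers above; split, build_tree and
predict are assumed names, with max_depth and min_size implementing the two stopping criteria
from the previous slide:

def split(node, max_depth, min_size, depth):
    # Recursively grow child nodes, creating terminals at the stopping criteria.
    left, right = node['groups']
    del node['groups']
    if not left or not right:  # no real split happened: one terminal for all rows
        node['left'] = node['right'] = to_terminal(left + right)
        return
    if depth >= max_depth:  # maximum tree depth reached
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    for side, group in (('left', left), ('right', right)):
        if len(group) <= min_size:  # minimum node records reached
            node[side] = to_terminal(group)
        else:
            node[side] = get_best_split(group)
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
    root = get_best_split(train)
    split(root, max_depth, min_size, 1)
    return root

def predict(node, row):
    # Navigate the tree with a row of data, recursing into the left or right child.
    side = 'left' if row[node['index']] < node['value'] else 'right'
    child = node[side]
    return predict(child, row) if isinstance(child, dict) else child

For example, tree = build_tree(train, max_depth=3, min_size=1) grows a tree of at most three
levels, and predict(tree, row) then walks it for a new row.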
9. Assumptions
The following are some of the assumptions we make while creating a decision tree:
While preparing the decision tree, the whole training set is treated as the root node.
The decision tree classifier prefers feature values to be categorical. If you want to use
continuous values, they must be discretized prior to model building (see the sketch below).
Based on the attributes' values, the records are recursively distributed.
A statistical approach is used to place attributes at any node position, i.e. as the root node
or an internal node.
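For instance (an illustration, not from the slides), a continuous feature such as age can be
discretized into categorical bins with pandas; the bin edges and labels below are arbitrary
assumptions:

import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63])
# Discretize the continuous 'age' values into three categorical bins
age_bins = pd.cut(ages, bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
print(age_bins.tolist())  # ['young', 'middle', 'middle', 'senior', 'senior']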
10. Implementation in Python
Example
In the following example, we are going to implement a Decision Tree classifier on the Pima
Indians Diabetes dataset.
First, start by importing the necessary Python packages:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Next, load the Pima Indians Diabetes dataset from its CSV file as follows:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
11. pregnant glucose bp skin insulin bmi pedigree age label
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
Now, split the dataset into features and target variable as follows:
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
12. Next, we will divide the data into training and test sets. The following code will split the
dataset into 70% training data and 30% test data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=1)
Next, train the model with the help of the DecisionTreeClassifier class of sklearn as follows:
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
Finally, we need to make predictions. This can be done with the help of the following script:
y_pred = clf.predict(X_test)
13. Next, we can get the accuracy score, confusion matrix and classification report as follows:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
15. Visualizing Decision Tree
The above decision tree can be visualized with the help of the following code:
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())
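Note that pydotplus requires a local Graphviz installation. If Graphviz is not available, a
lighter alternative (an assumption here, requiring scikit-learn 0.21 or newer, rather than code
from the original slides) is the built-in plot_tree helper, reusing clf and feature_cols from
above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# Draw the trained classifier directly with matplotlib, no Graphviz needed
plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
          filled=True, rounded=True)
plt.savefig('Pima_diabetes_Tree.png')
plt.show()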