Machine Learning with Python- Machine Learning Algorithms.pdf
1. Machine Learning with Python
Machine Learning Algorithms
Prof. Shibdas Dutta,
Associate Professor,
DCG Data-Core Systems India Pvt Ltd
Kolkata
Company Confidential: Data-Core Systems, Inc. | datacoresystems.com
2. Machine Learning Algorithms – Classification
Classification - Introduction
Classification is the process of predicting a class or category from observed values
or given data points. The predicted category can be, for example, “Black” or “White”, or
“spam” or “no spam”.
Mathematically, classification is the task of approximating a mapping function (f) from input
variables (X) to output variables (Y).
It belongs to supervised machine learning, in which the targets are provided
along with the input data set.
An example of a classification problem is spam detection in emails. There are only two
output categories, “spam” and “no spam”; hence this is a binary classification problem.
To implement this classification, we first need to train the classifier.
For this example, emails labelled “spam” and “no spam” are used as the training data.
Once the classifier has been trained, it can be used to classify an unknown email.
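The spam example above can be sketched in a few lines. This is only an illustration under stated assumptions: the four training emails are hypothetical toy data, and the choice of a bag-of-words representation with a multinomial Naïve Bayes model is ours, not something the slides prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (not taken from the slides).
train_emails = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "no spam", "no spam"]

# Turn each email into a vector of word counts (the input variables X).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Train the classifier on the labelled emails (the targets Y).
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Classify an unknown email with the trained model.
unknown_email = ["claim your free money"]
prediction = classifier.predict(vectorizer.transform(unknown_email))
print(prediction[0])
```

With so little data the model is only a toy; a real spam filter would need far more labelled emails.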
3. Types of Learners in Classification
There are two types of learners for classification problems:
Lazy Learners
As the name suggests, lazy learners store the training data and wait until testing data
appears.
Classification is done only after the testing data is received.
They spend less time on training but more time on predicting.
Examples of lazy learners are k-nearest neighbors and case-based reasoning.
Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without waiting
for testing data to appear.
They spend more time on training but less time on predicting.
Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural
Networks (ANN).
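As a rough illustration of the two kinds of learners, the sketch below fits one lazy learner (k-nearest neighbors) and one eager learner (a decision tree) with scikit-learn. The dataset and the parameter values (n_neighbors=5, random_state=0) are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Lazy learner: fit() mainly stores the training data
# (scikit-learn may also build a search index); the neighbor
# search happens when predict() is called.
lazy_learner = KNeighborsClassifier(n_neighbors=5)
lazy_learner.fit(X, y)

# Eager learner: fit() builds the full decision tree up front,
# so predict() only has to walk the tree.
eager_learner = DecisionTreeClassifier(random_state=0)
eager_learner.fit(X, y)

sample = X[:1]  # a single record, kept two-dimensional
lazy_pred = lazy_learner.predict(sample)
eager_pred = eager_learner.predict(sample)
print(lazy_pred, eager_pred)
```

Both models expose the same fit/predict interface; the difference lies in where the work is done.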
4. Building a Classifier in Python
Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The
steps are as follows:
Step1: Importing the necessary Python package
To build a classifier using scikit-learn, we need to import it with the
following script:
import sklearn
Step2: Importing the dataset
After importing the necessary package, we need a dataset to build the classification prediction
model. We can import one from sklearn's datasets or use another one as per our
requirement. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic
Database. We can import it with the help of the following script:
from sklearn.datasets import load_breast_cancer
The following script will load the dataset:
data = load_breast_cancer()
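A quick way to check what was loaded is to look at the keys and array shapes of the returned object. The sketch below is a minimal inspection, assuming a standard scikit-learn installation.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The returned object behaves like a dictionary.
print(data.keys())

# One row per record and one column per feature.
print(data['data'].shape)

# One target value per record.
print(data['target'].shape)
```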
5. We also need to organize the data and it can be done with the help of following scripts:
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
The following command will print the names of the labels, which are ‘malignant’ and ‘benign’ in the case of our database.
print(label_names)
The output of the above command is the names of the labels:
['malignant' 'benign']
These labels are mapped to binary values 0 and 1. Malignant cancer is represented by 0
and Benign cancer is represented by 1.
The feature names and feature values can be seen with the help of the following commands.
Note that the features are the same for every sample, whatever its label:
print(feature_names[0])
The output of the above command is the name of the first feature:
mean radius
Similarly, the name of the second feature can be produced as follows:
print(feature_names[1])
The output of the above command is the name of the second feature:
mean texture
We can print the feature values of the first sample in the dataset with the help of the following command:
print(features[0])
This will give the following output:
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01 1.471e-01
2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01]
Similarly, we can print the feature values of the second sample with the help of the following command:
print(features[1])
This will give the following output:
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02 7.017e-02
1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02]
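As a quick sanity check on how the data is organized, the sketch below inspects the shape of the feature matrix, pairs each feature name with its value for the first sample, and counts the samples per class. The variable names first_sample and counts are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data["data"]      # one row per sample, one column per feature
labels = data["target"]      # 0 = malignant, 1 = benign

n_samples, n_features = features.shape
print(n_samples, "samples,", n_features, "features")

# Pair every feature name with its value for the first sample.
first_sample = {
    str(name): float(value)
    for name, value in zip(data["feature_names"], features[0])
}
print("mean radius: ", first_sample["mean radius"])
print("mean texture:", first_sample["mean texture"])

# Count how many samples carry each label.
counts = {
    str(name): int(count)
    for name, count in zip(data["target_names"], np.bincount(labels))
}
print(counts)
```

Each row of features is one tumor sample and each column is one measurement, so features[0] and features[1] are two different samples, not two different classes.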
Step3: Organizing data into training & testing sets
As we need to test our model on unseen data, we will divide our dataset into two parts:
a training set and a test set. We can use the train_test_split() function of the sklearn
Python package to split the data into sets. The following command will import the
function:
from sklearn.model_selection import train_test_split
Now, the next command will split the data into training & testing data. In this example, we are
taking 40 percent of the data for testing and 60 percent of the data for training:
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.40, random_state=42)
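The split can be checked by looking at the sizes of the resulting sets. The sketch below repeats the split and, as an optional variation that is not part of the steps above, also shows the stratify parameter of train_test_split(), which keeps the class proportions similar in both sets. The s_-prefixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, labels = load_breast_cancer(return_X_y=True)

# Same split as in the text: 40 percent test, 60 percent train.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)
print("train:", train.shape, "test:", test.shape)

# Optional variation (an assumption, not used in the text): stratify on the
# labels so that both sets keep roughly the same share of benign samples.
s_train, s_test, s_train_labels, s_test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42, stratify=labels
)
print("benign share overall:        ", round(labels.mean(), 3))
print("benign share stratified test:", round(s_test_labels.mean(), 3))
```

Fixing random_state makes the split reproducible, so the same samples land in the test set on every run.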
Step4: Building the model
After dividing the data into training and testing sets, we need to build the model. We will be
using the Naïve Bayes algorithm for this purpose. The following command will import
the GaussianNB class:
from sklearn.naive_bayes import GaussianNB
Now, initialize the model as follows:
gnb = GaussianNB()
Next, with the help of the following command, we can train the model:
model = gnb.fit(train, train_labels)
Now, for evaluation purposes, we need to make predictions. This can be done by using the
predict() function as follows:
preds = gnb.predict(test)
print(preds)
This will give the following output:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
0 0 1 1 0 1]
The above series of 0s and 1s in the output are the predicted classes for the test samples:
0 for Malignant and 1 for Benign.
Step5: Finding accuracy
We can find the accuracy of the model built in the previous step by comparing the two
arrays, namely test_labels and preds. We will be using the accuracy_score()
function to determine the accuracy.
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965
The above output shows that the Naïve Bayes classifier is 95.18% accurate.
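Put together, Steps 1 to 5 can be sketched as a single script. It uses the same calls and parameters as the steps above; only the final print formatting is our own, and the exact accuracy may differ slightly between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Step 2: load and organize the data
data = load_breast_cancer()
features = data["data"]
labels = data["target"]

# Step 3: 60 percent training data, 40 percent testing data
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)

# Step 4: build the model and predict the classes of the test samples
gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Step 5: compare the predictions with the true test labels
accuracy = accuracy_score(test_labels, preds)
print(f"Accuracy: {accuracy:.4f}")
```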
11. Classification Evaluation Metrics
The job is not done even after you have finished implementing your machine learning
application or model.
We still have to find out how effective our model is.
There are different evaluation metrics, but we must choose among them carefully because the
choice of metric influences how the performance of a machine learning algorithm is measured and
compared.
The following are some of the important classification evaluation metrics, among which
you can choose based upon your dataset and kind of problem:
Confusion Matrix
It is the easiest way to measure the performance of a classification model where the output
can be of two or more classes.
A confusion matrix is a table with two dimensions, “Actual” and “Predicted”, whose cells
count the “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and
“False Negatives (FN)”, as shown below:
The terms associated with the confusion matrix are explained as follows:
• True Positives (TP): the actual class and the predicted class of the data point are both 1.
• True Negatives (TN): the actual class and the predicted class of the data point are both 0.
• False Positives (FP): the actual class of the data point is 0 and the predicted class is 1.
• False Negatives (FN): the actual class of the data point is 1 and the predicted class is 0.
               Actual 1                Actual 0
Predicted 1    True Positives (TP)     False Positives (FP)
Predicted 0    False Negatives (FN)    True Negatives (TN)
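We can find the confusion matrix with the help of the confusion_matrix() function of sklearn. With the help of the
following script, we can find the confusion matrix of the binary classifier built above:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, preds))
Output
[[ 73   7]
 [  4 144]]
In scikit-learn's output, the rows are the actual classes and the columns are the predicted classes,
both ordered 0 then 1. With class 1 (benign) as the positive class, this gives TN = 73, FP = 7,
FN = 4 and TP = 144.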
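To make the layout of the confusion_matrix() output concrete, here is a minimal sketch on a handful of made-up labels. The arrays y_true and y_pred are invented for illustration and are not taken from the breast cancer data. For a binary problem, ravel() flattens the matrix in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 3 actual negatives (0), 5 actual positives (1).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, ordered 0 then 1.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Flattening row by row yields TN, FP, FN, TP for a binary problem.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here one actual negative is predicted as positive (a false positive) and one actual positive is predicted as negative (a false negative).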
Accuracy
It may be defined as the fraction of all predictions that our ML model got right. We can easily
calculate it from the confusion matrix with the help of the following formula:
Accuracy = (TP+TN) / (TP+FP+FN+TN)
For the binary classifier built above, TP + TN = 144+73 = 217 and TP+FP+FN+TN =
144+7+4+73 = 228.
Hence, Accuracy = 217/228 = 0.951754385965, which is the same as the value we
calculated after creating our binary classifier.
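Precision
Precision, used in document retrieval, may be defined as the fraction of the documents
returned by our ML model that are correct, i.e. the fraction of predicted positives that are
actually positive. We can easily calculate it from the confusion matrix with the help of the
following formula:
Precision = TP / (TP+FP)
For the binary classifier built above, reading scikit-learn's matrix (rows actual, columns predicted) with class 1 as the positive class, TP = 144 and TP+FP = 144+7 = 151. Hence, Precision = 144/151 = 0.95364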
Recall or Sensitivity
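Recall may be defined as the fraction of the actual positives that our ML model returns as
positive. We can easily calculate it from the confusion matrix with the help of the following formula:
Recall = TP / (TP + FN)
For the binary classifier built above, reading scikit-learn's matrix (rows actual, columns predicted) with class 1 as the positive class, TP = 144 and TP+FN = 144+4 = 148. Hence, Recall = 144/148 = 0.97297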
Specificity
Specificity, in contrast to recall, may be defined as the fraction of the actual negatives that
our ML model returns as negative. We can easily calculate it from the confusion matrix with
the help of the following formula:
Specificity = TN / (TN+FP)
For the binary classifier built above, reading scikit-learn's matrix (rows actual, columns predicted) with class 1 as the positive class, TN = 73 and TN+FP = 73+7 = 80. Hence, Specificity = 73/80 = 0.9125
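The four formulas can be checked against scikit-learn's own metric functions. The sketch below is hedged in two ways: it assumes the matrix [[73 7] [4 144]] is read with rows as actual classes, columns as predicted classes and class 1 as positive, and it rebuilds label arrays from those four counts instead of re-running the classifier. scikit-learn has no dedicated specificity function, so recall_score() with pos_label=0 is used for it here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts read from the matrix [[73 7] [4 144]], assuming rows = actual,
# columns = predicted and class 1 as the positive class.
tn, fp, fn, tp = 73, 7, 4, 144

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# Rebuild label arrays that produce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Cross-check the hand-computed values against scikit-learn.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0

print(f"accuracy={accuracy:.5f} precision={precision:.5f}")
print(f"recall={recall:.5f} specificity={specificity:.5f}")
```

Which class counts as positive matters: treating malignant (class 0) as the positive class instead would swap the roles of recall and specificity.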
15. Applications
Some of the most important applications of classification algorithms are as follows:
• Speech Recognition
• Handwriting Recognition
• Biometric Identification
• Document Classification