1. Machine Learning with Python
Machine Learning Algorithms - Naïve Bayes
Prof. Shibdas Dutta,
Associate Professor,
DCG Data-Core Systems India Pvt Ltd,
Kolkata
2. Machine Learning Algorithms - Classification Algorithms - Naïve Bayes
Naïve Bayes - Introduction
Naïve Bayes is a classification technique based on applying Bayes' theorem with a strong
assumption that all the predictors are independent of each other.
In simple words, the assumption is that the presence of a feature in a class is independent
of the presence of any other feature in the same class.
For example, a phone may be considered smart if it has a touch screen, internet access,
a good camera, etc.
Though these features may depend on each other, each contributes independently to the
probability that the phone is a smartphone.
In Bayesian classification, the main interest is in the posterior probabilities, i.e. the
probability of a label given some observed features, P(L | features).
3. With the help of Bayes' theorem, we can express this in quantitative form as follows:
P(L | features) = P(L) P(features | L) / P(features)
Here, P(L | features) is the posterior probability of the class,
P(L) is the prior probability of the class,
P(features | L) is the likelihood, i.e. the probability of the predictors given the class, and
P(features) is the prior probability of the predictors (the evidence).
So let us first recall Bayes' theorem itself.
Bayes' theorem gives the probability of an event occurring given that another event has already
occurred:
P(A | B) = P(B | A) P(A) / P(B)
Here B is the evidence and A is the hypothesis; P(A) is known as the prior, P(A | B) is the
posterior, and P(B | A) is the likelihood.
4. The name "naive" is used because the model assumes that the presence of one feature does not
affect (influence or change the value of) any other feature.
The most important assumption that Naive Bayes makes is that all the features are
independent of each other.
Because of this simplicity it is less prone to overfitting, and it applies Bayes' theorem to
predict labels for unseen data.
5. Building a model using Naïve Bayes in Python
The Python library Scikit-learn is the most useful library for building a Naïve Bayes model
in Python. Scikit-learn provides the following three types of Naïve Bayes model:
Gaussian Naïve Bayes
It is the simplest Naïve Bayes classifier, with the assumption that the data for each label is
drawn from a simple Gaussian distribution.
Multinomial Naïve Bayes
Another useful Naïve Bayes classifier is Multinomial Naïve Bayes, in which the features are
assumed to be drawn from a simple multinomial distribution. This kind of Naïve Bayes is most
appropriate for features that represent discrete counts.
Bernoulli Naïve Bayes
Another important model is Bernoulli Naïve Bayes, in which features are assumed to be
binary (0s and 1s). Text classification with the 'bag of words' model can be an application
of Bernoulli Naïve Bayes.
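Below is a minimal, illustrative sketch (not part of the original slides) showing how each of
the three variants is instantiated in Scikit-learn; the toy arrays are made-up placeholders
chosen only to match the expected input type of each variant.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_continuous = np.array([[5.9, 160], [5.5, 120], [6.1, 175]])  # real-valued features
X_counts = np.array([[3, 0, 1], [0, 2, 4], [1, 1, 0]])         # discrete counts (e.g. word counts)
X_binary = (X_counts > 0).astype(int)                          # binary presence/absence features
y = [1, 0, 1]

GaussianNB().fit(X_continuous, y)   # Gaussian: continuous features
MultinomialNB().fit(X_counts, y)    # Multinomial: count features
BernoulliNB().fit(X_binary, y)      # Bernoulli: binary features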
6. NAIVE BAYES IMPLEMENTATION (EXAMPLE)
Classify whether a given person is male or female based on measured features. The features are
height, weight, and foot size.
We start by defining a dataframe that holds the data provided above.
7. import pandas as pd
import numpy as np
# Create an empty dataframe
data = pd.DataFrame()
# Create our target variable
data['Gender'] = ['male','male','male','male','female','female','female','female']
# Create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]
Next, create another dataframe for a single person with height 6 feet, weight 130 lbs, and foot
size 8 inches. Using Naive Bayes, we will try to determine whether this person is male or female.
# Create an empty dataframe
person = pd.DataFrame()
# Create some feature values for this single row
person['Height'] = [6]
person['Weight'] = [130]
person['Foot_Size'] = [8]
8. Calculating the total number of males and females and their probabilities, i.e. the priors:
# Number of males
n_male = data['Gender'][data['Gender'] == 'male'].count()
# Number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()
# Total rows
total_ppl = data['Gender'].count()
# Number of males divided by the total rows
P_male = n_male/total_ppl
# Number of females divided by the total rows
P_female = n_female/total_ppl
9. Calculating the mean and variance of each feature (height, weight, and foot size) for males
and females.
# Group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()
# Group the data by gender and calculate the variance of each feature
data_variance = data.groupby('Gender').var()
FORMULA
posterior(male) = P(male) * P(height|male) * P(weight|male) * P(foot size|male) / evidence
posterior(female) = P(female) * P(height|female) * P(weight|female) * P(foot size|female) / evidence
evidence = P(male) * P(height|male) * P(weight|male) * P(foot size|male)
         + P(female) * P(height|female) * P(weight|female) * P(foot size|female)
The evidence may be ignored, since it is the same positive constant for both classes. (Normal
distributions are always positive.)
11. Calculation of P(height | male)
Each likelihood is evaluated with the Gaussian probability density function:
p(x | y) = 1 / sqrt(2 * pi * variance_y) * exp(-(x - mean_y)^2 / (2 * variance_y))
mean of male height = 5.855
variance (square of the S.D.) of male height = 3.5033e-02
and x, the given height, is 6 feet.
Substituting these values into the equation gives P(height | male) = 1.5789. Note that this is a
probability density, so it may exceed 1.
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):
    # Evaluate the Gaussian probability density function at x
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    return p
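As an illustrative continuation (a sketch, not on the original slide), the function above can be
combined with the priors and the grouped means and variances computed earlier to obtain the
unnormalized posteriors whose values are reported on the next slide:

# Unnormalized posterior for each class: prior times the product of the likelihoods
numerator_male = (P_male
    * p_x_given_y(person['Height'][0], data_means['Height']['male'], data_variance['Height']['male'])
    * p_x_given_y(person['Weight'][0], data_means['Weight']['male'], data_variance['Weight']['male'])
    * p_x_given_y(person['Foot_Size'][0], data_means['Foot_Size']['male'], data_variance['Foot_Size']['male']))
numerator_female = (P_female
    * p_x_given_y(person['Height'][0], data_means['Height']['female'], data_variance['Height']['female'])
    * p_x_given_y(person['Weight'][0], data_means['Weight']['female'], data_variance['Weight']['female'])
    * p_x_given_y(person['Foot_Size'][0], data_means['Foot_Size']['female'], data_variance['Foot_Size']['female']))
# numerator_female > numerator_male here, so the model classifies the person as female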
12. Similarly,
P(weight|male) = 5.9881e-06
P(foot size|male) = 1.3112e-3
P(height|female) = 2.2346e-1
P(weight|female) = 1.6789e-2
P(foot size|female) = 2.8669e-1
Posterior (male)*evidence = P(male)*P(height|male)*P(weight|male)*P(foot size|male) = 6.1984e-09
Posterior (female)*evidence = P(female)*P(height|female)*P(weight|female)*P(foot size|female)= 5.3778e-04
CONCLUSION
Since Posterior (female)*evidence > Posterior (male)*evidence, the sample is female.
13. NAIVE BAYES USING SCIKIT-LEARN
import pandas as pd
import numpy as np
# Create an empty dataframe
data = pd.DataFrame()
# Create our target variable
data['Gender'] = [1,1,1,1,0,0,0,0] #1 is male
# Create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]
# View the data
data
Though we have a very small dataset, we divide it into train and test sets so that the same
workflow can be reused for other model predictions. We import GaussianNB from sklearn and train
the model on our dataset.
14. X = data.drop(['Gender'],axis=1)
y = data.Gender
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# making predictions on the testing set
y_pred = gnb.predict(X_test)
15. import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
# Evaluate on the full dataset and visualize the confusion matrix
print(classification_report(y, gnb.predict(X)))
cm = confusion_matrix(y, gnb.predict(X))
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()
16. Now, our model is ready. Let's use this model to predict on new data.
# Create an empty dataframe for a new observation
data1 = pd.DataFrame()
# Create our feature variables
data1['Height'] = [6]
data1['Weight'] = [130]
data1['Foot_Size'] = [8]
y_pred = gnb.predict(data1)
if y_pred[0] == 0:
    print("female")
else:
    print("male")
Output: female
17. EXAMPLE 2 - HOME TASK (ALSO INCLUDED IN THE PROJECT)
AGE         INCOME   STUDENT   CREDIT      BUY COMPUTER
Youth       High     No        Fair        No
Youth       High     No        Excellent   No
Middle Age  High     No        Fair        Yes
Senior      Medium   No        Fair        Yes
Senior      Low      Yes       Fair        Yes
Senior      Low      Yes       Excellent   No
Middle Age  Low      Yes       Excellent   Yes
Youth       Medium   No        Fair        No
Youth       Low      Yes       Fair        Yes
Senior      Medium   Yes       Fair        Yes
Youth       Medium   Yes       Excellent   Yes
Middle Age  Medium   No        Excellent   Yes
Middle Age  High     Yes       Fair        Yes
Senior      Medium   No        Excellent   No
The table above contains a dataset with the attributes age, income, student, credit rating, and
whether a computer was bought. From this dataset, we need to determine whether a youth student
with medium income and a fair credit rating buys a computer or not, i.e. B = (Youth, Medium,
Yes, Fair). A possible starting point in Python is sketched below.
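One hedged starting point (a sketch, not the required solution, and assuming scikit-learn 0.22
or later, which provides CategoricalNB): integer-encode each categorical column and fit a
categorical Naïve Bayes model, then encode the query B with the same mappings.

import pandas as pd
from sklearn.naive_bayes import CategoricalNB

df = pd.DataFrame({
    'Age':     ['Youth', 'Youth', 'Middle Age', 'Senior', 'Senior', 'Senior', 'Middle Age',
                'Youth', 'Youth', 'Senior', 'Youth', 'Middle Age', 'Middle Age', 'Senior'],
    'Income':  ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low',
                'Medium', 'Medium', 'Medium', 'High', 'Medium'],
    'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
                'No', 'Yes', 'No'],
    'Credit':  ['Fair', 'Excellent', 'Fair', 'Fair', 'Fair', 'Excellent', 'Excellent',
                'Fair', 'Fair', 'Fair', 'Excellent', 'Excellent', 'Fair', 'Excellent'],
    'Buys':    ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
                'Yes', 'Yes', 'No'],
})

# Remember each column's category order, then integer-encode the whole table
cats = {c: df[c].astype('category').cat.categories for c in df.columns}
codes = df.apply(lambda col: col.astype('category').cat.codes)

model = CategoricalNB()
model.fit(codes.drop('Buys', axis=1), codes['Buys'])

# Encode B = (Youth, Medium, Yes, Fair) with the same mappings and predict
query = [[cats['Age'].get_loc('Youth'), cats['Income'].get_loc('Medium'),
          cats['Student'].get_loc('Yes'), cats['Credit'].get_loc('Fair')]]
print(cats['Buys'][model.predict(query)[0]])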
18. Pros & Cons
Pros
The followings are some pros of using Naïve Bayes classifiers:
Naïve Bayes classification is easy to implement and fast.
It will converge faster than discriminative models like logistic regression.
It requires less training data.
It is highly scalable in nature, or they scale linearly with the number of predictors and data points.
It can make probabilistic predictions and can handle continuous as well as discrete data.
Naïve Bayes classification algorithm can be used for binary as well as multi-class classification problems
both.
Cons
The followings are some cons of using Naïve Bayes classifiers:
One of the most important cons of Naïve Bayes classification is its strong feature independence because in
real life it is almost impossible to have a set of features which are completely independent of each other.
Another issue with Naïve Bayes classification is its ‘zero frequency’ which means that if a categorial variable
has a category but not being observed in training data set, then Naïve Bayes model will assign a zero
probability to it and it will be unable to make a prediction.
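A small illustrative sketch (not from the original slides) of the zero-frequency problem and its
standard mitigation, using the alpha parameter of Scikit-learn's MultinomialNB, which controls
additive (Laplace/Lidstone) smoothing:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# The second feature is never seen together with class 0 in training
X = np.array([[2, 0], [3, 0], [0, 4]])
y = [0, 0, 1]

# With virtually no smoothing, class 0 gets ~zero probability whenever that feature appears
print(MultinomialNB(alpha=1e-10).fit(X, y).predict_proba([[1, 1]]))

# With Laplace smoothing (alpha=1.0, the default), unseen feature/class combinations keep a
# small nonzero probability, so the prediction remains well-behaved
print(MultinomialNB(alpha=1.0).fit(X, y).predict_proba([[1, 1]]))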
19. Applications of Naïve Bayes classification
The following are some common applications of Naïve Bayes classification:
Real-time prediction: Due to its ease of implementation and fast computation, it can be used
for prediction in real time.
Multi-class prediction: The Naïve Bayes classification algorithm can be used to predict the
posterior probability of multiple classes of the target variable.
Text classification: Thanks to its multi-class prediction capability, Naïve Bayes is well suited
for text classification, which is why it is often used for problems like spam filtering and
sentiment analysis (a small sketch follows below).
Recommendation systems: Together with algorithms like collaborative filtering, Naïve Bayes can
be used to build a recommendation system that filters unseen information and predicts whether a
user would like a given resource or not.
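A minimal text-classification sketch (illustrative only; the toy messages and labels below are
made up for this example), pairing a bag-of-words vectorizer with Multinomial Naïve Bayes:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win money now", "lowest price guarantee", "meeting at noon", "lunch tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feeding a Multinomial Naïve Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free money"]))  # expected: ['spam']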