Machine learning module 2
Module 2
Machine Learning Activities
Understand the type of data in the given input data set.
Explore the data to understand the nature and quality.
Explore the relationships amongst the data elements
Find potential issues in data.
Do the necessary remediations (impute missing data values, etc.).
Activity cont...
Apply pre-processing steps.
The input data is first divided into two parts: the training data and the testing data.
Consider different models or learning algorithms for selection.
For a supervised learning problem, train the model on the training data and apply it to unknown data.
Activity cont...
For an unsupervised learning problem, apply the chosen model directly to the input data.
Basic Data Types
Data can be categorized into four basic types from a Machine Learning perspective: numerical data, categorical data, time series data, and text.
Numerical and Categorical Data
Numerical Data
Numerical data is any data where data points are exact numbers. Statisticians may also call numerical data quantitative data.
Exploring Numerical Data
There exist two major plot methods to explore numerical data:
•Box plot
•Histogram
Exploring Cont...
Understanding Central tendency:
To understand the nature of data (numeric variables), we need to apply measures of central tendency.
Mean: the sum of all data values divided by the count of data elements.
Median: the middle value; the median splits the dataset into two halves.
Mode: the most frequently occurring value in the data set.
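These three measures can be computed directly with Python's standard statistics module. A minimal sketch with hypothetical values:

```python
import statistics

data = [4, 7, 2, 7, 9, 7, 3, 5]  # hypothetical numeric variable

mean = statistics.mean(data)      # sum of all values / count of values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value
```

Here the sorted data is [2, 3, 4, 5, 7, 7, 7, 9], so the median is the average of the two middle values.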
Exploring Cont...
Measuring the Dispersion of Data (Range, Quartiles, Interquartile
Range):
Let x1, x2, ..., xN be a set of observations of some numeric attribute, X.
The range of the set is the difference between the largest (max()) and the smallest (min()) values.
Quartiles are points taken at regular intervals of the data distribution, dividing it into essentially equal-size consecutive sets.
Interquartile range: The distance between the first and third quartiles
is a measure of spread that gives the range covered by the middle
half of the data.
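A sketch of these dispersion measures with hypothetical observations (note that statistics.quantiles uses the exclusive method by default, so other tools may report slightly different quartile values):

```python
import statistics

x = [1, 3, 5, 7, 9, 11, 13, 15]  # hypothetical observations of attribute X

value_range = max(x) - min(x)              # range = max() - min()
q1, q2, q3 = statistics.quantiles(x, n=4)  # quartiles: n=4 gives three cut points
iqr = q3 - q1                              # interquartile range: spread of the middle half
```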
Variance and Standard Deviation
These are measures of data dispersion; they indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
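A sketch using the population variance and standard deviation from the standard library (hypothetical values):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations; mean is 5

variance = statistics.pvariance(data)  # mean squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
```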
Categorical Data
Categorical data represents characteristics, such as a hockey player’s position, team, or hometown.
Time Series Data
Time series data is a
sequence of numbers
collected at regular
intervals over some
period of time.
Text Data
Text data is basically just words.
Relationship between variables
Scatter plots and two-way cross tabulation can be used effectively.
Scatter plot: a graph in which the values of two variables are plotted along two axes, with the pattern of the resulting points revealing any correlation present.
Relationship Cont...
Two-way cross tabulation: also known as a cross-tab, it is used to understand the relationship between two categorical attributes in a concise way.
It has a matrix format that presents a summarized view of the bivariate frequency distribution. Much like a scatter plot, it helps to understand how the data values of one attribute change with the data values of another attribute.
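A cross-tab can be sketched with a plain Counter over hypothetical (team, position) pairs, echoing the hockey example above:

```python
from collections import Counter

# Hypothetical categorical attributes: a player's team and position
records = [("Red", "Forward"), ("Red", "Goalie"), ("Blue", "Forward"),
           ("Red", "Forward"), ("Blue", "Goalie"), ("Blue", "Forward")]

crosstab = Counter(records)  # cell counts of the two-way table

# Print the matrix: one row per team, one column per position
teams = sorted({t for t, _ in records})
positions = sorted({p for _, p in records})
for team in teams:
    row = [crosstab[(team, pos)] for pos in positions]
    print(team, row)
```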
Data Issues
Day by day we are generating tremendous amount of
data. Dealing with big data is much more complicated.
Real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources.
Issues cont...
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and warehouses.
Main reasons for inaccurate data
• Having incorrect attribute values.
• The data collection instruments used may be faulty.
• There may have been human or computer errors
occurring at data entry.
Issues cont...
• Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit
personal information.
• Errors in data transmission can also occur.
• Inconsistent formats for input fields.
Remedies
Handling Outliers: Outliers are data elements with abnormally high or low values which may impact prediction accuracy.
•Remove outliers: if there are only a few outlying records, the simplest remedy is to remove them.
•Imputation: impute the values with the mean, median, or mode.
•Capping: for values that lie outside the 1.5 × IQR limits, we can cap them by replacing observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.
Remedies Cont...
Handling Missing Values:
• Eliminate records that have missing data values.
• Impute missing values using the mean/median/mode.
• Fill the missing value manually.
• Use a global constant to fill the missing value.
• Use the most probable value to fill in the missing value.
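Median imputation, for example, can be sketched as follows (hypothetical ages, with None marking missing values):

```python
import statistics

ages = [25, None, 31, 29, None, 40, 35]  # hypothetical data with gaps

# Impute missing values with the median of the observed values
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]
```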
Major tasks in pre-processing
Data cleaning: routines work to “clean” the data by filling
in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
Data Integration: integrating data from different sources.
Pre Processing Cont...
Data Transformation: It is the process of converting data
from one format to another.
Data reduction: obtains a reduced representation of the
data set that is much smaller in volume, yet produces the
same (or almost the same) analytical results. Data
reduction strategies include dimensionality reduction and
numerosity reduction.
Model
Abstraction is a significant step, as it represents raw input data in a summarized and structured format such that a meaningful insight is obtained from the data. This structured representation of raw input data as a meaningful pattern is called a Model.
Model Selection
Models for supervised learning try to predict certain values
using the input data set.
Models for unsupervised learning are used to describe a data set or gain insight from it.
Model Training
The process of selecting a model and fitting it to a data set is called model training.
Bias: If the outcome of a model is systematically incorrect,
the learning is said to have a bias.
Model Representation &
Interpretability
The fitness of the target function approximated by a learning algorithm determines how correctly the model can classify data it has never seen.
Underfitting:
If the target function is kept too simple, it may not be able to
capture the essential nuances and represent the underlying
data well. This is known as underfitting.
Model Representation &
Interpretability Cont...
Overfitting:
The model has been designed in such a way that it emulates the training data too closely. In such a case, any specific nuance in the training data, like noise or outliers, gets embedded in the model, which adversely impacts the performance of the model on the test data.
Model Representation &
Interpretability Cont...
Bias and Variance:(Supervised learning)
Errors due to bias arise from simplifying assumptions made
by the model whereas errors due to variance occur from
over-aligning the model with the training data sets.
Training a Model
Model evaluation aims to estimate the generalization
accuracy of a model on future data.
There exist two methods for evaluating a model's performance:
• Holdout
• Cross-validation
Training a Model
Holdout: It tests a model on different data than it was
trained on. In this method the data set is divided into three
subsets:
• Training set: is a subset of the dataset used to build
predictive models.
• Validation set: is a subset of the dataset used to assess
the performance of the model built in the training phase.
Training a Model cont...
• Test set (unseen data): is a subset of the dataset used to assess the likely future performance of a model.
The holdout approach is useful because of its speed,
simplicity, and flexibility.
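The three-way holdout split above can be sketched in plain Python, using integer indices as stand-in records and an assumed 60/20/20 split:

```python
import random

random.seed(42)            # fixed seed so the split is reproducible
data = list(range(100))    # hypothetical record indices
random.shuffle(data)       # shuffle before splitting to avoid ordering bias

# Assumed proportions: 60% training, 20% validation, 20% test
train, validation, test = data[:60], data[60:80], data[80:]
```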
Training a Model cont...
Cross-Validation: It partitions the original observation
dataset into a training set, used to train the model, and an
independent set used to evaluate the analysis.
The most common cross-validation technique is k-fold cross-validation, where the original dataset is partitioned into k equal-size subsamples, called folds.
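The fold construction can be sketched in plain Python (hypothetical ten-element data set, k = 5):

```python
data = list(range(10))  # hypothetical record indices
k = 5

# Partition the data into k equal-size folds
folds = [data[i::k] for i in range(k)]

# Each fold serves once as the held-out set; the rest form the training set
splits = []
for i in range(k):
    held_out = folds[i]
    training = [x for j in range(k) if j != i for x in folds[j]]
    splits.append((training, held_out))
```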
Training a Model cont...
Bootstrap sampling: It is a popular way to identify training and test data sets from the input data set. It uses the technique of Simple Random Sampling with Replacement (SRSWR). Bootstrapping randomly picks data instances from the input data set, with the possibility of the same data instance being picked multiple times.
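SRSWR maps directly onto the standard library's random.choices; a sketch with hypothetical instances, where instances never drawn (out-of-bag) are kept aside for testing:

```python
import random

random.seed(0)  # fixed seed for reproducibility
data = ["a", "b", "c", "d", "e"]  # hypothetical data instances

# Sampling with replacement: the same instance may appear more than once
bootstrap_train = random.choices(data, k=len(data))

# Out-of-bag instances (never picked) can serve as a test set
out_of_bag = [x for x in data if x not in bootstrap_train]
```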
Evaluating performance of a model.
Classification Accuracy: Accuracy is a common evaluation
metric for classification problems. It's the number of correct
predictions made as a ratio of all predictions made.
Cross-validation techniques can also be used to compare the performance of different machine learning models on the same data set, and can help in selecting the values of a model's parameters that maximize the accuracy of the model, also known as parameter tuning.
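Accuracy as described above, computed for hypothetical label vectors:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

# Accuracy = correct predictions / all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
```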
Evaluating performance of a model.
Confusion Matrix: provides a more detailed breakdown of correct and incorrect classifications for each class.
Logarithmic Loss (log loss): measures the performance of a classification model where the prediction input is a probability value between 0 and 1.
Area Under Curve (AUC): a performance metric measuring the ability of a binary classifier to discriminate between positive and negative classes.
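A binary confusion matrix can be tallied directly; a sketch with hypothetical label vectors, where rows are actual classes and columns are predicted classes:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

# Row 0: actual negatives, row 1: actual positives
confusion = [[tn, fp],
             [fn, tp]]
```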
Evaluating performance of a model.
F-Measure: is a measure of a test's accuracy that
considers both the precision and recall of the test to
compute the score.
Precision is the number of correct positive results divided by the total number of predicted positive observations.
Recall is the number of correct positive results divided by the number of all relevant samples (actual positives).
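Precision, recall, and the F-measure computed from hypothetical confusion counts:

```python
tp, fp, fn = 3, 1, 1  # hypothetical counts from a binary classifier

precision = tp / (tp + fp)  # correct positives / predicted positives
recall = tp / (tp + fn)     # correct positives / actual positives

# F-measure: harmonic mean of precision and recall
f_measure = 2 * precision * recall / (precision + recall)
```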
Feature Engineering
A feature is an attribute of a data set that is used in the machine learning process.
Feature engineering is an important pre-processing step for machine learning, with two major elements:
• Feature transformation
• Feature sub-set selection
Feature Engineering cont...
Feature Transformation: transforms data into a new set of features which can represent the underlying machine learning problem.
• Feature Construction
• Feature Extraction
The feature construction process discovers missing information about the relationships between features and augments the feature space.
Feature Engineering cont...
Feature Extraction: is the process of extracting or creating a new set of features from the original set of features using some functional mapping.
Examples: Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Linear Discriminant Analysis (LDA).
Thank You
