End-To-End Machine Learning Project
▪ Phase 1: Get data – Prepare Data
Dr. Mostafa A. Elhosseini
https://youtube.com/c/mostafaelhosseini
AGENDA
• Working with Real Data
• California housing price dataset
• Look at the big picture
• Frame the problem
• Get the data
• Discover and visualize the data to
gain insights
▪ Prepare the data for machine
learning algorithms
Prepare the Data for ML Algorithms
Ꚛ It’s time to prepare the data for your Machine Learning algorithms.
Instead of just doing this manually, you should write functions to do
that, for several good reasons:
▪ This will allow you to reproduce these transformations easily on any dataset
▪ You will gradually build a library of transformation functions that you can
reuse in future projects.
▪ You can use these functions in your live system to transform the new data
before feeding it to your algorithms.
▪ This will make it possible for you to easily try various transformations and see
which combination of transformations works best.
Data Cleaning
Ꚛ Most Machine Learning algorithms cannot work with missing features, so
let’s create a few functions to take care of them.
▪ You noticed earlier that the total_bedrooms attribute has some missing values, so
let’s fix this.
Ꚛ You have three options:
▪ Get rid of the corresponding districts.
▪ Get rid of the whole attribute.
▪ Set the values to some value (zero, the mean, the median, etc.).
Ꚛ You can accomplish these easily using DataFrame’s dropna(), drop(), and
fillna() methods
Ꚛ Scikit-Learn provides a handy class to take care of missing values: SimpleImputer
(it replaced the older Imputer class, which was deprecated in Scikit-Learn 0.20)
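The three options can be sketched as follows. This is a minimal illustration on made-up numbers, not the actual housing data: the `housing` DataFrame and its `total_rooms` values are assumptions, and `SimpleImputer` is the current name of Scikit-Learn's imputer class in `sklearn.impute`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the housing DataFrame.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: get rid of the districts (rows) with a missing total_bedrooms.
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column).
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: set the missing values to some value, here the median.
median = housing["total_bedrooms"].median()
option3 = housing["total_bedrooms"].fillna(median)

# Same idea as option 3, but the imputer stores the medians it learned
# (in imputer.statistics_) so they can be reused on new data.
imputer = SimpleImputer(strategy="median")
housing_filled = pd.DataFrame(
    imputer.fit_transform(housing),
    columns=housing.columns,
    index=housing.index,
)
```

One reason to prefer the imputer over a one-off `fillna()` is that the fitted object can later be applied to the test set and to live data with `imputer.transform(...)`, using the medians computed on the training set.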
Handling Text and Categorical Attributes
Ꚛ Most Machine Learning algorithms prefer to work with numbers anyway, so let’s
convert these text labels to numbers.
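Ꚛ Scikit-Learn provides a transformer for this task called OrdinalEncoder
(LabelEncoder does a similar job, but it is intended for target labels rather than input features)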
Ꚛ One issue with this representation is that ML algorithms will assume that two
nearby values are more similar than two distant values
Ꚛ To fix this issue, a common solution is to create one binary attribute per
category: the attribute is equal to 1 when the instance belongs to that category (and 0 otherwise)
▪ This is called one-hot encoding
Ꚛ Scikit-Learn provides a OneHotEncoder class to convert categorical
values into one-hot vectors
Ꚛ Since Scikit-Learn 0.20, OneHotEncoder accepts text categories directly, so both
transformations (from text categories to integer categories, then from integer
categories to one-hot vectors) happen in one shot. The LabelBinarizer class also
does this, but it is meant for target labels rather than input features
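As a rough sketch of both encodings, the snippet below uses a small made-up categorical column. The column name `ocean_proximity` and its values are assumptions for illustration, and the encoders shown are the ones current Scikit-Learn versions recommend for input features.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical text attribute; encoders expect 2D input, hence a DataFrame.
housing_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

# Text categories -> integer codes (one code per category, in sorted order).
ordinal_encoder = OrdinalEncoder()
cat_encoded = ordinal_encoder.fit_transform(housing_cat)

# Text categories -> one-hot vectors, in one step.
one_hot_encoder = OneHotEncoder()
cat_1hot = one_hot_encoder.fit_transform(housing_cat)  # SciPy sparse matrix
cat_1hot_dense = cat_1hot.toarray()                    # dense copy for inspection

# The learned categories are available on the fitted encoder.
categories = list(one_hot_encoder.categories_[0])
```

The one-hot result is sparse by default because each row holds a single 1; calling `toarray()` is only sensible for small examples like this one.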
Custom Transformers
Ꚛ Although Scikit-Learn provides many useful transformers, you will need to
write your own for tasks such as custom cleanup operations or combining
specific attributes.
Ꚛ You will want your transformer to work seamlessly with Scikit-Learn
functionalities (such as pipelines)
Ꚛ A transformer hyperparameter that switches an added attribute on or off will
allow you to easily find out whether adding this attribute helps the Machine
Learning algorithms or not.
Ꚛ More generally, you can add a hyperparameter to gate any data
preparation step that you are not 100% sure about.
Ꚛ The more you automate these data preparation steps, the more
combinations you can automatically try out, making it much more likely
that you will find a great combination (and saving you a lot of time).
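A custom transformer with such a gating hyperparameter might look like the sketch below. The class name `BedroomsRatioAdder`, the flag `add_bedrooms_per_room`, and the column positions are all hypothetical names chosen for this illustration; only `BaseEstimator` and `TransformerMixin` come from Scikit-Learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical column positions in the input array.
ROOMS_IX, BEDROOMS_IX = 0, 1


class BedroomsRatioAdder(BaseEstimator, TransformerMixin):
    """Optionally appends a bedrooms-per-room column to the input."""

    def __init__(self, add_bedrooms_per_room=True):
        # The hyperparameter that gates the extra attribute.
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if not self.add_bedrooms_per_room:
            return X
        ratio = X[:, BEDROOMS_IX] / X[:, ROOMS_IX]
        return np.c_[X, ratio]


X = np.array([[10.0, 2.0], [8.0, 4.0]])
with_ratio = BedroomsRatioAdder(add_bedrooms_per_room=True).fit_transform(X)
without_ratio = BedroomsRatioAdder(add_bedrooms_per_room=False).fit_transform(X)
```

Inheriting from `TransformerMixin` supplies `fit_transform()`, and `BaseEstimator` supplies `get_params()` and `set_params()`, which is what lets the flag be tuned automatically alongside model hyperparameters.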
Feature Scaling
Ꚛ One of the most important transformations
you need to apply to your data is feature
scaling.
Ꚛ With few exceptions, Machine Learning
algorithms don’t perform well when the input
numerical attributes have very different scales
Ꚛ There are two common ways to get all
attributes to have the same scale: min-max
scaling and standardization
Ꚛ Min-max scaling (many people call this
normalization) is quite simple: values are
shifted and rescaled so that they end up
ranging from 0 to 1
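Written out by hand, min-max scaling subtracts the minimum and divides by the range (max minus min). The values below are made up for illustration:

```python
import numpy as np

# Hypothetical attribute values on an arbitrary scale.
values = np.array([10.0, 20.0, 25.0, 50.0])

# Shift so the minimum becomes 0, then rescale so the maximum becomes 1.
scaled = (values - values.min()) / (values.max() - values.min())
```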
Feature Scaling
Ꚛ Scikit-Learn provides a transformer called MinMaxScaler for this. It
has a feature_range hyperparameter that lets you change the range
if you don’t want 0–1 for some reason
Ꚛ Standardization is quite different: first it subtracts the mean value (so
standardized values always have a zero mean), and then it divides by
the standard deviation so that the resulting distribution has unit
variance.
▪ Unlike min-max scaling, standardization does not bound values to a specific
range, which may be a problem for some algorithms (e.g., neural networks often
expect an input ranging from 0 to 1)
▪ However, standardization is much less affected by outliers
▪ Scikit-Learn provides a transformer called StandardScaler for standardization
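Both scalers can be tried side by side on a small made-up column. This is only a sketch: the data and the choice of `feature_range=(-1, 1)` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single attribute; scalers expect a 2D array (rows x columns).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling to a custom range instead of the default 0-1.
min_max = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# min_max is bounded by the chosen range; standardized has zero mean and
# unit variance but no fixed bounds.
```

As with any transformer, the scaler should be fitted on the training set only and then reused, via `transform()`, on the test set and on new data.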
Transformation Pipelines
Ꚛ As you can see, there are many data transformation steps that need
to be executed in the right order.
Ꚛ Fortunately, Scikit-Learn provides the Pipeline class to help with such
sequences of transformations
Ꚛ The Pipeline constructor takes a list of name/estimator pairs defining
a sequence of steps. All but the last estimator must be transformers
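A minimal pipeline for numerical attributes could be sketched as below. The step names `"imputer"` and `"std_scaler"` and the toy array are assumptions for illustration; the classes are standard Scikit-Learn ones.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; steps run in the listed order.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values first
    ("std_scaler", StandardScaler()),               # then standardize
])

# Hypothetical numerical data with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# fit_transform() calls fit_transform() on each step in turn,
# passing each step's output to the next one.
X_prepared = num_pipeline.fit_transform(X)
```

Calling `num_pipeline.transform(new_data)` later reapplies the same fitted steps, which is what makes the pipeline usable on the test set and in a live system.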
To conclude…
Ꚛ At last! You framed the problem,
Ꚛ you got the data and explored it,
Ꚛ you sampled a training set and a test set, and
Ꚛ you wrote transformation pipelines to clean up and prepare your
data for Machine Learning algorithms automatically.
Ꚛ You are now ready to select and train a Machine Learning model