Lecture 08 prepare the data for ml algorithm

End-To-End Machine Learning Project
▪ Phase 1: Get data – Prepare Data
Dr. Mostafa A. Elhosseini
https://youtube.com/c/mostafaelhosseini

AGENDA
• Working with Real Data
• California housing price datasets
• Look at the big picture
• Frame the problem
• Get the data
• Discover and visualize the data to
gain insights
▪ Prepare the data for machine
learning algorithms

Prepare the Data for ML Algorithms
Ꚛ It’s time to prepare the data for your Machine Learning algorithms.
Instead of just doing this manually, you should write functions to do
that, for several good reasons:
▪ This will allow you to reproduce these transformations easily on any dataset
▪ You will gradually build a library of transformation functions that you can
reuse in future projects.
▪ You can use these functions in your live system to transform the new data
before feeding it to your algorithms.
▪ This will make it possible for you to easily try various transformations and see
which combination of transformations works best.

Data Cleaning
Ꚛ Most Machine Learning algorithms cannot work with missing features, so
let’s create a few functions to take care of them.
▪ You noticed earlier that the total_bedrooms attribute has some missing values, so
let’s fix this.
Ꚛ You have three options:
▪ Get rid of the corresponding districts.
▪ Get rid of the whole attribute.
▪ Set the values to some value (zero, the mean, the median, etc.).
Ꚛ You can accomplish these easily using DataFrame’s dropna(), drop(), and
fillna() methods
Ꚛ Scikit-Learn provides a handy class to take care of missing values:
SimpleImputer (named Imputer in releases before 0.20)
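
The three options can be sketched as follows. This is a minimal illustration, not the lecture's own code: the tiny DataFrame and its values are made up, and total_rooms is an assumed second column. The last step uses SimpleImputer from sklearn.impute, the imputation class in current Scikit-Learn releases, with the median strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the housing data (made-up values, one missing entry).
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: get rid of the districts (rows) with a missing total_bedrooms.
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column).
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: set the missing values to some value, here the median.
median = housing["total_bedrooms"].median()
option3 = housing.fillna({"total_bedrooms": median})

# Same idea as option 3, but the imputer remembers the medians it learned,
# so the same values can later be applied to the test set or to new data.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(housing),
    columns=housing.columns,
    index=housing.index,
)
```

If option 3 is chosen, the median should be computed on the training set only and then reused to fill missing values in the test set and in new data.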

Handling Text and Categorical Attributes
Ꚛ Most Machine Learning algorithms prefer to work with numbers, so let’s
convert text labels to numbers.
Ꚛ Scikit-Learn provides a transformer for this task called LabelEncoder
(it is documented for target labels; for input features, current releases
provide OrdinalEncoder)
Ꚛ One issue with this representation is that ML algorithms will assume that two
nearby values are more similar than two distant values
Ꚛ To fix this issue, a common solution is to create one binary attribute per
category: the attribute is equal to 1 when the instance belongs to that
category (and 0 otherwise)
▪ This is called one-hot encoding
Ꚛ Scikit-Learn provides a OneHotEncoder encoder to convert integer categorical
values into one-hot vectors
Ꚛ We can apply both transformations (from text categories to integer categories,
then from integer categories to one-hot vectors) in one shot using the
LabelBinarizer class (since Scikit-Learn 0.20, OneHotEncoder also accepts text
categories directly, so it can do both steps for input features)
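
A minimal sketch of both encodings on a text category, assuming a recent Scikit-Learn release (0.20 or later) where OrdinalEncoder and OneHotEncoder accept text categories directly. The ocean_proximity name and its values are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical text attribute, shaped (n_samples, 1) as the encoders expect.
ocean_proximity = np.array(
    [["INLAND"], ["NEAR BAY"], ["INLAND"], ["NEAR OCEAN"]]
)

# Text categories -> integer codes (categories are sorted alphabetically).
ordinal_encoder = OrdinalEncoder()
as_integers = ordinal_encoder.fit_transform(ocean_proximity)

# Text categories -> one binary attribute per category.
onehot_encoder = OneHotEncoder()
as_onehot = onehot_encoder.fit_transform(ocean_proximity)  # SciPy sparse matrix
dense_onehot = as_onehot.toarray()  # dense copy, only for inspection

print(ordinal_encoder.categories_)
print(as_integers.ravel())
print(dense_onehot)
```

The integer codes imply an ordering (0 < 1 < 2) that the categories do not have, which is the issue described above; the one-hot matrix avoids it at the cost of one column per category.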

Custom Transformers
Ꚛ Although Scikit-Learn provides many useful transformers, you will need to
write your own for tasks such as custom cleanup operations or combining
specific attributes.
Ꚛ You will want your transformer to work seamlessly with Scikit-Learn
functionalities (such as pipelines)
Ꚛ Giving a custom transformer a hyperparameter that switches an extra
attribute on or off will allow you to easily find out whether adding this
attribute helps the Machine Learning algorithms or not.
Ꚛ More generally, you can add a hyperparameter to gate any data
preparation step that you are not 100% sure about.
Ꚛ The more you automate these data preparation steps, the more
combinations you can automatically try out, making it much more likely
that you will find a great combination (and saving you a lot of time).
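
One way to write such a transformer is sketched below. The class name RatioAdder, its add_ratio hyperparameter, the column indices, and the sample values are all hypothetical names chosen for this illustration. Inheriting from BaseEstimator gives get_params() and set_params(), and TransformerMixin gives fit_transform(), which is what lets the class work inside Scikit-Learn pipelines.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Optionally append the ratio of two columns as a new attribute."""

    def __init__(self, numerator_ix=1, denominator_ix=0, add_ratio=True):
        # Hyperparameters are stored under the same names as the arguments,
        # which is what get_params() / set_params() rely on.
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if not self.add_ratio:
            return X
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]


# Made-up rows: column 0 = total_rooms, column 1 = total_bedrooms.
X = np.array([[1000.0, 200.0], [500.0, 125.0]])
with_ratio = RatioAdder(add_ratio=True).fit_transform(X)
without_ratio = RatioAdder(add_ratio=False).fit_transform(X)
```

Because add_ratio is an ordinary constructor argument, an automated search over hyperparameters can try both settings and report which one gives the better model.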

Feature Scaling
Ꚛ One of the most important transformations
you need to apply to your data is feature
scaling.
Ꚛ With few exceptions, Machine Learning
algorithms don’t perform well when the input
numerical attributes have very different scales
Ꚛ There are two common ways to get all
attributes to have the same scale: min-max
scaling and standardization
Ꚛ Min-max scaling (many people call this
normalization) is quite simple: values are
shifted and rescaled so that they end up
ranging from 0 to 1
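
As a small worked example of the shift-and-rescale step, the sketch below applies the formula (x - min) / (max - min) with NumPy. The median_income name and its values are made up for illustration.

```python
import numpy as np

# Made-up attribute values on an arbitrary scale.
median_income = np.array([1.5, 3.0, 4.5, 9.0])

# Shift by the minimum, then rescale by the range (max - min).
value_range = median_income.max() - median_income.min()
scaled = (median_income - median_income.min()) / value_range

print(scaled)  # smallest value maps to 0, largest to 1
```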

Feature Scaling
Ꚛ Scikit-Learn provides a transformer called MinMaxScaler for this. It
has a feature_range hyperparameter that lets you change the range
if you don’t want 0–1 for some reason
Ꚛ Standardization is quite different: first it subtracts the mean value (so
standardized values always have a zero mean), and then it divides by
the standard deviation so that the resulting distribution has unit
variance.
▪ Unlike min-max scaling, standardization does not bound values to a specific
range, which may be a problem for some algorithms (neural networks often
expect an input ranging from 0 to 1)
▪ However, standardization is much less affected by outliers
▪ Scikit-Learn provides a transformer called StandardScaler for standardization
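
The two scalers can be compared on the same column, as in the minimal sketch below. The data are made up, with one deliberate outlier, and the feature_range of (-1, 1) is an arbitrary choice to show the hyperparameter.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up attribute; the last value is a deliberate outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling to a custom range instead of the default 0-1.
minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# The outlier pins the top of the min-max range, so the four ordinary
# values are squeezed into a narrow band near -1.
print(minmax.ravel())
# Standardized values have zero mean and unit variance, but no fixed bounds.
print(standardized.ravel(), standardized.mean(), standardized.std())
```

As with any transformer, the scalers should be fitted on the training data only and then applied to the test set and to new data.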

Transformation Pipelines
Ꚛ As you can see, there are many data transformation steps that need
to be executed in the right order.
Ꚛ Fortunately, Scikit-Learn provides the Pipeline class to help with such
sequences of transformations
Ꚛ The Pipeline constructor takes a list of name/estimator pairs defining
a sequence of steps. All but the last estimator must be transformers
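
A minimal sketch of such a pipeline for numerical attributes, chaining median imputation and standardization. The step names ("imputer", "std_scaler") and the small array are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; steps run in the order listed.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Made-up numerical data with one missing value in the second column.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

# fit_transform() calls fit_transform() on each step in turn, passing the
# output of one step as the input of the next.
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)
```

Calling num_pipeline.transform() on new data later reuses the medians, means, and standard deviations learned during fitting.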

To conclude…
Ꚛ At last! You framed the problem,
Ꚛ you got the data and explored it,
Ꚛ you sampled a training set and a test set, and
Ꚛ you wrote transformation pipelines to clean up and prepare your
data for Machine Learning algorithms automatically.
Ꚛ You are now ready to select and train a Machine Learning model
