Start Machine learning programming in
5 simple steps
By Renjith M P
https://www.linkedin.com/in/renjith-m-p-bbb67860/
To get started with machine learning, we follow five basic steps.
Steps
1. Choose a use case / problem statement :- Define your objective.
2. Prepare data to train the system :- For any machine learning project, you first need data to train the system with.
3. Choose a programming language and useful libraries for machine learning :- You need a programming language, and supporting libraries, to implement your solution.
4. Training and prediction implementation :- Implement your solution using the programming language that you have selected.
5. Evaluate the result accuracy :- Validate the results. Based on the accuracy, we either accept the model or fine-tune its parameters and improve it until we get a satisfactory result.
Warning:
Target Audience : Basic knowledge of Python (executing Python scripts, installing packages, etc.) is mandatory to follow this course.
Let's get into action. We will choose a use case and implement machine learning for it.
1. Choose a use case / problem statement
Use case : Predict the species of an iris flower based on the lengths and widths of its sepals and petals.
The three species are Iris setosa, Iris versicolor and Iris virginica.
2. Prepare data to train the system
We will be using the iris flower data set (https://en.wikipedia.org/wiki/Iris_flower_data_set ), which consists of 150 rows. Each row has 5 columns:
1. sepal length
2. sepal width
3. petal length
4. petal width
5. species of iris plant
Out of the 150 rows, 120 rows will be used to train the model and the remaining 30 will be used to validate the accuracy of its predictions.
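As a rough sketch of what this 120/30 split means, the snippet below shuffles the 150 row numbers and sets 20% of them aside. It uses only NumPy; the seed value and the variable names are illustrative assumptions, and the tutorial itself relies on scikit-learn's train_test_split for the real split.

```python
import numpy as np

n_rows = 150            # rows in the iris data set
validation_size = 0.20  # fraction held back for validation

rng = np.random.default_rng(7)     # fixed seed (an arbitrary choice) for repeatability
indices = rng.permutation(n_rows)  # shuffled row numbers 0..149

n_validation = round(n_rows * validation_size)
validation_idx = indices[:n_validation]  # rows used only for validation
train_idx = indices[n_validation:]       # rows used only for training

print(len(train_idx), len(validation_idx))  # 120 30
```

Shuffling before splitting matters because data files are often sorted by class; taking the last 30 rows of a sorted file would put only one species in the validation set.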
3. Choose a programming language and libraries for machine learning
There are quite a few options available; the best-known ones are R and Python.
My choice is Python. Unlike R, which is aimed primarily at statistical computing, Python is a complete general-purpose language and platform that you can use both for research and development and for building production systems.
Ecosystem & Libraries
Machine learning needs plenty of numeric computation, data mining, algorithms and plotting.
Python offers several ecosystems and libraries for these tasks. One of the most commonly used ecosystems is SciPy, a collection of open source software for scientific computing in Python that contains many packages and libraries.
Below is the list of packages from the SciPy ecosystem that we are going to use.
Package         Description
NumPy           The fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.
Matplotlib      A mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting.
SciPy library   One of the components of the SciPy stack, providing many numerical routines.
pandas          Provides high-performance, easy-to-use data structures.
sklearn         Simple and efficient tools for data mining and data analysis; accessible to everybody and reusable in various contexts; built on NumPy, SciPy, and matplotlib.
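Before going further, it can help to confirm that these packages are actually installed. The snippet below is a minimal sketch using only the standard library; the helper name missing_packages is an assumption made for this example. Note that the import name sklearn corresponds to the pip package scikit-learn.

```python
import importlib.util

# Import names of the packages used in this tutorial
required = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as missing can be installed with pip/pip3 before running the rest of the code.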
4. Training, Prediction and validation implementation
4.1. Import libraries (before importing, make sure you have installed them using pip/pip3)
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
4.2. Load data to train the model
# Load dataset (the CSV file has no header row, so column names are supplied here)
url = "https://raw.githubusercontent.com/renjithmp/machinelearning/master/python/usecases/1_irisflowers/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
# Print important information about the dataset
print(dataset.shape)                    # (rows, columns)
print(dataset.head(20))                 # first 20 rows
print(dataset.describe())               # count, mean, min, max and quartiles per column
print(dataset.groupby('class').size())  # number of rows per species
# Visualize the data: one box plot per numeric column
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
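If the CSV URL is not reachable, one possible fallback is the copy of the iris data that ships with scikit-learn. The sketch below builds a DataFrame with the same column names as above. It is an alternative, not the author's loading code, and it assumes the class labels may differ from the CSV, because load_iris names the species 'setosa', 'versicolor' and 'virginica'.

```python
import pandas
from sklearn.datasets import load_iris

iris = load_iris()  # bundled with scikit-learn, no download needed
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

dataset = pandas.DataFrame(iris.data, columns=names[:4])
# Map the numeric targets 0, 1, 2 to the species names
dataset['class'] = [iris.target_names[i] for i in iris.target]

print(dataset.shape)                    # (150, 5)
print(dataset.groupby('class').size())  # 50 rows per species
```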
Explanation :
dataset.plot() function :-
The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot, the central rectangle spans the first quartile to the third quartile (the interquartile range, or IQR). A segment inside the rectangle shows the median, and "whiskers" above and below the box show the locations of the minimum and maximum.
In machine learning, it is important to analyse the data from different angles. Visualizing the data with plots makes this much easier than reading it in tabular format.
For our use case, we get one box plot each for sepal length, sepal width, petal length and petal width.
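To make the five-number summary concrete, here is a small sketch that computes it with NumPy. The sample values are hypothetical, not iris measurements, and the function name is an assumption for this example. np.percentile interpolates linearly between data points by default, so other tools may report slightly different quartiles.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    return data.min(), q1, median, q3, data.max()

# Hypothetical measurements, not taken from the iris data
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
minimum, q1, median, q3, maximum = five_number_summary(sample)
print(minimum, q1, median, q3, maximum)  # 1.0 3.0 5.0 7.0 9.0
print("IQR:", q3 - q1)                   # IQR: 4.0
```

In a box plot of this sample, the box would run from 3 to 7 with the median line at 5.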
4.3. Split the data for training and validation
array = dataset.values
X = array[:, 0:4]   # the four measurement columns
Y = array[:, 4]     # the class (species) column
validation_size = 0.20
seed = 7
scoring = 'accuracy'
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
Explanation :-
X_train – training data (120 rows of sepal and petal lengths and widths)
Y_train – training labels (120 rows holding the class of each plant)
X_validation – validation data (30 rows of sepal and petal lengths and widths)
Y_validation – validation labels (30 rows holding the class of each plant)
4.4. Train a few models using the training data
Let's use X_train and Y_train to train a few models.
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
Explanations of the algorithms can be found at scikit-learn.org, e.g.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I am not covering them here, as they need a much deeper explanation. For now, keep in mind that a model is something that can learn by itself from the training data and then predict the output for future inputs.
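The "learn, then predict" idea can be seen on a tiny made-up example. The sketch below trains a KNeighborsClassifier on six hypothetical one-feature rows and asks it about two inputs it has never seen. The data and the choice of n_neighbors=3 are assumptions for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-feature training data: small values are class 'a', large are 'b'
toy_X = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]]
toy_Y = ['a', 'a', 'a', 'b', 'b', 'b']

model = KNeighborsClassifier(n_neighbors=3)
model.fit(toy_X, toy_Y)  # "learning" from the training data

predictions = model.predict([[1.1], [5.1]])  # inputs the model has not seen
print(predictions)  # ['a' 'b']
```

Every scikit-learn classifier used in this tutorial follows the same fit() and predict() pattern, which is why the models can be swapped so easily.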
Explanation (of the cross-validation loop that scores the models) :
KFold :- a very useful class for dividing the data set into folds, shuffling it first if asked to. Here we divide the training data into 10 equal parts.
cross_val_score :- This is the most important step. We feed the model the training data (X_train as input and Y_train as the corresponding output). The function trains the model on nine of the folds, tests it on the remaining fold, and repeats this so that every fold is used for testing once, returning one accuracy value per fold (remember, we used 10 folds).
We then take the mean and standard deviation of the 10 fold scores to summarize the accuracy over the entire training set.
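To show what happens inside cross_val_score, the loop below writes the same idea out by hand for a single model. It is a sketch under assumptions: it uses the iris copy bundled with scikit-learn and all 150 rows rather than X_train, and it is not the library's actual implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = KNeighborsClassifier()         # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])  # train on nine folds
    # score() returns the accuracy on the held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print("%f (%f)" % (fold_scores.mean(), fold_scores.std()))
```

Each row is used for testing exactly once, so the mean of the ten scores is a fairer estimate than a single train/test run.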
4.5. Choose the model that seems most accurate
Each model is scored with 10-fold cross-validation on the training data:
results = []
names = []
for name, model in models:
    # shuffle=True is needed for random_state to take effect;
    # recent scikit-learn versions raise an error without it
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
As you can see, we have run six different models (six different algorithms) on the training data, and the mean scores (cv_results.mean()) show that KNeighborsClassifier() gives the most accurate results (0.98, or 98%).
4.6. Predict and validate the results using the validation data set
Let's choose KNN and predict the output for the validation data.
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
5. Publish results
The accuracy_score() function reports the accuracy of the predictions. In our use case we see an accuracy of 0.90 (90%) on the validation data.
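Accuracy is simply the fraction of predictions that match the true labels, so 0.90 on 30 validation rows corresponds to 27 correct predictions. The sketch below computes this by hand. The label lists are hypothetical and only illustrate the arithmetic; they are not the actual predictions from the run above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 30 validation rows, 3 of them predicted wrongly
y_true = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 10 + ["Iris-virginica"] * 10
y_pred = list(y_true)
y_pred[10] = "Iris-virginica"   # a versicolor mistaken for virginica
y_pred[11] = "Iris-virginica"   # another versicolor mistaken for virginica
y_pred[20] = "Iris-versicolor"  # a virginica mistaken for versicolor

print(accuracy(y_true, y_pred))  # 27 correct out of 30
```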
You can find the source code here:
https://github.com/renjithmp/machinelearning/blob/master/python/usecases/1_irisflowers/flowerclassprediction.py
Reference
Jason Brownlee article
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
Scikit
https://scikit-learn.org
