Practical Predictive
Modeling in Python
Robert Dempsey
robertwdempsey.com
robertwdempsey
rdempsey
pythonbicookbook.com
Doing All Things In SQL
Makes Panda sad and confused
Each New Thing You Learn
Leads to another new thing to learn, and another, and…
So Many Things
1. Which predictive modeling technique to use
2. How to get the data into a format for modeling
3. How to ensure the “right” data is being used
4. How to feed the data into the model
5. How to validate the model results
6. How to save the model to use in production
7. How to implement the model in production and apply it to new observations
8. How to save the new predictions
9. How to ensure, over time, that the model is correctly predicting outcomes
10. How to later update the model with new training data
Choose Your Model
Model Selection
• How much data do you have?
• Are you predicting a category? A quantity?
• Do you have labeled data?
• Do you know the number of categories?
Regression
• Used for estimating the relationships among
variables
• Use when:
• Predicting a quantity
• More than 50 samples
Classification
• Used to answer “what is this object”
• Use when:
• Predicting a category
• Have labeled data
Clustering
• Used to group similar objects
• Use when:
• Predicting a category
• Don’t have labeled data
• Number of categories is known or unknown
• Have more than 50 samples
Dimensionality Reduction
• Process for reducing the number of random
variables under consideration (feature selection
and feature extraction)
• Use when:
• Not predicting a category or a quantity
• Just looking around
Model Selection
http://scikit-learn.org/stable/tutorial/machine_learning_map/
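The questions above can be read as a small decision procedure. The sketch below is one possible encoding, not part of the original slides: the function name and its arguments are hypothetical, and applying the 50-sample check up front to every technique is an assumption borrowed from the linked scikit-learn map.

```python
def suggest_technique(n_samples, predicting, has_labels=False):
    """Suggest a technique family from the model-selection questions.

    predicting: "category", "quantity", or None when just looking around.
    The 50-sample threshold is the one quoted on the slides.
    """
    if n_samples <= 50:
        # Assumption: too little data for any of the techniques listed
        return "get more data"
    if predicting == "category":
        # Labeled data points to classification, unlabeled to clustering
        return "classification" if has_labels else "clustering"
    if predicting == "quantity":
        return "regression"
    # Not predicting a category or a quantity
    return "dimensionality reduction"


print(suggest_technique(1000, "category", has_labels=True))
print(suggest_technique(1000, "category", has_labels=False))
print(suggest_technique(1000, "quantity"))
print(suggest_technique(1000, None))
```

The full map has many more branches (sample counts above 100K, text data, and so on), so this only covers the four families named here.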
Format Thine Data
Format The Data
• Pandas FTW!
• Use the map() function to convert any text to a
number
• Fill in any missing values
• Split the data into features (the data) and targets
(the outcome to predict) using .values on the
DataFrame
map()
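def update_failure_explanations(failure_type):
    # Convert a text failure explanation to an integer code.
    # The parameter is renamed from `type` to avoid shadowing the built-in.
    if failure_type == 'dob':
        return 0
    elif failure_type == 'name':
        return 1
    elif failure_type == 'ssn dob name':
        return 2
    elif failure_type == 'ssn':
        return 3
    elif failure_type == 'ssn name':
        return 4
    elif failure_type == 'ssn dob':
        return 5
    elif failure_type == 'dob name':
        return 6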
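The slide shows the mapping function but not the call that applies it. A minimal sketch of that step follows, assuming a hypothetical column named failure_explanation; it uses a lookup dict with the same codes, since Series.map() accepts either a function or a dict.

```python
import pandas as pd

# Hypothetical column holding the text labels from the slide
df = pd.DataFrame({"failure_explanation": ["dob", "ssn name", "name", "dob name"]})

# Same codes as update_failure_explanations(), written as a lookup table
FAILURE_CODES = {
    "dob": 0, "name": 1, "ssn dob name": 2, "ssn": 3,
    "ssn name": 4, "ssn dob": 5, "dob name": 6,
}

# Values missing from the dict become NaN, so unexpected labels are easy to spot
df["failure_code"] = df["failure_explanation"].map(FAILURE_CODES)
print(df)
```

Passing the function itself, as in df["failure_explanation"].map(update_failure_explanations), gives the same codes for known labels.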
Fill In Missing Values
# Fill one text column with a placeholder value
df['my_field'] = df['my_field'].fillna('Missing')
# Fill every remaining missing value with 0
df = df.fillna(0)
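To see the effect of both calls, here is a small sketch on made-up data. The column names and values are hypothetical; only the two fillna() steps come from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing text value and one missing number
df = pd.DataFrame({
    "my_field": ["a", None, "c"],
    "score": [1.0, np.nan, 3.0],
})

# Text column first, so its gaps get a label rather than 0
df["my_field"] = df["my_field"].fillna("Missing")
# Then everything that is still missing
df = df.fillna(0)

# Count what is left; a model-ready frame should have none
remaining = int(df.isna().sum().sum())
print(df)
print(remaining)
```

The order matters: running the blanket fillna(0) first would put 0 into the text column as well.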
Split the Data
1. Create a matrix of feature values
t_data = raw_data.iloc[:, 0:22].values
2. Create an array of targets
t_targets = raw_data['verified'].values
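The split can be checked by looking at the resulting shapes. This sketch uses a made-up three-row frame with two hypothetical feature columns, and selects features by name rather than by position, which is an alternative to the iloc slice on the slide.

```python
import pandas as pd

# Hypothetical data: two feature columns plus the 'verified' target
raw_data = pd.DataFrame({
    "score_a": [0.9, 0.2, 0.7],
    "score_b": [0.8, 0.1, 0.4],
    "verified": [1, 0, 1],
})

# Everything except the target is a feature
feature_columns = [c for c in raw_data.columns if c != "verified"]
t_data = raw_data[feature_columns].values   # 2D array: rows x features
t_targets = raw_data["verified"].values     # 1D array: one target per row

print(t_data.shape)
print(t_targets.shape)
```

Keeping feature_columns around is handy later, because production data has to arrive in the same column order.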
Get the (Right) Data
Get The Right Data
• This is called “Feature selection”
• Univariate feature selection
• SelectKBest removes all but the k highest scoring features
• SelectPercentile removes all but a user-specified highest scoring
percentage of features
• Using common univariate statistical tests for each feature: false
positive rate SelectFpr, false discovery rate SelectFdr, or family-wise
error SelectFwe
• GenericUnivariateSelect performs univariate feature selection
with a configurable strategy
http://scikit-learn.org/stable/modules/feature_selection.html
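As an illustration of the first option, the sketch below runs SelectKBest on synthetic data. The data set, the choice of k=5, and the f_classif scoring function are assumptions made for the example; the slides do not say which selector or settings were used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real data: 22 features, 5 of them informative
X, y = make_classification(
    n_samples=200, n_features=22, n_informative=5, random_state=111
)

# Keep the 5 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that survived
kept = selector.get_support(indices=True)
print(X_selected.shape)
print(kept)
```

The same fitted selector has to be applied to new observations in production, so it is worth saving alongside the model.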
Feed Your Model
Data => Model
1. Build the model
from sklearn import linear_model
logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
2. Train the model
http://scikit-learn.org/stable/modules/cross_validation.html
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
# train_test_split() has no cv parameter, so the original cv=12 is dropped
X_train, X_test, y_train, y_test = train_test_split(
    the_data, the_targets, test_size=0.20, random_state=111)
logClassifier.fit(X_train, y_train)
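The cv=12 argument in the original slide suggests that 12-fold cross-validation may have been intended. If so, cross_val_score is a call that does accept cv. The sketch below is an assumption about that intent, run on synthetic data in place of the_data and the_targets; max_iter=1000 is added only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the_data / the_targets
the_data, the_targets = make_classification(
    n_samples=240, n_features=22, random_state=111
)

logClassifier = LogisticRegression(C=1, random_state=111, max_iter=1000)

# Fit and score the model on 12 different train/test splits
scores = cross_val_score(logClassifier, the_data, the_targets, cv=12)
print(len(scores))
print(scores.mean())
```

Cross-validation gives a spread of scores rather than a single number, which helps show whether one lucky split is flattering the model.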
Validate That!
Validation
1. Accuracy Score
http://scikit-learn.org/stable/modules/cross_validation.html
from sklearn import metrics
# Predict on the held-out test set first; `predicted` is used by both metrics
predicted = logClassifier.predict(X_test)
metrics.accuracy_score(y_test, predicted)
2. Confusion Matrix
metrics.confusion_matrix(y_test, predicted)
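A tiny hand-made example makes the two metrics easier to read. The labels below are invented for illustration: four of the six predictions match, with one error in each direction.

```python
from sklearn import metrics

# Hypothetical true labels and predictions
y_test = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]

# Fraction of predictions that match the true labels (4 of 6 here)
accuracy = metrics.accuracy_score(y_test, predicted)

# Rows are true classes, columns are predicted classes:
# [[true 0 predicted 0, true 0 predicted 1],
#  [true 1 predicted 0, true 1 predicted 1]]
matrix = metrics.confusion_matrix(y_test, predicted)

print(accuracy)
print(matrix)
```

Accuracy alone hides which kind of error is being made; the off-diagonal cells of the matrix show it.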
Save Your Model
Save the Model
Pickle it!
https://docs.python.org/3/library/pickle.html
import pickle
model_file = "/lr_classifier_09.29.15.dat"
with open(model_file, "wb") as f:
    pickle.dump(logClassifier, f)
Did it work?
# Load from the same path that was written (model_file, not model)
with open(model_file, "rb") as f:
    logClassifier2 = pickle.load(f)
print(logClassifier2)
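Printing the reloaded object shows that it loads, but not that it behaves the same. A stricter check, sketched here on synthetic data and a temporary file (both assumptions for the example), is to compare predictions before and after the round trip.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=100, n_features=5, random_state=111)
logClassifier = LogisticRegression(C=1, random_state=111).fit(X, y)

# Write to a temporary directory so the example cleans up after itself
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "lr_classifier.dat")
    with open(model_file, "wb") as f:
        pickle.dump(logClassifier, f)
    with open(model_file, "rb") as f:
        logClassifier2 = pickle.load(f)

# The reloaded model should predict exactly what the original does
same_predictions = bool(
    (logClassifier.predict(X) == logClassifier2.predict(X)).all()
)
print(same_predictions)
```

The pickle documentation warns that unpickling can execute arbitrary code, so only load model files from sources you trust.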
Ship It
Implement in Production
• Clean the data the same way you did for the model
  • Feature mappings
  • Column re-ordering
• Create a function that returns the prediction
  • Deserialize the model from the file you created
  • Feed the model the data in the same order
  • Call .predict() and get your answer
Example
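import pickle

def verify_record(record_scores):
    # Reload the trained model (one variable name for the path, used consistently)
    model_file = "models/t_lr_classifier_07.28.15.dat"
    with open(model_file, "rb") as f:
        log_classifier = pickle.load(f)
    # Return the prediction for the single record passed in
    return log_classifier.predict(record_scores)[0]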
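The function above assumes record_scores is already cleaned and ordered. A sketch of that preparation step follows; the feature names and their order are hypothetical, and the failure codes reuse the mapping shown earlier in the deck.

```python
# Hypothetical feature order; it must match the column order used in training
FEATURE_ORDER = ["name_score", "dob_score", "failure_explanation"]

# Same text-to-number mapping that was applied to the training data
FAILURE_CODES = {
    "dob": 0, "name": 1, "ssn dob name": 2, "ssn": 3,
    "ssn name": 4, "ssn dob": 5, "dob name": 6,
}


def prepare_record(record):
    """Turn one raw record (a dict) into a 2D list ready for .predict()."""
    cleaned = dict(record)
    # Feature mapping: convert the text label to its integer code
    cleaned["failure_explanation"] = FAILURE_CODES[record["failure_explanation"]]
    # Column re-ordering: emit values in the training order
    row = [cleaned[name] for name in FEATURE_ORDER]
    # scikit-learn expects a 2D structure: one row per observation
    return [row]


record_scores = prepare_record(
    {"failure_explanation": "ssn", "dob_score": 0.4, "name_score": 0.9}
)
print(record_scores)
```

An unknown label raises a KeyError here, which is usually preferable to silently feeding the model a value it was never trained on.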
Save The Predictions
Save Your Predictions
As you would any other piece of data
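One way to do this, sketched with the standard-library sqlite3 module, is to store each prediction with the record it belongs to, the model file that produced it, and a timestamp. The table layout and the function name are assumptions; any data store would do.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database for the example; a real system would use a file or server
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    "record_id INTEGER, prediction INTEGER, model_file TEXT, predicted_at TEXT)"
)


def save_prediction(conn, record_id, prediction, model_file):
    """Store one prediction with enough context to audit it later."""
    predicted_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?)",
        (record_id, int(prediction), model_file, predicted_at),
    )
    conn.commit()


save_prediction(conn, 42, 1, "models/t_lr_classifier_07.28.15.dat")
rows = conn.execute("SELECT record_id, prediction FROM predictions").fetchall()
print(rows)
```

Recording which model file made each prediction makes it possible to compare model versions once the true outcomes are known.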
(Keep) Getting it Right
Unleash the minion army!
… or get more creative
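One of the more creative options is to track accuracy automatically once true outcomes become available. The sketch below is an assumption about how that could look, not something shown in the deck: the function names, the weekly batches, and the 0.75 threshold are all hypothetical.

```python
def accuracy_by_batch(batches):
    """batches: list of (label, predictions, actual_outcomes) tuples."""
    report = []
    for label, predictions, actuals in batches:
        # Count the predictions that matched the verified outcome
        correct = sum(p == a for p, a in zip(predictions, actuals))
        report.append((label, correct / len(actuals)))
    return report


def batches_below(report, threshold):
    """Return the labels of batches whose accuracy fell below the threshold."""
    return [label for label, accuracy in report if accuracy < threshold]


# Hypothetical saved predictions paired with outcomes verified later
batches = [
    ("week 1", [1, 0, 1, 1], [1, 0, 1, 1]),
    ("week 2", [1, 0, 1, 1], [1, 0, 0, 1]),
    ("week 3", [1, 1, 1, 1], [0, 0, 1, 1]),
]
report = accuracy_by_batch(batches)
flagged = batches_below(report, threshold=0.75)
print(report)
print(flagged)
```

A falling trend across batches is a signal that the data has drifted and the model is due for retraining.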
Update It
Be Smart
Train it again, but with validated predictions
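A minimal sketch of that retraining step, assuming the validated predictions arrive as a new feature matrix and target array: append them to the original training data and fit a fresh model. The synthetic data and variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 original rows plus 100 newly validated rows
X, y = make_classification(n_samples=300, n_features=5, random_state=111)
X_old, y_old = X[:200], y[:200]
X_new, y_new = X[200:], y[200:]

# Combine the original training data with the validated observations
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

# Fit a fresh model on the combined data, using the same settings as before
updated_classifier = LogisticRegression(C=1, random_state=111)
updated_classifier.fit(X_all, y_all)
print(X_all.shape)
```

Only rows whose outcomes have been verified belong in y_new; training on the model's own unverified predictions would just reinforce its mistakes.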
Review
Step Review
1. Select a predictive modeling technique to use
2. Get the data into a format for modeling
3. Ensure the “right” data is being used
4. Feed the data into the model
5. Validate the model results
Step Review
6. Save the model to use in production
7. Implement the model in production and apply it to
new observations
8. Save the new predictions
9. Ensure the model is correctly predicting outcomes
over time
10. Update the model with new training data