Python For Data Science Cheat Sheet: Scikit-Learn
Learn Python for data science interactively at www.DataCamp.com

Scikit-learn
Scikit-learn is an open-source Python library that implements a range of machine learning, preprocessing, cross-validation, and visualization algorithms through a unified interface.

Loading The Data (also see the NumPy & Pandas cheat sheets)
>>> import numpy as np
>>> X = np.random.random((10,5))                                # 10 samples, 5 features
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])    # one label per sample
>>> X[X < 0.7] = 0                                              # zero out small entries
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as pandas DataFrames, are also acceptable.
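For instance, a pandas DataFrame can be passed to an estimator directly; a minimal sketch (the column names and classifier choice here are illustrative, not part of the original sheet):

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = np.random.random((10, 5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])   # hypothetical feature names
>>> knn = KNeighborsClassifier(n_neighbors=3).fit(df, y)            # DataFrame is converted internally
>>> knn.predict(df.iloc[:2])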
The typical workflow, covered in the sections below: create your model, fit it to the data, predict, evaluate your model's performance, and tune your model (via grid search or randomized parameter optimization).
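Not part of the original cheat sheet, but the same workflow can be expressed in a single object with a Pipeline; a minimal sketch on the Iris data:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
>>> pipe.fit(X_train, y_train).score(X_test, y_test)   # preprocessing and model fit together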
Create Your Model

Supervised Learning Estimators

Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()   # the normalize= option was removed in scikit-learn 1.2; scale features separately
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
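All of these estimators expose the same fit / predict / score methods, so they can be swapped freely; a small sketch (the dataset and split here are illustrative):

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)
>>> for est in (SVC(kernel='linear'), GaussianNB(), KNeighborsClassifier(n_neighbors=5)):
...     est.fit(X_train, y_train)                        # identical API for every estimator
...     print(type(est).__name__, round(est.score(X_test, y_test), 3))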
Model Fitting

Supervised learning
>>> lr.fit(X, y)                              # Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)

Unsupervised Learning
>>> k_means.fit(X_train)                      # Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)    # Fit to data, then transform it
Evaluate Your Model's Performance

Classification Metrics

Accuracy Score
>>> knn.score(X_test, y_test)                          # Estimator score method
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)                     # Metric scoring function

Classification Report
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))       # Precision, recall, F1-score and support

Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

Cross-Validation
>>> from sklearn.model_selection import cross_val_score   # sklearn.cross_validation was removed in 0.20
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
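A tiny worked example of the metric functions on hand-made labels (the label vectors below are illustrative only):

>>> from sklearn.metrics import accuracy_score, confusion_matrix
>>> y_test = [0, 1, 1, 0, 1]
>>> y_pred = [0, 1, 0, 0, 1]
>>> accuracy_score(y_test, y_pred)
0.8
>>> confusion_matrix(y_test, y_pred)        # rows = true class, columns = predicted class
array([[2, 0],
       [1, 2]])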
Tune Your Model

Grid Search
>>> from sklearn.model_selection import GridSearchCV   # sklearn.grid_search was removed in 0.20
>>> params = {"n_neighbors": np.arange(1, 3),
              "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
                        param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1, 5),
              "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
                                 param_distributions=params,
                                 cv=4,
                                 n_iter=8,
                                 random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
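Both searches expose the same fitted attributes (refitting on the best parameters is the default), so the snippet above can continue, for example:

>>> print(grid.best_params_)                           # best parameter combination found
>>> y_pred = rsearch.best_estimator_.predict(X_test)   # the refit best model predicts like any estimator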
A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split   # formerly sklearn.cross_validation
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
Unsupervised Learning Estimators

Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)

K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)
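After fitting, both unsupervised estimators expose useful attributes; a brief, self-contained sketch on the Iris data (the attribute names are standard scikit-learn, the data choice is illustrative):

>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> X, _ = load_iris(return_X_y=True)
>>> pca = PCA(n_components=0.95).fit(X)
>>> pca.explained_variance_ratio_                                        # variance explained by each kept component
>>> KMeans(n_clusters=3, random_state=0, n_init=10).fit(X).labels_[:10]  # cluster label per sample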
Training And Test Data
>>> from sklearn.model_selection import train_test_split   # formerly sklearn.cross_validation
>>> X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        random_state=0)

Preprocessing The Data

Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
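Note the difference between the two: StandardScaler works per feature (column), Normalizer per sample (row). A quick check, assuming only the classes shown above:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler, Normalizer
>>> X = np.array([[1., 10.], [2., 20.], [3., 30.]])
>>> StandardScaler().fit_transform(X).mean(axis=0)          # every column centred on 0
array([0., 0.])
>>> np.linalg.norm(Normalizer().fit_transform(X), axis=1)   # every row scaled to unit length
array([1., 1., 1.])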
Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)
Prediction

Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))   # Predict labels
>>> y_pred = lr.predict(X_test)                     # Predict labels
>>> y_pred = knn.predict_proba(X_test)              # Estimate probability of a label

Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)                # Predict labels in clustering algorithms

Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)

Imputing Missing Values
>>> from sklearn.impute import SimpleImputer        # sklearn.preprocessing.Imputer was removed; use SimpleImputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)
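In current scikit-learn the default missing-value marker is np.nan rather than 0; a minimal sketch with SimpleImputer:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X = np.array([[1., 2.], [np.nan, 3.], [7., 6.]])
>>> SimpleImputer(strategy='mean').fit_transform(X)   # NaN replaced by the column mean
array([[1., 2.],
       [4., 3.],
       [7., 6.]])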
Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(degree=5)
>>> poly.fit_transform(X)
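For a concrete sense of the output, a degree-2 expansion of two features yields the bias term, the original features, and all products up to degree 2:

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> PolynomialFeatures(degree=2).fit_transform(np.array([[2., 3.]]))   # [1, x1, x2, x1^2, x1*x2, x2^2]
array([[1., 2., 3., 4., 6., 9.]])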
Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> y_pred = [2.5, 0.0, 2]          # example predictions so the call below runs
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
