Introduction to Machine
Learning with Scikit-Learn
Types of Algorithms by Output
Training data is used to fit a model, which is then used to predict outputs for new, incoming inputs:
Type of Output → Algorithm Category
Output is one or more discrete classes → Classification (supervised)
Output is continuous → Regression (supervised)
Output is membership in a similar group → Clustering (unsupervised)
Output is the distribution of inputs → Density Estimation
Output is simplified from higher dimensions → Dimensionality Reduction
Classification
Given labeled input data (with two or more labels), fit a
function that can determine the label for any input.
Regression
Given input data with a continuous target value, fit a
function that can predict that value for new inputs.
Clustering
Given unlabeled data, find groups of associated data points
(clusters) based on their similarity or distance from one another.
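A minimal sketch contrasting the three task types in scikit-learn; the dataset and estimator choices here are illustrative, not prescribed by the slides:

from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = KNeighborsClassifier().fit(X_iris, y_iris)   # classification: discrete labels
print(clf.predict(X_iris[:3]))

X_diab, y_diab = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_diab, y_diab)       # regression: continuous target
print(reg.predict(X_diab[:3]))

km = KMeans(n_clusters=3, n_init=10).fit(X_iris)   # clustering: no labels used
print(km.labels_[:3])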
Hadley Wickham (2015)
“Model” is an overloaded term.
•Model family describes, at the broadest possible level, the
connection between the variables of interest.
•Model form specifies exactly how the variables of interest
are connected within the framework of the model family.
•A fitted model is a concrete instance of the
model form where all parameters have been
estimated from data, and the model can be
used to generate predictions.
http://had.co.nz/stat645/model-vis.pdf
Dimensions and Features
In order to do machine learning you need a data set containing
instances (examples) that are composed of features, which in
turn make up dimensions.
Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data    # X.shape == (n_samples, n_features)
y = digits.target # y.shape == (n_samples,)
Feature Space
Feature space refers to the n dimensions in which your variables live (not
including the target variable or class). The term is used often in the ML
literature because in ML (usually) all variables are features, and feature
extraction is the art of creating a space with useful decision boundaries.
Target
1. Y ≡ Thickness of car tires after some testing period
Variables
1. X1 ≡ distance travelled in test
2. X2 ≡ time duration of test
3. X3 ≡ amount of chemical C in tires
The feature space is R3, or more accurately the positive octant of R3, as all
the X variables can only be positive quantities.
http://stats.stackexchange.com/questions/46425/what-is-feature-space
Mappings
Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, X4 (this is the
feature extraction part):
X4 = X1/X2 ≡ the speed of the vehicle during testing (distance divided by time).
This extends our old feature space into a new one, the positive part of R4.
A mapping is a function, ϕ, from R3 to R4:
ϕ(x1, x2, x3) = (x1, x2, x3, x1/x2)
http://stats.stackexchange.com/questions/46425/what-is-feature-space
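As a concrete sketch of the mapping ϕ, assuming the three measured variables are the columns of a NumPy array (the instance values below are made up):

import numpy as np

def phi(X):
    """Map (x1, x2, x3) to (x1, x2, x3, x1/x2), appending the derived
    speed feature (distance divided by time) to each instance."""
    return np.column_stack([X, X[:, 0] / X[:, 1]])

# Two hypothetical instances: distance travelled, duration, chemical C.
X = np.array([[100.0, 2.0, 0.5],
              [250.0, 5.0, 0.3]])
X_new = phi(X)   # shape (2, 4): the mapped feature space is part of R4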
Your Task
Given a data set of N instances, create a model that is
fit from (built on) the data by extracting features and
dimensions. Then use that model to predict outcomes
(a pipeline sketch follows the steps below):
1. Data Wrangling (normalization, standardization, imputing)
2. Feature Analysis/Extraction
3. Model Selection/Building
4. Model Evaluation
5. Operationalize Model
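One way to express steps 1-4 is a scikit-learn Pipeline; the specific scaler, selector, and model below are illustrative stand-ins for each step:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = Pipeline([
    ("wrangle", StandardScaler()),                  # 1. normalization
    ("features", SelectKBest(f_classif, k=32)),     # 2. feature analysis
    ("model", LogisticRegression(max_iter=1000)),   # 3. model selection
])
scores = cross_val_score(model, X, y, cv=5)         # 4. model evaluation
print(scores.mean())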
A Tour of Machine Learning
Algorithms
Models: Instance Methods
Compare instances in the data set with a similarity
measure to find the best matches.
- Suffers from the curse of dimensionality.
- Focus is on feature representation and
similarity metrics between instances.
● k-Nearest Neighbors (kNN)
● Self-Organizing Maps (SOM)
● Learning Vector Quantization (LVQ)
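A minimal kNN sketch; scikit-learn does not ship SOM or LVQ, so k-Nearest Neighbors stands in for the instance-method family here, and k and the metric are illustrative:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predictions are a similarity vote over the k closest training instances.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))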
Self-Organizing Maps
Models: Regression
Model the relationship of independent variables X
to a dependent variable y by iteratively
minimizing the error made in predictions.
● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
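An ordinary least squares sketch; MARS and LOESS are not in scikit-learn's core and would come from third-party packages, so OLS stands in for the family:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
ols = LinearRegression().fit(X, y)   # minimizes squared prediction error
print(mean_squared_error(y, ols.predict(X)))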
Logistic Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)
• Combines multiple regression models in a k-nearest-neighbor-based meta-model
• Fits a low-degree polynomial to a subset of the data close to the current point
• Requires fairly large, densely sampled data sets in order to produce good models
Models: Regularization Methods
Extend another method (usually regression) by
penalizing model complexity to minimize overfitting.
- simple, popular, powerful
- better at generalization
● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
Models: Regularization Methods
LASSO
• Penalizes the total absolute weight of the parameters (L1 penalty)
• Can be interpreted as a Laplace prior distribution on the parameters
• Ridge regression: quadratic (L2) penalty, a Gaussian prior
• Elastic Net combines both penalties
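A sketch comparing the three penalties on the same data; the alpha and l1_ratio values are illustrative, not tuned:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)
for model in (Ridge(alpha=1.0),                      # L2: quadratic penalty
              Lasso(alpha=0.1),                      # L1: drives some weights to zero
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # mix of L1 and L2
    model.fit(X, y)
    n_zero = (model.coef_ == 0).sum()
    print(type(model).__name__, model.score(X, y), n_zero)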
Models: Decision Trees
Model of decisions based on data attributes.
Predictions are made by following forks in a
tree structure until a decision is made. Used for
classification & regression.
● Classification and Regression Tree (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
Models: Decision Trees
http://www.saedsayad.com/decision_tree.htm
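A CART sketch via scikit-learn's DecisionTreeClassifier; max_depth is an illustrative choice to keep the printed tree readable:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # the forks that predictions follow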
Models: Bayesian
Explicitly apply Bayes’ Theorem for
classification and regression tasks, usually by
fitting a probability function constructed via the
chain rule and a naive conditional-independence
simplification.
● Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
Naive Bayes
- Used for text retrieval since the 1960s
- Assumes independence of feature values (given the class)
- Applies Bayes’ theorem
Probability distribution:
P(y | x1, …, xn) ∝ P(y) · ∏i P(xi | y)
- Predicts the class with maximum posterior probability (MAP)
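A Gaussian Naive Bayes sketch (one of several variants in scikit-learn); predict_proba exposes the posterior from which the MAP class is chosen:

from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
nb = GaussianNB().fit(X, y)
posterior = nb.predict_proba(X[:1])   # P(y | x) for each class
print(posterior.argmax())             # MAP prediction
print(nb.predict(X[:1]))              # same class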
Models: Kernel Methods
Map input data into a higher-dimensional vector
space where the problem is easier to model.
Named after the “kernel trick”, which computes
the inner product of the images of pairs of data
points without constructing the mapping explicitly.
● Support Vector Machines (SVM)
● Radial Basis Function (RBF)
● Linear Discriminant Analysis (LDA)
SVM
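A minimal SVC usage sketch with the RBF kernel; the gamma and C values below are illustrative:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel computes inner products in the implicit high-dimensional
# space without ever constructing it (the "kernel trick").
svc = SVC(kernel="rbf", gamma=0.001, C=10.0).fit(X_train, y_train)
print(svc.score(X_test, y_test))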
Models: Clustering Methods
Organize data into groups whose members
share maximum similarity (usually defined by a
distance metric). Two main approaches:
centroid-based and hierarchical clustering.
● k-Means
● Affinity Propagation
● OPTICS (Ordering Points To Identify the Clustering Structure)
● Agglomerative Clustering
K-means clustering
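A k-Means sketch; k=10 matches the ten digit classes, an assumption the algorithm itself cannot verify:

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)   # (10, 64): one centroid per cluster
print(km.labels_[:10])             # cluster membership, not class labels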
Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are
nonlinear function approximators that estimate
functions with a large number of inputs.
- Systems of interconnected neurons that activate
- Deep learning extends simple networks with many stacked layers
● Restricted Boltzmann Machine (RBM)
● Convolutional Neural Networks (CNN)
● Recurrent Neural Networks (RNN)
● Word2Vec models
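scikit-learn covers only simple feed-forward networks (and RBMs); CNNs, RNNs, and Word2Vec need dedicated frameworks. A sketch with MLPClassifier, with illustrative layer size and iteration count:

from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)   # networks train better on scaled inputs

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X, y)   # nonlinear function approximation via backpropagation
print(mlp.score(X, y))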
Models: Ensembles
Models composed of multiple weaker models that
are trained (independently or sequentially) and whose
outputs are combined to make an overall prediction.
● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
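Bagging versus boosting in one sketch; the n_estimators values are illustrative:

from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)   # bagged trees
ada = AdaBoostClassifier(n_estimators=100, random_state=0)      # sequential boosting
for model in (rf, ada):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())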
AdaBoost
Models: Other
The list above is not comprehensive; other
algorithm and model classes include:
● Conditional Random Fields (CRF)
● Markov Models (HMMs)
● Dimensionality Reduction (PCA, PLS)
● Rule Learning (Apriori, Brill)
● More ...
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are
called SciKits. SciKit-Learn provides machine
learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and NumPy
● Standard Python API interface
● Sits on top of C libraries: LAPACK, LibSVM, and Cython
● Open Source: BSD License (included in many Linux distributions)
Probably the best general ML framework out there.
Primary Features
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading
A Guide to Scikit-Learn
Scikit-Learn API
Object-oriented interface centered around the
concept of an Estimator:
“An estimator is any object that learns from data; it may
be a classification, regression or clustering algorithm or
a transformer that extracts/filters useful features from
raw data.”
- Scikit-Learn Tutorial
class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response for ``X``."""
        # compute predictions ``pred``
        return pred
The Scikit-Learn Estimator API
Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape
(n_samples, n_features).
- y is a 1D array of shape (n_samples,).
- predict(X) returns the class or value.
- predict_proba() returns a 2D array of
shape (n_samples, n_classes).
from sklearn import svm
estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)
estimator.predict(x)
Basic methodology
Wrapping fit and predict
We’ve already discussed a broad workflow; the
following is a development workflow:
Raw Data → Load & Transform Data → Feature Extraction → Feature Evaluation → Build Model → Evaluate Model
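The same development workflow as a minimal end-to-end sketch on the digits data; the estimator and its parameters are illustrative:

from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                         # raw data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)                      # load & transform
model = SVC(gamma=0.001)                                    # build model
model.fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))            # evaluate model
print(classification_report(y_test, y_pred))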
Task 6
- Select a dataset (wines / student performance)
- Apply various learning algorithms to the problem
- Provide the best possible prediction w.r.t. RMSE
Best bets to start with:
- Student performance: random forest / boosted trees
- Wines: MARS / LASSO / bagged linear models
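A hedged starting sketch for the wines variant; the CSV file name, separator, and target column are assumptions about the local copy of the dataset:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical local copy of the wine quality data.
df = pd.read_csv("winequality-red.csv", sep=";")
X = df.drop(columns=["quality"]).values
y = df["quality"].values

rf = RandomForestRegressor(n_estimators=200, random_state=0)
mse = -cross_val_score(rf, X, y, cv=10, scoring="neg_mean_squared_error")
print("RMSE:", np.sqrt(mse.mean()))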
Semester Project – Variant 2
• Drug-target interaction prediction
– Prediction of interactions between drug compounds and proteins
– DTInet dataset
• https://github.com/luoyunan/DTINet
– Known interactions (binary)
– Structural similarity of both drugs and targets, in [0, 1]
– Mappings to diseases and side effects
– Evaluation:
• The goal is ranking prediction (ordering objects from best to worst)
• 10-fold cross-validation (10% of the interactions are hidden at random; your task is to
rank them as well as possible; the source code will be part of the assignment)
• Area Under ROC Curve, Area Under Precision-Recall Curve
– The submitted solution must beat the baseline: BLM
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735674/ (source code will be
available)
– Typical algorithms:
• Factorization of the interaction matrix (possibly enriched with external data)
• Nearest neighbors and graph algorithms
• Local models predicting edges from properties of the interacting drug and target
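A minimal sketch of the two evaluation metrics, assuming y_true holds the hidden interaction labels and y_score your model's ranking scores (the toy values below are illustrative):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy stand-ins: hidden interaction labels and the model's ranking scores.
y_true = np.array([1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.4, 0.2, 0.7, 0.5, 0.6])

print("AUROC:", roc_auc_score(y_true, y_score))            # ranking quality
print("AUPR:", average_precision_score(y_true, y_score))   # precision-recall area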