Mohamed Maher - University of Tartu - 2019 - mohamed.abdelrahman@ut.ee
Methods for Meta-Learning in AutoML
Learning how to Learn
Motivational Example
● Alex starts a Maths course with 10 tests, for the first time in his life. (Problem)
● Alex wants to get grade A in most of the course tests. (Target)
● Alex thought that attending all the lectures would easily get him grade A, as it always does in his history courses. (Approach 1)
Motivational Example
● Alex got grade D in his first test. (Result 1)
● Alex decided to switch to reading the reference book instead of only attending lectures. (Approach 2)
● Alex got grade C in his second test. (Result 2)
Motivational Example
● After that, Alex decided to switch to solving practice problems instead. (Approach 3)
● Alex got grade B in his third test. (Result 3)
● So, Alex decided to summarize each lesson and teach it to his colleagues too. (Approach 4)
● Alex got grade A in his fourth test. (Result 4)
Motivational Example
● Now, the question is: how will Alex study for his 5th test in the course?
Alex has already learnt how to learn.
Back to Machine Learning
Motivation
[Figure] Typical Supervised Machine Learning Pipeline:
Real-World Data → Data Collection → 1. Data Preprocessing → 2. Feature Extraction → 3. Feature Selection (Feature Engineering) → 4. Algorithm Selection → 5. Parameter Tuning (Model Building) → Prediction → Deployment
Motivation: Model Building (4. Algorithm Selection → 5. Parameter Tuning)
Examples:
- Linear Classification: (Simple Linear Classification, Ridge, Lasso, Simple Perceptron, …)
- Support Vector Machines
- Decision Trees (ID3, C4.5, C5.0, CART, …)
- Nearest Neighbors
- Gaussian Processes
- Naive Bayes (Gaussian, Bernoulli, Complement, …)
- Ensembles: (Random Forest, GBM, AdaBoost, …)
Motivation: Model Building
Example: Support Vector Machine hyperparameters:
- Kernel: Linear, RBF, Polynomial, …
- Gamma: [2^-15, 2^3]
- Degree: 2, 3, …
- C (Penalty): [2^-5, 2^15]
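To make the size of such a search space concrete, here is a minimal plain-Python sketch that materializes the SVM grid above, with the continuous gamma and C ranges discretized on a log2 scale (the discretization step is an assumption; the bounds come from the slide):

```python
from itertools import product

# The SVM search space from the slide, discretized on a log2 grid.
search_space = {
    "kernel": ["linear", "rbf", "poly"],
    "gamma": [2.0 ** e for e in range(-15, 4)],   # [2^-15, 2^3]
    "C": [2.0 ** e for e in range(-5, 16)],       # [2^-5, 2^15]
    "degree": [2, 3],                              # only used by 'poly'
}

def enumerate_configurations(space):
    """Yield every hyperparameter combination as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_configurations(search_space))
print(len(configs))  # 3 kernels * 19 gammas * 21 Cs * 2 degrees = 2394
```

Even this coarse grid for a single algorithm yields thousands of configurations, which is why automating algorithm selection and tuning matters.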
Motivation: Dimensionality Reduction (2. Feature Extraction, 3. Feature Selection)
Examples of Feature Extraction:
1. Principal Component Analysis
2. Linear Discriminant Analysis
3. Multiple Discriminant Analysis
4. Independent Component Analysis
Examples of Univariate Feature Selection:
1. Information Gain
2. Fisher Score
3. Correlation with Target
Examples of Multivariate Feature Selection:
1. Relief
2. Correlation-based Feature Selection
3. Branch and Bound
4. Sequential Forward Selection
5. Plus-L Minus-R
Motivation: Data Preprocessing (1. Data Preprocessing)
Examples of Data Preprocessors:
1. Scaling
2. Normalization
3. Standardization
4. Binarization
5. Imputation
6. Deletion
7. One-Hot Encoding
8. Hashing
9. Discretization
Solution: Meta-Learning
1. The science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience.
2. It also allows hand-written rules and algorithms to be replaced with novel, data-driven approaches.
HOW? Collect Meta-Data
1. Model Configurations:
- Pipeline Composition: (Normalization → PCA → SVM)
- Hyperparameter Settings: (PCA = 2 components, SVM: gamma = 1e-9, C = 1e2)
- Network Architectures: (2 hidden layers, 100 neurons per layer)
2. Resulting Model Evaluations:
- Different metrics: accuracy, error rate, F1-score.
- Training time.
3. The Task Itself (Meta-Features):
- A description of the data
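As a sketch, one collected meta-data entry might combine the three kinds of information above into a single record; the schema and values below are purely illustrative, not a standard format:

```python
# A hypothetical meta-data record (field names and values are
# illustrative only, not a standard schema).
metadata_record = {
    "configuration": {
        "pipeline": ["Normalization", "PCA", "SVM"],
        "hyperparameters": {"pca__n_components": 2,
                            "svm__gamma": 1e-9, "svm__C": 1e2},
    },
    "evaluation": {"accuracy": 0.93, "f1_score": 0.91,
                   "training_time_s": 4.2},
    "task_meta_features": {"n_instances": 1500, "n_features": 20,
                           "n_classes": 3},
}
print(sorted(metadata_record))
```

A knowledge base of many such records, over many tasks, is what the meta-learner then learns from.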
HOW? Use Meta-Data
1. Knowledge Transfer.
Use the same model as an initial point and start tuning from there.
2. Guided Search.
If classifier X is worse than classifier Y by 10%, then there is no need to tune classifier X.
HOW ? Use Meta-Data
Remember that Alex starts with the same approach that
succeeds in History courses.
Meta-Learning won’t be effective and may affect performance
badly in case of:
- Tasks with random noise, and unrelated phenomena.
“Tasks that are Never Seen Before”
Meta-Learning Methodologies:
1. Learning from Task Properties.
2. Learning from Model Evaluations.
3. Learning from Prior Models.
1-Learning from Task Properties:
● Represent each task as a meta-feature vector.
● Studies show that the optimal set of meta-features depends on the application type. [2]
● Different studies used various feature selection and extraction techniques to reduce the set of meta-features. [2][3]
1-Learning from Task Properties:
● What are Task Properties? = Types of Meta-features:
1. Simple
2. Statistical
3. Information Theoretic
4. Complexity
5. Landmarkers
Meta-Features Types: (Simple)
● Examples:
1. Number of Instances
2. Number of Features
3. Number of Classes
4. Number of Missing Values
5. Number of Outliers
Meta-Features Types: (Statistical)
● Examples:
1. Skewness of numerical features.
2. Kurtosis of numerical features.
3. Correlation / covariance between features.
4. Variance explained by the first principal component.
5. Skewness and kurtosis of the first principal component.
6. Class probability distribution.
7. Concentration, sparsity, gravity of features
(measurements of the independence and dispersion of values).
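A minimal plain-Python sketch of computing a few of the simple and statistical meta-features above (the function and field names are my own, not a standard library API):

```python
import math

def skewness(xs):
    """Sample skewness: E[(x - mean)^3] / std^3."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / (n * std ** 3)

def kurtosis(xs):
    """Sample kurtosis: E[(x - mean)^4] / var^2 (3.0 for a normal)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 4 for x in xs) / (n * var ** 2)

def simple_and_statistical_meta_features(rows, labels):
    """rows: list of numeric feature vectors; labels: class labels."""
    n, d = len(rows), len(rows[0])
    cols = [[r[j] for r in rows] for j in range(d)]
    counts = {c: labels.count(c) for c in set(labels)}
    return {
        "n_instances": n,
        "n_features": d,
        "n_classes": len(counts),
        "class_probabilities": {c: k / n for c, k in counts.items()},
        "feature_skewness": [skewness(col) for col in cols],
        "feature_kurtosis": [kurtosis(col) for col in cols],
    }

rows = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 50.0]]
labels = ["a", "a", "b", "b"]
mf = simple_and_statistical_meta_features(rows, labels)
print(mf["n_instances"], mf["n_features"], mf["n_classes"])
```

Such a vector, computed once per dataset, is what "represent the task as a meta-feature vector" means in practice.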
Meta-Features Types: (Information theoretic)
● Examples:
1. Class entropy.
2. Mutual information between a feature and the class.
3. Equivalent number of features (class entropy divided by the mean mutual information).
4. Noise-to-signal ratio.
Meta-Features Types: (Task Complexity)
● Examples:
1. Fisher discriminant (measures the separability between classes).
Meta-Features Types: (Landmarkers)
● Examples:
1. Landmarker: 1-NN.
2. Landmarker: Decision Tree.
3. Landmarker: Naive Bayes.
4. Landmarker: Linear Discriminant Analysis.
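A landmarker is simply the score of a fast, simple learner, used as a meta-feature describing the task. A minimal plain-Python sketch of the 1-NN landmarker on a toy task (the data is illustrative):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_nn_accuracy(train_X, train_y, test_X, test_y):
    """1-nearest-neighbour accuracy: a cheap 'landmark' meta-feature."""
    correct = 0
    for x, y in zip(test_X, test_y):
        nearest = min(range(len(train_X)),
                      key=lambda i: euclidean(train_X[i], x))
        correct += (train_y[nearest] == y)
    return correct / len(test_y)

# Toy task: two well-separated clusters, so the landmarker scores high,
# signalling an "easy" task for instance-based learners.
train_X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
train_y = ["a", "a", "b", "b"]
test_X = [[0.2, 0.1], [4.8, 5.2]]
test_y = ["a", "b"]
print(one_nn_accuracy(train_X, train_y, test_X, test_y))  # 1.0
```

The decision tree, Naive Bayes, and LDA landmarkers work the same way, each probing a different aspect of task difficulty.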
How to use Meta-Features?
● Different (unsupervised) similarity measurements, and warm-starting optimization from similar tasks, to recommend candidate configurations.
Examples:
1. Ranking of different configurations:
- Tasks A and B are twin tasks.
- SVM and KNN are the best for Task A.
- Then SVM and KNN are likely the best for Task B as well.
How to use Meta-Features?
● Different (unsupervised) similarity measurements, and warm-starting optimization from similar tasks, to recommend candidate configurations.
Examples:
2. Collaborative Filtering:
Use the results of a few configurations on Task A to predict the results of all other configurations, based on the configuration results on a similar Task B.
The knowledge base needs nearly complete configuration results in order to be updated.
How to use Meta-Features?
● Learning high-level meta-features from low-level ones (low-level features → high-level features).
Needs a big knowledge base.
How to use Meta-Features?
● Meta-Models (supervised): learn the complex relationship between meta-features and useful configurations in this large space.
Example:
- Ranking the top N promising configurations:
the literature suggests boosting and bagging models [4][5], together with Approximate Ranking Tree Forests [6] (automatic meta-feature selection based on some initial results).
How to use Meta-Features?
● Pipeline Synthesis:
1. A meta-model to predict which preprocessor will improve the performance of a specific classifier on that particular task. [7][8]
2. Reinforcement learning to construct the pipeline by addition, deletion, and replacement of pipeline blocks. [9] (AlphaD3M, an evolutionary approach)
How to use Meta-Features?
● To Tune or Not to Tune:
Meta-models to predict:
1. how much improvement we can expect from tuning this particular classifier on that particular task [10];
2. how much improvement versus the additional time investment [11].
2-Learning from Model Evaluations:
● Use current configuration evaluations as a prior for suggesting the next candidate outperforming configuration, in an iterative way.
Example:
1. Evaluate configuration Px on Task 1.
2. Suggest new configurations P.
3. Select the most promising candidate P.
4. Set Px = P.
5. Go to 1.
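The five steps above can be sketched as a simple hill-climbing loop over a single hyperparameter; the objective here is synthetic and stands in for actual training plus evaluation:

```python
import math

def evaluate(config):
    """Stand-in for training + scoring a model: a synthetic
    objective peaked at C = 1.0 (i.e. log2 C = 0)."""
    return math.exp(-(math.log2(config["C"])) ** 2 / 8)

def suggest_neighbours(config):
    """Step 2: propose new candidate configurations near the current one."""
    return [{"C": config["C"] * 2.0}, {"C": config["C"] / 2.0}]

# Steps 1-5 from the slide as a hill-climbing loop.
current = {"C": 2.0 ** 10}
for _ in range(20):
    candidates = [current] + suggest_neighbours(current)
    current = max(candidates, key=evaluate)  # step 3: keep the best
print(current["C"])  # converges to C = 1.0
```

Real methods (e.g. Bayesian optimization) replace the naive neighbour proposal with a model of past evaluations, but the iterate-evaluate-select skeleton is the same.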
How is it used?
● Task-Independent Recommendation:
1. Discretize the search space into a set of configurations.
2. Apply them over many datasets.
3. Aggregate the single-task rankings into a global ranking.
● Example: the scikit-learn cheat-sheet algorithm.
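Steps 1-3 can be sketched as rank aggregation by average rank; the algorithm names and per-dataset rankings below are hypothetical:

```python
def global_ranking(per_task_rankings):
    """Aggregate single-task rankings into one task-independent
    ranking by average rank position (lower is better)."""
    positions = {}
    for ranking in per_task_rankings:
        for position, algo in enumerate(ranking):
            positions.setdefault(algo, []).append(position)
    return sorted(positions,
                  key=lambda a: sum(positions[a]) / len(positions[a]))

# Hypothetical rankings of three algorithms on three datasets.
rankings = [
    ["SVM", "RF", "KNN"],
    ["RF", "SVM", "KNN"],
    ["SVM", "KNN", "RF"],
]
print(global_ranking(rankings))  # ['SVM', 'RF', 'KNN']
```

The resulting global ranking is task-independent: it suggests a default trial order before anything is known about the new dataset.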
How is it used?
● Search Space Design:
1. Learn hyperparameter default values (the best configuration across all tasks).
2. Learn the importance of different hyperparameters:
- measure the variance of the algorithm's performance while keeping all hyperparameters fixed and changing only one.
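The variance-based importance measure above can be sketched as follows; the performance function is synthetic and stands in for real training (all names and values are illustrative):

```python
import statistics

def evaluate(gamma, C):
    """Stand-in for model performance: here gamma matters a lot
    and C barely matters (synthetic, for illustration only)."""
    return 1.0 / (1.0 + abs(gamma - 0.1) * 10) + 0.01 * C

def importance(param_values, fixed, which):
    """Variance of performance when only `which` is varied."""
    scores = []
    for v in param_values:
        args = dict(fixed)
        args[which] = v
        scores.append(evaluate(**args))
    return statistics.pvariance(scores)

gammas = [0.001, 0.01, 0.1, 1.0, 10.0]
Cs = [0.5, 1.0, 2.0, 4.0, 8.0]
imp_gamma = importance(gammas, {"C": 1.0}, "gamma")
imp_C = importance(Cs, {"gamma": 0.1}, "C")
print(imp_gamma > imp_C)  # True: gamma is the more important knob
```

High-variance hyperparameters deserve most of the tuning budget; low-variance ones can be fixed at their learned defaults.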
How is it used?
● Learning Curves (example): 1. Apply SVM over 100 training datasets.
How is it used?
● Learning Curves (example): 2. Apply SVM over the new dataset.
How is it used?
● Learning Curves (example): 3. Measure the similarity between the training curves and the testing curve.
How is it used?
● Configuration Transfer:
- Surrogate models: usually combined with Bayesian optimization (e.g. Gaussian processes, or the SMAC algorithm).
- We can also define task similarity based on the similarity of the learning distributions between tasks.
How is it used?
● Configuration Transfer:
- We can also define task similarity based on how accurately a surrogate model trained on a prior task predicts evaluations on the new task.
How is it used?
● Configuration Transfer:
- Multi-armed bandits:
1. Start with a small portion of the data and apply multiple configurations to these small portions.
2. Drop the lowest-performing configurations and increase the portion size for the remaining ones.
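The two steps above are essentially successive halving; a minimal sketch with synthetic scores, where the `evaluate` callback stands in for training on a data portion of the given size:

```python
def successive_halving(configs, evaluate, budget=1, rounds=3):
    """Bandit-style search: evaluate all configs on a small budget,
    keep the better half, double the budget, and repeat.
    `evaluate(config, budget)` returns a score on that data portion."""
    survivors = list(configs)
    while len(survivors) > 1 and rounds > 0:
        scored = sorted(survivors,
                        key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]
        budget *= 2
        rounds -= 1
    return survivors[0]

# Synthetic setup: a config's quality is just its value, and the
# budget argument is ignored (real evaluations would use it).
configs = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4, 0.8, 0.3]
best = successive_halving(configs, evaluate=lambda c, b: c)
print(best)  # 0.9
```

The appeal is that bad configurations are discarded after only a cheap, small-data evaluation, so the full budget is spent on promising ones.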
3-Learning from Prior Models:
● Take already-trained models (a model hub) and use them for similar tasks.
● Suitable for only a few classifier families (e.g. kernel classifiers, Bayesian networks) BUT very good with neural networks. WHY?
Both the structure and the network parameters can be a good initialization for the target model.
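A minimal sketch of why prior parameters make a good initialization, using plain-Python gradient descent on two similar linear tasks (the tasks and numbers are illustrative, and the linear model stands in for a neural network):

```python
def gradient_descent(X, y, w0, steps=100, lr=0.1):
    """Plain least-squares gradient descent starting from weights w0."""
    w = list(w0)
    n = len(X)
    for _ in range(steps):
        grads = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grads[j] += 2 * err * xj / n
        w = [wj - lr * g for wj, g in zip(w, grads)]
    return w

def mse(w, X, y):
    return sum((sum(wj * xj for wj, xj in zip(w, xi)) - yi) ** 2
               for xi, yi in zip(X, y)) / len(X)

# "Model hub": a model already trained on a prior, similar task
# (y = 2*x + 1), with a bias feature appended to each input.
X_prior = [[x, 1.0] for x in [0.0, 1.0, 2.0, 3.0]]
w_prior = gradient_descent(X_prior, [1.0, 3.0, 5.0, 7.0], [0.0, 0.0])

# New target task (y = 2.2*x + 0.9): the prior parameters are already
# a far better starting point than a from-scratch initialization, and
# a few fine-tuning steps improve them further.
X_new = [[x, 1.0] for x in [0.0, 1.0, 2.0, 3.0]]
y_new = [0.9, 3.1, 5.3, 7.5]
w_tuned = gradient_descent(X_new, y_new, w_prior, steps=20)

print(mse([0.0, 0.0], X_new, y_new) > mse(w_prior, X_new, y_new))  # True
print(mse(w_prior, X_new, y_new) > mse(w_tuned, X_new, y_new))     # True
```

With neural networks the same idea applies at scale: the pretrained architecture and weights start fine-tuning near a good solution instead of from random initialization.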
References:
[1] Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning: Methods, Systems, Challenges. Springer (2019)
[2] Bilalli, B., Abelló, A., Aluja-Banet, T.: On the predictive power of metafeatures in OpenML. International Journal of Applied Mathematics and Computer Science 27(4), 697–712 (2017)
[3] Todorovski, L., Brazdil, P., Soares, C.: Report on the experiments with feature selection in meta-level learning.
PKDD 2000 Workshop on Data mining, Decision support, Meta-learning and ILP pp. 27–39 (2000)
[4] Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J.: autoBagging: Learning to rank bagging workflows with
metalearning. arXiv 1706.09367 (2017)
[5] Lorena, A.C., Maciel, A.I., de Miranda, P.B.C., Costa, I.G., Prudêncio, R.B.C.: Data complexity meta-features for regression problems. Machine Learning 107(1), 209–246 (2018)
[6] Sun, Q., Pfahringer, B.: Pairwise meta-rules for better meta-learning based algorithm ranking. Machine Learning
93(1), 141–161 (2013)
[7] Bilalli, B., Abelló, A., Aluja-Banet, T., Wrembel, R.: Intelligent assistance for data pre-processing. Computer Standards & Interfaces 57, 101–109 (2018)
[8] Schoenfeld, B., Giraud-Carrier, C., Poggeman, M., Christensen, J., Seppi, K.: Feature selection for high-dimensional
data: A fast correlation-based filter solution. In: AutoML Workshop at ICML (2018)
[9] Drori, I., Krishnamurthy, Y., Rampin, R., de Paula Lourenco, R., Ono, J.P., Cho, K., Silva, C., Freire, J.: AlphaD3M:
Machine learning pipeline synthesis. In: AutoML Workshop at ICML (2018)
[10] Ridd, P., Giraud-Carrier, C.: Using metalearning to predict when parameter optimization is likely to improve
classification accuracy. In: ECAI Workshop on Meta-learning and Algorithm Selection. pp. 18–23 (2014)
[11] Sanders, S., Giraud-Carrier, C.: Informing the use of hyperparameter optimization through metalearning. In: Proc. ICDM, pp. 1051–1056 (2017)