Modelling the expected loss of bodily injury claims using
gradient boosting
The following is an extract from a report on modelling the expected loss of bodily
injury claims using gradient boosting.
Summary
➢ Modelling the expected loss:
o Frequency model (classification)*
o Severity model (regression)
➢ The expected loss is given by multiplying the frequency and severity under the
assumption that they are independent**.
➢ Not so fast:
o The frequency model needs to take time into account
o Calculate a hazard rate
➢ Emphasis is on predictive accuracy
* Classification was used due to the nature of the policy and data – there could only be a claim (1) or no claim (0)
over the period examined. It was therefore a case of modelling the probability of a claim at an individual
policyholder level over the time period.
**There are important technical considerations here which are beyond the scope of this initial project. Various
approaches can and need to be considered when modelling the aggregate expected loss, and these approaches
can impact the modelling of the frequency and severity models. For example, scenario analysis requires
specification of the functional form of the frequency and severity models. One suggested approach is to
use a Poisson frequency and Gamma distributed severity for individual claims so that the expected loss follows a
Tweedie compound Poisson distribution. See (Yang, Qian, Zou, 2014).
by Gregg Barrett
Overview of the modelling effort for the bodily injury claims data
A critical challenge in insurance is setting the premium for the policyholder. In a competitive market
an insurer needs to accurately price the expected loss of the policyholder. Failing to do so places the
insurer at the risk of adverse selection. Adverse selection is where the insurer loses profitable policies
and retains loss incurring policies resulting in economic loss. In personal car insurance for example,
this could occur if the insurer charged the same premium for old and young drivers. If the expected
loss for old drivers were significantly lower than that of the young drivers, the old drivers could be
expected to switch to a competitor, leaving the insurer with a portfolio of young drivers (who are
under-priced) and incurring an economic loss.
In this project we attempt to accurately predict the expected loss for the policyholder concerning a
bodily injury claim. In doing so it is necessary to break down the process into two distinct components:
claim frequency and claim severity. For convenience and simplicity, we have chosen to model the
frequency and severity separately.
Other inputs into the premium setting (rating process) such as administrative costs, loadings, cost of
capital etc. have been omitted as we are only concerned with modelling the expected loss.
In modelling the claim frequency, a classification model will be used to model the probability of a
bodily injury claim given a set of features that cover mostly vehicle characteristics. The actual claim
frequency for the dataset used in this project is around 1%.
In modelling the claim severity, a regression model will be used to model the expected claim amount
again using a set of features that cover mostly vehicle characteristics.
To ensure that the estimated performance of the model, as measured on the test sample, is an
accurate approximation of the expected performance on future ‘‘unseen’’ cases, the inception date
of policies in the test set is later than that of the policies used to train the model.
The dataset, which covers the period from 2005 through 2007, was therefore split into three groups:
2005 through 2006 – Training set
2005 through 2006 – Validation set
2007 – Test set
An adjustment to the output of the claim frequency model is necessary in order to derive a probability
on an annual basis. This is due to the claim frequency being calculated over a period of two years
(2005 through 2006). For this project we assumed an exponential hazard function and adjusted the
claims frequency as follows:
P(T) = 1 - exp(-λT)
where:
P(T) = the annual probability of a claim
T = 1/2 (one year expressed as a fraction of the two-year period)
λ = the probability of a claim predicted by the claim frequency model
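As a minimal illustration of this adjustment, the formula above can be applied in R as follows. This is a sketch only; the vector name p_two_year is an assumption standing for the claim probabilities produced by the frequency model over the 2005 through 2006 period.

# Annualise the predicted claim probabilities, treating the model output as
# the rate of an exponential hazard over the two-year period (T = 1/2).
annualise <- function(lambda, T = 1 / 2) {
  1 - exp(-lambda * T)
}
p_annual <- annualise(p_two_year)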
In this project model validation is measured by the degree of predictive accuracy, and this objective is
emphasized over producing interpretable models. The lack of interpretability in most algorithmic
models appears to be a reason that their application to insurance pricing problems has been very
limited so far. (Guelman, 2011).
In modelling the claim frequency, a ROC (Receiver Operating Characteristic) curve will be used to
assess model performance, measuring the AUC (Area Under the Curve). In modelling the claim severity,
the RMSE (Root Mean Squared Error) will be used to assess model performance.
The test data was not used for model selection purposes, but purely to assess the generalization error
of the final chosen model. Assessing this error is broken down into three components (a short sketch
of these checks follows the list):
1) Assessing the performance of the classification model on the test data using the AUC score.
2) Assessing the performance of the regression model on the test data using the RMSE.
3) Assessing the performance in predicting the expected loss by comparing the predicted
expected loss for the 2007 portfolio of policyholders against the realised loss for the 2007
portfolio of policyholders.
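The following R sketch illustrates these three checks under stated assumptions: the data frame test (the 2007 policies, with a 0/1 claim indicator and the realised claim_amount), p_claim (the annualised claim probabilities) and pred_severity (the predicted claim amounts) are hypothetical names, and the pROC package is used only for convenience in computing the AUC.

library(pROC)

# 1) Frequency model: area under the ROC curve on the 2007 test set
auc_value <- auc(roc(test$claim, p_claim))

# 2) Severity model: RMSE of the predicted claim amounts on policies with a claim
has_claim <- test$claim == 1
rmse <- sqrt(mean((test$claim_amount[has_claim] - pred_severity[has_claim])^2))

# 3) Expected loss: per-policy frequency x severity, aggregated over the
#    portfolio and compared with the realised 2007 loss
predicted_loss <- sum(p_claim * pred_severity)
realised_loss <- sum(test$claim_amount[has_claim])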
Gradient Boosting, often referred to as simply “boosting”, was selected as the modelling approach for
this project. Boosting is a general approach that can be applied to many statistical learning methods
for regression or classification. Boosting is a supervised, non-parametric machine learning approach.
(Geurts, Irrthum, Wehenkel, 2009). Supervised learning refers to the subset of machine learning
methods which derive models in the form of input-output relationships. More precisely, the goal of
supervised learning is to identify a mapping from some input variables to some output variables on
the sole basis of a given sample of joint observations of the values of these variables. Non-parametric
means that we do not make explicit assumptions about the functional form of f; rather, the intent is
to find a function 𝑓̂ such that Y ≈ 𝑓̂(X) for any observation (X, Y).
With boosting methods, optimisation is carried out in function space. That is, we parameterise the
function estimate 𝑓̂ in the additive functional form:

𝑓̂(x) = 𝑓̂0 + 𝑓̂1(x) + … + 𝑓̂M(x)

In this representation:
𝑓̂0 is the initial guess
M is the number of iterations
𝑓̂1, …, 𝑓̂M are the function increments, also referred to as the “boosts”
It is useful to distinguish between the parameterisation of the “base-learner” functions, which define
the individual increments, and the “loss function”, which is minimised when fitting the overall
ensemble estimate 𝑓̂(𝑥).
Boosted models can be implemented with different base-learner functions. Common base-learner
functions include linear models, smooth models, decision trees, and custom base-learner functions.
Several classes of base-learner models can be implemented in one boosted model. This means that
the same functional formula can include both smooth additive components and decision tree
components at the same time. (Natekin, Knoll, 2013).
Loss functions can be classified according to the type of outcome variable, Y. For regression problems,
Gaussian (minimizing squared error), Laplace (minimizing absolute error), and Huber loss functions are
considerations, while Bernoulli or AdaBoost loss functions are considerations for classification. There
are also loss functions for survival models and count data.
This flexibility makes boosting highly customizable to any particular data-driven task. It introduces
a lot of freedom into the model design, thus making the choice of the most appropriate loss function
a matter of trial and error. (Natekin, Knoll, 2013).
To provide an intuitive explanation we will use an example of boosting in the context of decision trees
as was used in this project. Unlike fitting a single large decision tree to the training data, which
amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns
slowly. Given an initial model (decision tree), we fit a decision tree (the base-learner) to the residuals
from the initial model. That is, we fit a tree using the current residuals, rather than the outcome Y. We
then add this new decision tree into the fitted function in order to update the residuals. The process
is conducted sequentially so that at each particular iteration, a new weak, base-learner model is
trained with respect to the error of the whole ensemble learnt so far.
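To make this concrete, the following toy R sketch (not the project's actual code) boosts shallow rpart regression trees on the residuals under squared-error loss; the function, its arguments and the default values are purely illustrative.

library(rpart)

# x: a data.frame of predictors, y: a numeric response.
boost_sketch <- function(x, y, n_trees = 100, shrinkage = 0.01, depth = 2) {
  f0 <- mean(y)                       # initial guess
  pred <- rep(f0, length(y))
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred                 # residuals of the current ensemble
    fit <- rpart(resid ~ ., data = cbind(x, resid = resid),
                 control = rpart.control(maxdepth = depth))
    pred <- pred + shrinkage * predict(fit, x)   # add the shrunken "boost"
    trees[[m]] <- fit
  }
  list(f0 = f0, trees = trees, shrinkage = shrinkage)
}

predict_boost <- function(model, newx) {
  model$f0 + model$shrinkage * Reduce(`+`, lapply(model$trees, predict, newdata = newx))
}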
With such an approach the model structure is thus learned from data and not predetermined, thereby
avoiding an explicit model specification, and incorporating complex and higher order interactions to
reduce the potential modelling bias. (Yang, Qian, Zou, 2014).
The first choice in building the model involves selecting an appropriate loss function. Squared-error
loss was selected to define prediction error for the severity model and Bernoulli deviance was selected
for the frequency model.
There are three tuning parameters that need to be set:
- Shrinkage (the learning rate)
- Number of trees (the number of iterations)
- Depth (interaction depth)
The shrinkage parameter sets the learning rate of the base-learner models. In general, statistical
learning approaches that learn slowly tend to perform well. In boosting the construction of each tree
depends on the trees that have already been grown. Typical values are 0.01 or 0.001, and the right
choice can depend on the problem. (Ridgeway, 2012). It is important to know that smaller values of
shrinkage (almost) always give improved predictive performance. However, there are computational
costs, both storage and CPU time, associated with setting shrinkage to be low. The model with
shrinkage=0.001 will likely require ten times as many trees as the model with shrinkage=0.01,
increasing storage and computation time by a factor of 10.
It is generally the case that for small shrinkage parameters, 0.001 for example, there is a fairly long
plateau in which predictive performance is at its best. A recommended rule of thumb is to set
shrinkage as small as possible while still being able to fit the model in a reasonable amount of time
and storage. (Ridgeway, 2012).
Boosting can overfit if the number of trees is too large, although this overfitting tends to occur slowly
if at all. (James, Witten, Hastie, Tibshirani, 2013). Cross-validation and information criteria can be
used to select the number of trees. Again it is worth stressing that the optimal number of trees and
the shrinkage (learning rate) depend on each other, although slower learning rates do not necessarily
scale the number of optimal trees. That is, shrinkage = 0.1 with an optimal number of trees of 100
does not necessarily imply that with shrinkage = 0.01 the optimal number of trees is 1,000.
(Ridgeway, 2012).
Depth sets the number of splits in each tree, which controls the complexity of the boosted ensemble.
When depth = 1 each tree is a stump, consisting of a single split. In this case, the boosted ensemble is
fitting an additive model, since each term involves only a single variable. More generally depth is the
interaction depth, and controls the interaction order of the boosted model, since d splits can involve
at most d variables. (James, Witten, Hastie, Tibshirani, 2013).
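As a hedged sketch of how these three tuning parameters are typically supplied to the gbm package, the call below fits the frequency model with Bernoulli deviance and uses cross-validation to choose the number of trees; the data frame names (freq_train, freq_test), the 0/1 claim column and the parameter values are assumptions for illustration only.

library(gbm)

freq_gbm <- gbm(claim ~ ., data = freq_train,
                distribution = "bernoulli",   # Bernoulli deviance (frequency model)
                shrinkage = 0.01,             # learning rate
                n.trees = 5000,               # number of boosting iterations
                interaction.depth = 3,        # depth of each tree
                cv.folds = 5)

# Cross-validation estimate of the optimal number of trees
best_iter <- gbm.perf(freq_gbm, method = "cv")

# Predicted two-year claim probabilities for the test policies
p_two_year <- predict(freq_gbm, newdata = freq_test,
                      n.trees = best_iter, type = "response")

# The severity model follows the same pattern with distribution = "gaussian"
# (squared-error loss) and the claim amount as the response.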
A strength of tree-based methods is that single-depth trees are readily understandable and
interpretable. In addition, decision trees have the ability to select or rank the attributes according to
their relevance for predicting the output, a feature that is shared with almost no other non-parametric
methods. (Geurts, Irrthum, Wehenkel, 2009).
From the point of view of their statistical properties, tree-based methods are non-parametric universal
approximators, meaning that, with sufficient complexity, a tree can represent any continuous function
with arbitrarily high precision. When used with numerical attributes, they are invariant with respect
to monotone transformations of the input attributes. (Geurts, Irrthum, Wehenkel, 2009).
Importantly, boosted decision trees require very little data pre-processing, which can easily be one of
the most time-consuming activities in a project of this nature. Because boosted decision trees handle
predictor and response variables of any type without the need for transformation, and are insensitive
to outliers and missing values, they are a natural choice not only for this project but for insurance in
general, where there are frequently a large number of categorical and numerical predictors, non-
linearities and complex interactions, as well as missing values, all of which need to be modelled.
Lastly, the techniques used in this project can be applied independently of the limitations imposed by
any specific legislation.
Potential Improvements
Below are several suggestions for improving the initial model.
Specification
A careful specification of the loss function leads to the estimation of any desired characteristic of the
conditional distribution of the response. This, coupled with the large number of base-learners,
guarantees a rich set of models that can be addressed by boosting. (Hofner, Mayr, Robinzonovz,
Schmid, 2014)
AUC loss function for the classification model
For the classification model, AUC can be tested as a loss function in order to directly optimize the area
under the ROC curve.
Huber loss function for the regression model
The Huber loss function can be used as a robust alternative to the L2 (least squares error) loss. It is
quadratic for small residuals and linear for large ones:

ρ(y, f) = ½ (y − f)²            if |y − f| ≤ δ
ρ(y, f) = δ (|y − f| − δ/2)     otherwise

Where:
ρ is the loss function
δ is the parameter that sets the threshold beyond which residuals are subject to absolute error loss
Quantile loss function for the regression model:
Another alternative for settings with continuous response is modeling conditional quantiles through
quantile regression (Koenker 2005). The main advantage of quantile regression is (beyond its
robustness towards outliers) that it does not rely on any distributional assumptions on the response
or the error terms. (Hofner, Mayr, Robinzonovz, Schmid, 2014)
Laplace loss function for the regression model:
The Laplace loss function is the function of choice if we are interested in the median of the conditional
distribution. It implements a distribution free, median regression approach especially useful for long-
tailed error distributions.
The loss function allows flexible specification of the link between the response and the covariates.
[Figure: the L2 loss (left) compared with the L1, least absolute deviation, loss (right).]
All of the above listed loss functions can be implemented within the “mboost” package. The table
below provides an overview of some of the currently available loss functions within the mboost
package.
[Table: an overview of the currently implemented families in mboost (Hofner, Mayr, Robinzonovz, Schmid, 2014).]
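As an illustration of how these alternatives could be specified, the sketch below uses mboost with tree base-learners via blackboost(); the data frame and column names (sev_train, freq_train, claim_amount, claim) and the control settings are assumptions, not the project's actual code.

library(mboost)

# Huber loss as a robust alternative to L2 for the severity model
sev_huber <- blackboost(claim_amount ~ ., data = sev_train,
                        family = Huber(),   # delta is chosen adaptively if not supplied
                        control = boost_control(mstop = 1000, nu = 0.01))

# Quantile regression, e.g. for the 90th percentile of the claim amount
sev_q90 <- blackboost(claim_amount ~ ., data = sev_train,
                      family = QuantReg(tau = 0.9),
                      control = boost_control(mstop = 1000, nu = 0.01))

# Median (Laplace / L1) regression for the severity model
sev_med <- blackboost(claim_amount ~ ., data = sev_train,
                      family = Laplace(),
                      control = boost_control(mstop = 1000, nu = 0.01))

# Frequency model optimising the area under the ROC curve;
# the AUC() family expects a two-level factor response.
freq_train$claim <- factor(freq_train$claim)
freq_auc <- blackboost(claim ~ ., data = freq_train,
                       family = AUC(),
                       control = boost_control(mstop = 1000, nu = 0.01))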
Optimal number of iterations using AIC:
To maximise predictive power and to prevent overfitting, it is important that the optimal stopping
iteration is carefully chosen. Various possibilities to determine the stopping iteration exist. AIC was
considered; however, it is usually not recommended, as AIC-based stopping tends to overshoot the
optimal stopping iteration dramatically. (Hofner, Mayr, Robinzonovz, Schmid, 2014)
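As a brief sketch of the usual alternative, mboost's cross-validated empirical risk can be used to choose the stopping iteration; sev_huber here refers to the illustrative model from the previous sketch.

cvr <- cvrisk(sev_huber)    # 25 bootstrap samples by default
mstop(cvr)                  # estimated optimal stopping iteration
sev_huber[mstop(cvr)]       # set the model to the chosen iteration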
Package
Package xgboost
The package “xgboost” was also tested during this project. Its benefit over the gbm and mboost
packages is that it is purportedly faster. It should be noted that xgboost requires the data to be in the
form of numeric vectors and thus necessitates some additional data preparation. It was also found to
be a little more challenging to implement than the gbm and mboost packages.
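The sketch below shows the kind of additional preparation involved: categorical predictors are one-hot encoded into a numeric matrix before fitting. The data frame names, the 0/1 claim column and the parameter values are assumptions for illustration only.

library(xgboost)

# One-hot encode the predictors into an all-numeric matrix
X_train <- model.matrix(claim ~ . - 1, data = freq_train)
X_test <- model.matrix(claim ~ . - 1, data = freq_test)

freq_xgb <- xgboost(data = X_train,
                    label = freq_train$claim,      # assumes a 0/1 coding
                    objective = "binary:logistic",
                    eta = 0.01,                    # shrinkage / learning rate
                    max_depth = 3,                 # interaction depth
                    nrounds = 1000,
                    verbose = 0)

p_two_year_xgb <- predict(freq_xgb, X_test)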
References
Geurts, P., Irrthum, A., Wehenkel, L. (2009). Supervised learning with decision tree-based methods in
computational and systems biology. [pdf]. Retrieved from
http://www.montefiore.ulg.ac.be/~geurts/Papers/geurts09-molecularbiosystems.pdf
Guelman, L. (2011). Gradient boosting trees for auto insurance loss cost modeling and prediction.
[pdf]. Retrieved from http://www.sciencedirect.com/science/article/pii/S0957417411013674
Hofner, B., Mayr, A., Robinzonovz, N., Schmid, M. (2014). Model-based boosting in R: a hands-on
tutorial using the R package mboost. [pdf]. Retrieved from
https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf
Hofner, B., Mayr, A., Robinzonovz, N., Schmid, M. (2014). An overview on the currently implemented
families in mboost. [table]. In Model-based boosting in R: a hands-on tutorial using the R package
mboost. Retrieved from https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). Introduction to statistical learning with applications in R.
[ebook]. Retrieved from http://www-bcf.usc.edu/~gareth/ISL/getbook.html
Natekin, A., Knoll, A. (2013). Gradient boosting machines tutorial. [pdf]. Retrieved from
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/
Ridgeway, G. (2012). Generalized boosted models: a guide to the gbm package. [pdf]. Retrieved
from https://cran.r-project.org/web/packages/gbm/gbm.pdf
Yang, Y., Qian, W., Zou, H. (2014). A boosted nonparametric tweedie model for insurance premium.
[pdf]. Retrieved from https://people.rit.edu/wxqsma/papers/paper4