Insurance Claims Prediction
According to Willis Towers Watson, over 66% of insurance companies report that predictive analytics is useful in estimating their expenses and in underwriting. Using machine learning, companies can accurately predict claim costs based on certain criteria.
The Problem
This project explores the use of machine learning to accurately predict insurance claims for various clients, using data obtained from kaggle.com. The company is looking to use the power of AI/machine learning to predict claim amounts based on the available features.
The prediction focuses on regression methods such as linear regression, decision tree regressor, random forest, XGBoost and support vector regression. Performance was compared on criteria such as R-squared, and the best-performing model was selected for building the final model.
The Dataset
The dataset contains the following features:
- customer_ID - system-generated unique customer ID
- months_as_customer - number of months the insured has been a customer
- age - customer's age
- insured_education_level - customer's most recent educational qualification
- insured_sex - gender of the insured
- insured_occupation - occupation of the insured
- insured_hobbies - hobbies of the insured
- insured_relationship - relationship status of the insured
- capital-gains - capital gains
- capital-loss - capital loss
- policy_number - policy number
- policy_bind_date - date the insurance coverage was bound
- policy_state - state in which the policy was issued
- policy_csl - combined single limit of the policy
- policy_deductable - policy deductible
- incident_location - location of the incident
- incident_hour_of_the_day - hour of the day the incident occurred
- number_of_vehicles_involved - number of vehicles involved in the incident
- property_damage - whether property damage occurred
- bodily_injuries - number of bodily injuries
- policy_annual_premium - annual premium on the policy
- umbrella_limit - umbrella policy limit
- insured_zip - zip code of the insured
- incident_date - date of the incident
- incident_type - type of incident
- collision_type - type of collision
- incident_severity - severity of the incident
- authorities_contacted - authorities contacted after the incident
- incident_state - state in which the incident occurred
- incident_city - city in which the incident occurred
- witnesses - number of witnesses to the incident
- police_report_available - whether a police report is available
- auto_make - make of the vehicle
- auto_model - model of the vehicle
- auto_year - year of the vehicle
- _c39 - an empty column (all values null)
- total_claim_amount - total claim amount (the target variable)
Data Pre-Processing
1. The first step is to check for missing values. There are several ways to deal with missing values, such as replacing them with the mean, median or mode, imputing them with a regression method, or dropping them from the data entirely. However, this dataset contains only one problematic column, _c39, which is completely blank and is therefore dropped.
2. The next step involves checking each feature to ensure it is of the right data type, and converting it if necessary.
3. A quick check of the categorical columns reveals that some values are just "?". These were replaced with the value "unknown".
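A minimal sketch of these three steps, assuming the Kaggle CSV has been loaded with pandas (the file name insurance_claims.csv is an assumption; adjust to your copy):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv("insurance_claims.csv")

# 1. Inspect missing values; _c39 is completely null, so drop it
print(df.isnull().sum())
df = df.drop(columns=["_c39"])

# 2. Convert features to the right data types, e.g. the date columns
df["policy_bind_date"] = pd.to_datetime(df["policy_bind_date"])
df["incident_date"] = pd.to_datetime(df["incident_date"])

# 3. Replace the "?" placeholder in categorical columns with "unknown"
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].replace("?", "unknown")
```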
Exploratory Data Analysis
Exploratory data analysis (EDA) is a method of analyzing and understanding the features in a dataset. With EDA, patterns, anomalies and outliers can be easily spotted.
The pandas_profiling library can easily be used to produce basic descriptive statistics, but I will not use it in this project.
Several visualizations were used to explore the data for better understanding and modelling:
- correlation analysis, to better understand the relationships between features
- box plots, inspecting each categorical column in relation to the total claim amount
- distribution plots of the total claim amount and age. A quick glance shows that age looks normally distributed, while the total claim amount appears to be somewhat bi-modal. These features were converted to a standard normal distribution for better accuracy
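A brief sketch of these visualizations, continuing from the pre-processed df above (seaborn and matplotlib are my tooling choices here; the original code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric features
numeric_df = df.select_dtypes(include="number")
sns.heatmap(numeric_df.corr(), cmap="coolwarm")
plt.show()

# Box plot of total claim amount against one categorical column
sns.boxplot(data=df, x="insured_sex", y="total_claim_amount")
plt.show()

# Distributions of age and total claim amount
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["age"], kde=True, ax=axes[0])
sns.histplot(df["total_claim_amount"], kde=True, ax=axes[1])
plt.show()
```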
The following insights were gained from the EDA.
- Males and females have similar average total claims, but females have collected the highest claim amounts
- The highest claim amounts have been collected with police report value 2
- As expected, vehicles with trivial damage have generally collected the least in claims, while vehicles with a total loss or major damage have a higher mean claim
- Different car years show different effects on the total claim amount. The 1995 and 1996 models claim low amounts, but this could be because these cars are the least driven
- The majority of customers have been with the company for between 250 and 290 months
- The minimum customer age is 19, the maximum is 64 and the mean is 39
- Most of the customers are female (about 52.4%)
- Most customers have an education level of JD or high school; the least represented is PhD
- Most of the customers work in the occupation 'machine-op-inspct'
- The policy state is relatively evenly distributed between IL, OH and IN
- The most accidents occurred at 17:00, with the second highest around midnight. Generally, the bulk of the accidents occurred between 21:00 and 00:00
- Most of the accidents are single-vehicle (self) accidents
- Rear collision appears to be the most common type of collision
- Most of the cars suffered minor damage in the accident
- The police are the most contacted authority after accidents, with the fire service and ambulance second and third respectively. It seems most of the accidents are not life-threatening
- NY has the highest number of incident occurrences
- A lot of the cars involved in the accidents were made in 1995
- Most customer claims occur between 94 and 282 months after signing up
Feature Engineering
Feature engineering is the process of creating additional features (inputs) from existing ones.
- A new feature can be created by taking the difference between the policy_bind_date and the incident_date
- Features that would not be used for the analysis, such as customer_ID, insured_zip and policy_number, were dropped
- One-hot encoding was carried out to convert categorical features into numbers that can be easily processed and understood by a machine learning model
- The data was split into train and test sets for training and testing, to determine the efficacy of the model, as sketched below
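A minimal sketch of these steps, continuing from the pre-processed df above (the column name tenure_days and the 80/20 split are my assumptions, not the author's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# New feature: days between the policy bind date and the incident date
df["tenure_days"] = (df["incident_date"] - df["policy_bind_date"]).dt.days

# Drop identifier-like features that carry no predictive signal
df = df.drop(columns=["customer_ID", "insured_zip", "policy_number",
                      "policy_bind_date", "incident_date"])

# One-hot encode the remaining categorical features
df = pd.get_dummies(df, drop_first=True)

# Separate the target, then split into train and test sets
X = df.drop(columns=["total_claim_amount"])
y = df["total_claim_amount"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```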
Modelling
As this is a regression problem, various machine learning models were created and tested. The models used are:
- Linear regression
- Random forest
- Decision tree regressor
- XGBoost
Cross-validation was used on each model to determine the best-fitting model for the data.
cross_val_score splits the data into 15 folds and fits the model 15 different times: for each fold, it fits the data on the other 14 folds and scores the held-out 15th fold. It then returns the 15 scores, from which you can calculate a mean and variance.
The same cross-validation procedure was applied to the remaining models (random forest, decision tree and XGBoost) and the mean score was obtained for each, as sketched below.
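A minimal sketch of this comparison, using the train split from above (scoring with R-squared per the criteria mentioned earlier; default hyperparameters are an assumption, not the author's settings):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=42),
    "Random forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

# 15-fold cross-validation; a higher mean R-squared is better
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=15, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```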
Random forest had the best mean score and was therefore selected as the final model.
Conclusion
This article illustrated the use of machine learning algorithms to predict the insurance claim amount a customer is entitled to, using various inputs.
In subsequent articles, I intend to look into optimizing each model to improve accuracy and ensure I am making the most of it.
Click here to see the full code
Contact
I'm always looking for new and exciting opportunities where I can use these skills for better decision making. Let's connect.