Insurance Claims Prediction
According to Willis Towers Watson, over 66% of insurance companies report that predictive analytics is useful in estimating their expenses and in underwriting. Using machine learning, companies can accurately predict claim costs based on certain criteria.
The Problem
This project explores the use of machine learning to accurately predict insurance claims for various clients, using data obtained from kaggle.com. The company is looking to use the power of AI/machine learning to predict claim amounts based on the available features.
The prediction focuses on regression methods such as linear regression, decision tree regressor, random forest, XGBoost and support vector regression. Performance was compared on criteria such as R-squared, and the best-performing model was selected for building the final model.
The Dataset
The dataset contains the following features:
- customer_ID - system-generated unique customer ID
- months_as_customer - number of months the insured has been a customer
- age - customer's age
- insured_education_level - customer's most recent educational qualification
- insured_sex - gender of the insured
- insured_occupation - occupation of the insured
- insured_hobbies - hobbies of the insured
- insured_relationship - relationship status of the insured
- capital-gains - capital gains
- capital-loss - capital loss
- policy_number - policy number
- policy_bind_date - date the insurance coverage was bound
- policy_state - state in which the policy was issued
- policy_csl - combined single limit of the policy
- policy_deductable - policy deductible
- incident_location - location of the incident
- incident_hour_of_the_day - hour of the day the incident occurred
- number_of_vehicles_involved - number of vehicles involved in the incident
- property_damage - whether property damage occurred
- bodily_injuries - number of bodily injuries
- policy_annual_premium - annual premium on the policy
- umbrella_limit - umbrella policy limit
- insured_zip - zip code of the insured
- incident_date - date of the incident
- incident_type - type of incident
- collision_type - type of collision
- incident_severity - severity of the incident
- authorities_contacted - authorities contacted after the incident
- incident_state - state in which the incident occurred
- incident_city - city in which the incident occurred
- witnesses - number of witnesses to the incident
- police_report_available - whether a police report is available
- auto_make - make of the vehicle
- auto_model - model of the vehicle
- auto_year - year of the vehicle
- _c39 - an empty column (all values null)
- total_claim_amount - total claim amount (the target variable)
Data Pre-Processing
1. The first step is to check for missing values. There are several ways to deal with missing values, such as replacing them with the mean, median or mode, imputing them with a regression method, or dropping them from the data entirely. However, this dataset contains only one problematic column, _c39, which is completely blank and is therefore dropped.
2. The next step involves checking each feature to ensure it is of the right data type, and converting it if necessary.
3. A quick check of the categorical columns reveals that some values are just "?". These were replaced with the value "unknown".
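A minimal sketch of these three steps, assuming the Kaggle CSV has been loaded with pandas (the file name insurance_claims.csv is an assumption; adjust to your copy):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv("insurance_claims.csv")

# 1. Inspect missing values; _c39 is completely null, so drop it
print(df.isnull().sum())
df = df.drop(columns=["_c39"])

# 2. Convert features to the right data types, e.g. the date columns
df["policy_bind_date"] = pd.to_datetime(df["policy_bind_date"])
df["incident_date"] = pd.to_datetime(df["incident_date"])

# 3. Replace the "?" placeholder in categorical columns with "unknown"
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].replace("?", "unknown")
```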
Exploratory Data Analysis
Exploratory data analysis (EDA) is a method of analyzing and understanding the features in a dataset. With EDA, patterns, anomalies and outliers can be easily spotted.
The pandas_profiling library can easily be used to produce basic descriptive statistics, but I will not use it in this project.
Several visualizations were used to explore the data for better understanding and modelling:
- correlation analysis, to better understand the relationships between features
- box plots, inspecting each categorical column in relation to the total claim amount
- distribution plots of the total claim amount and age. A quick glance shows that age looks normally distributed, while the total claim amount appears to be somewhat bi-modal. These features were converted to a standard normal distribution for better accuracy
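A brief sketch of these visualizations, continuing from the pre-processed df above (seaborn and matplotlib are my tooling choices here; the original code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric features
numeric_df = df.select_dtypes(include="number")
sns.heatmap(numeric_df.corr(), cmap="coolwarm")
plt.show()

# Box plot of total claim amount against one categorical column
sns.boxplot(data=df, x="insured_sex", y="total_claim_amount")
plt.show()

# Distributions of age and total claim amount
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["age"], kde=True, ax=axes[0])
sns.histplot(df["total_claim_amount"], kde=True, ax=axes[1])
plt.show()
```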
The following insights were gained from the EDA.
- Males and females have similar average total claims, but females have collected the highest claim amounts
- The highest claim amounts have been collected with police report value 2
- As expected, vehicles with trivial damage have generally collected the least in claims, while vehicles with a total loss or major damage have a higher mean claim
- Different car years show different effects on the total claim amount. The 1995 and 1996 models claim low amounts, but this could be because these cars are the least driven
- The majority of customers have been with the company for between 250 and 290 months
- The minimum customer age is 19, the maximum is 64 and the mean is 39
- Most of the customers are female (about 52.4%)
- Most customers have an education level of JD or high school; the least represented is PhD
- Most of the customers work in the occupation 'machine-op-inspct'
- The policy state is relatively evenly distributed between IL, OH and IN
- The most accidents occurred at 17:00, with the second highest around midnight. Generally, the bulk of the accidents occurred between 21:00 and 00:00
- Most of the accidents are single-vehicle (self) accidents
- Rear collision appears to be the most common type of collision
- Most of the cars suffered minor damage in the accident
- The police are the most contacted authority after accidents, with the fire service and ambulance second and third respectively. It seems most of the accidents are not life-threatening
- NY has the highest number of incident occurrences
- A lot of the cars involved in the accidents were made in 1995
- Most customer claims occur between 94 and 282 months after signing up
Feature Engineering
Feature engineering is the process of creating additional features (inputs) from existing ones.
- A new feature can be created by taking the difference between the policy_bind_date and the incident_date
- Features that would not be used for the analysis, such as customer_ID, insured_zip and policy_number, were dropped
- One-hot encoding was carried out to convert categorical features into numbers that can be easily processed and understood by a machine learning model
- The data was split into train and test sets for training and testing, to determine the efficacy of the model, as sketched below
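A minimal sketch of these steps, continuing from the pre-processed df above (the column name tenure_days and the 80/20 split are my assumptions, not the author's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# New feature: days between the policy bind date and the incident date
df["tenure_days"] = (df["incident_date"] - df["policy_bind_date"]).dt.days

# Drop identifier-like features that carry no predictive signal
df = df.drop(columns=["customer_ID", "insured_zip", "policy_number",
                      "policy_bind_date", "incident_date"])

# One-hot encode the remaining categorical features
df = pd.get_dummies(df, drop_first=True)

# Separate the target, then split into train and test sets
X = df.drop(columns=["total_claim_amount"])
y = df["total_claim_amount"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```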
Modelling
As this is a regression problem, various machine learning models were created and tested. The models used are:
- Linear regression
- Random forest
- Decision tree regressor
- XGBoost
Cross-validation was used on each model to determine the best-fitting model for the data.
cross_val_score splits the data into 15 folds and fits the model 15 different times: for each fold, it fits the data on the other 14 folds and scores the held-out 15th fold. It then returns the 15 scores, from which you can calculate a mean and variance.
The same cross-validation procedure was applied to the remaining models (random forest, decision tree and XGBoost) and the mean score was obtained for each, as sketched below.
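A minimal sketch of this comparison, using the train split from above (scoring with R-squared per the criteria mentioned earlier; default hyperparameters are an assumption, not the author's settings):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(random_state=42),
    "Random forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

# 15-fold cross-validation; a higher mean R-squared is better
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=15, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```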
Random forest had the best mean score and was therefore selected as the final model.
Conclusion
This article illustrated the use of machine learning algorithms to predict the insurance claim amount a customer is entitled to, using various inputs.
In subsequent articles, I intend to look into optimizing each model to improve accuracy and ensure I am making the most of it.
Click here to see the full code
Contact
I'm always looking for new and exciting opportunities where I can use these skills for better decision making. Let's connect.