Insurance Claims Prediction

According to Willis Towers Watson, over 66% of insurance companies report that predictive analytics is useful in estimating their expenses and in underwriting. Using machine learning, companies can accurately predict claim costs based on certain criteria.

The Problem

This project explores the use of machine learning to accurately predict insurance claims for various clients, using data obtained from kaggle.com. The company is looking to use the power of AI/machine learning to predict claim amounts based on the available features.

The prediction focuses on regression methods such as linear regression, decision tree regression, random forest, XGBoost and support vector regression. Performance was compared using criteria such as R-squared, and the best-performing model was selected for building the final model.

The Dataset

The dataset contains the following features:

  • customer_ID - system-generated unique customer ID

  • months_as_customer - number of months the insured has been a customer

  • age - customer age

  • insured_education_level - customer's highest educational qualification

  • insured_sex - gender of the insured

  • insured_occupation - occupation of the insured

  • insured_hobbies - hobbies of the insured

  • insured_relationship - relationship status of the insured

  • capital-gains - capital gains

  • capital-loss - capital losses

  • policy_number - policy number

  • policy_bind_date - date the policy was bound

  • policy_state - state in which the policy was issued

  • policy_csl - combined single limit of the policy

  • policy_deductable - policy deductible

  • incident_location - location of the incident

  • incident_hour_of_the_day - hour of the day the incident occurred

  • number_of_vehicles_involved - number of vehicles involved

  • property_damage - whether property damage occurred

  • bodily_injuries - number of bodily injuries

  • policy_annual_premium - annual premium of the policy

  • umbrella_limit - umbrella coverage limit

  • insured_zip - zip code of the insured

  • incident_date - date of the incident

  • incident_type - type of incident

  • collision_type - type of collision

  • incident_severity - severity of the incident

  • authorities_contacted - authorities contacted after the incident

  • incident_state - state in which the incident occurred

  • incident_city - city in which the incident occurred

  • witnesses - number of witnesses

  • police_report_available - whether a police report is available

  • auto_make - make of the vehicle

  • auto_model - model of the vehicle

  • auto_year - year of the vehicle

  • _c39 - empty, system-generated column

  • total_claim_amount - total claim amount (the target variable)

Data Pre-Processing

1. The first step is to check for missing values. There are several ways to deal with missing values, such as replacing them with the mean, median or mode, imputing them with a regression method, or dropping them from the data entirely.

However, the dataset contains only one problematic column: _c39 is completely blank and was therefore dropped.

2. The next step involves checking each feature to ensure it is of the right data type, and converting it if necessary.

3. A quick check of the categorical columns reveals that some values are simply "?". These were replaced with the value "unknown".

[Image: df.info() output]
[Image: date/time conversions]
[Image: preprocessing step 3]
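The three preprocessing steps above can be sketched with pandas. The toy frame below stands in for the Kaggle data: the column names come from the feature list in this article, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the Kaggle claims data (values are illustrative)
df = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2006-06-27"],
    "incident_date": ["2015-01-25", "2015-01-21"],
    "collision_type": ["Rear Collision", "?"],
    "_c39": [None, None],
})

# 1. _c39 is entirely null, so drop it rather than impute
df = df.drop(columns=["_c39"])

# 2. Convert date-like columns to proper datetime dtypes
for col in ["policy_bind_date", "incident_date"]:
    df[col] = pd.to_datetime(df[col])

# 3. Replace the "?" placeholder in categorical columns with "unknown"
df = df.replace("?", "unknown")
```

On the real dataset the same calls apply unchanged after `pd.read_csv`; only the column lists differ.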

Exploratory Data Analysis

Exploratory data analysis (EDA) is a method for analyzing and understanding the features in a dataset. With EDA, patterns, anomalies and outliers can be easily spotted.

The pandas_profiling library can be used to quickly generate basic descriptive statistics. However, I will not use it in this project.

Several visualizations were used to explore the data for better understanding and modelling:

  • correlation analysis, to better understand the relationships between features

  • box plots, inspecting each categorical column in relation to the total claim amount

  • distribution plots of the total claim amount and age. A quick glance shows that age looks normally distributed, but there seems to be some sort of bi-modal distribution in the total claim amount. These features were converted to a standard normal distribution for better accuracy
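The numeric side of these checks can be sketched without plotting: a correlation matrix plus standardization to zero mean and unit variance. The frame below is synthetic (column names from the feature list, values generated so that age and claim amount correlate by construction), not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns of the claims data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(39, 9, 200).round(),
    "months_as_customer": rng.integers(0, 480, 200),
})
# Make the claim amount depend on age so the correlation is visible
df["total_claim_amount"] = 500 * df["age"] + rng.normal(0, 5000, 200)

# Correlation analysis: pairwise relationships between numeric features
corr = df.corr(numeric_only=True)
print(corr.loc["age", "total_claim_amount"])  # positive here by construction

# Standardize each column to zero mean / unit variance before modelling
standardised = (df - df.mean()) / df.std()
```

The plots themselves (heatmap, box plots, histograms) would typically be drawn with matplotlib or seaborn on top of these same quantities.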

The following insights were gained from the EDA.

  1. Males and females have similar average total claims, but females have collected the highest total amount in claims

  2. The highest claim amounts have been collected where a police report is available

  3. As expected, vehicles with trivial damage have generally collected the least in claims, while vehicles with a total loss or major damage have a higher mean claim

  4. Different car years show different effects on the total claim amount. The 1995 and 1996 models claim low amounts, but this could be because these cars are the least driven

  5. The majority of customers have been with the company for between 250 and 290 months

  6. The minimum customer age is 19, the maximum is 64 and the mean is 39

  7. Most of the customers are female (about 52.4%)

  8. Most customers have an education level of JD or high school; the least represented is PhD

  9. The most common occupation among customers is 'machine-op-inspct'

  10. The policy state is relatively evenly distributed between IL, OH and IN

  11. The most accidents occurred at 17:00, with the second highest number at midnight. Generally, the bulk of the accidents occurred between 21:00 and 00:00

  12. Most of the accidents are single-vehicle (self) accidents

  13. Rear collisions seem to be the most common type of collision

  14. Most of the cars suffered minor damage during the accident

  15. The police are the body most often contacted after an accident, with fire and ambulance services second and third respectively. It seems most of the accidents are not life-threatening

  16. NY has the highest number of incident occurrences

  17. A lot of the cars involved in the accidents were made in 1995

  18. Most customer claims occur between 94 and 282 months after signing up

[Image: correlation matrix]
[Image: box plots]
[Image: total claim amount distribution]
[Image: age distribution]

Feature Engineering 

Feature engineering is the process of creating new features (inputs) from existing ones.

  • A new feature can be created by taking the difference between policy_bind_date and incident_date

  • Features that would not be used for the analysis, such as customer_id, zip code and policy number, were dropped

  • One-hot encoding was carried out to convert categorical features into numbers that can be easily processed and understood by a machine learning model

  • The data was split into a train set and a test set, used for training and for determining the efficacy of the model.
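These four steps can be sketched with pandas and scikit-learn. The frame below is a toy stand-in (column names from the article's feature list, values invented); `months_to_incident` is an illustrative name for the new date-difference feature.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the claims data (values are illustrative only)
df = pd.DataFrame({
    "policy_bind_date": pd.to_datetime(
        ["2014-10-17", "2006-06-27", "2000-09-06", "1990-05-25"]),
    "incident_date": pd.to_datetime(["2015-01-25"] * 4),
    "insured_sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "policy_number": [521585, 342868, 687698, 227811],
    "total_claim_amount": [71610, 5070, 34650, 63400],
})

# New feature: approximate months between binding the policy and the incident
df["months_to_incident"] = (
    df["incident_date"] - df["policy_bind_date"]).dt.days // 30

# Drop identifier columns that carry no predictive signal
df = df.drop(columns=["policy_number", "policy_bind_date", "incident_date"])

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=["insured_sex"], drop_first=True)

# Train/test split for model evaluation
X = df.drop(columns=["total_claim_amount"])
y = df["total_claim_amount"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```

`drop_first=True` keeps one dummy per binary category (here `insured_sex_MALE`), avoiding redundant columns for linear models.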

[Image: feature engineering snippet]

Modelling

As this is a regression problem, various machine learning models were created and tested. The models used are:

  1. Linear regression

  2. Random Forest

  3. Decision tree regressor

  4. XGBoost

Cross validation was used on each model to determine the best fitting model for the data.

cross_val_score splits the data into 15 folds and fits the model 15 different times: for each fold, it fits the model on the other 14 folds and scores it on the held-out fold. This gives 15 scores from which a mean and variance can be calculated.

The cross-validation method was applied to the remaining models (random forest, decision tree and XGBoost) and the mean score was obtained for each.

Random forest had the best mean score and was therefore chosen as the best model.
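The comparison can be sketched with scikit-learn as below. The data here is synthetic (XGBoost is omitted since it is a separate install); note that on this linear toy data linear regression wins, whereas on the real claims data the article found random forest best.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the encoded claims features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 15-fold cross-validation: each model is fitted 15 times on 14 folds and
# scored (R-squared by default for regressors) on the held-out fold
mean_scores = {
    name: cross_val_score(model, X, y, cv=15).mean()
    for name, model in models.items()
}
best = max(mean_scores, key=mean_scores.get)
```

Selecting by mean cross-validation score, as here, is the procedure the article describes; the winning model is then refitted on the full training set.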

[Image: linear regression snippet]

Conclusion

This article illustrated the use of machine learning algorithms to predict the insurance claim a customer is entitled to, using various inputs.

In subsequent articles, I intend to look into optimizing each model to improve accuracy and ensure I am making the most of it.

Click here to see the full code

Click here to download report

Contact

I'm always looking for new and exciting opportunities in which I can utilize these skills for better decision-making. Let's connect.
