Skip to content

This Python script analyzes a publicly available insurance dataset to predict the average insurance premium based on different customer features such as age, BMI, and smoking status. It demonstrates data cleaning, aggregation, analysis, and visualization techniques using Python libraries.

Notifications You must be signed in to change notification settings

alekhyak1/insurance-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Insurance-analysis

This Python script analyzes a publicly available insurance dataset to predict the average insurance premium based on different customer features such as age, BMI, and smoking status. It demonstrates data cleaning, aggregation, analysis, and visualization techniques using Python libraries.

Dataset Overview: Insurance Premium Prediction

The dataset used for this project comes from a publicly available source on Kaggle titled "Insurance Premium Prediction". The data contains various features related to individuals' health and demographic information, with the goal of understanding how these features impact the insurance premium they are charged.

Dataset Columns: age: The age of the individual (numeric). sex: Gender of the individual (male or female). bmi: Body Mass Index, a measure of body fat based on height and weight (numeric). children: Number of children/dependents covered by health insurance (numeric). smoker: Whether the individual is a smoker (yes or no). region: The individual’s residential region in the U.S. (northeast, southeast, southwest, northwest). charges: The insurance premium charged to the individual (numeric). Objective of the Analysis: The primary goal of the analysis is to explore how different features (such as age, BMI, and smoking status) impact the insurance premium charged (i.e., the charges column). By performing data exploration, analysis, and visualization, this code will provide insights into: The relationship between smoking and insurance premiums. The correlation between age, BMI, and insurance premiums. General distribution and behavior of insurance charges across different population groups.

Procedure:

  1. Loading and Exploring the Dataset: The dataset is loaded from a URL using Pandas' read_csv() function. The first few rows are displayed using df.head(), and descriptive statistics like mean, median, and standard deviation for the numerical variables are provided using df.describe() to understand the dataset better.
  2. Data Cleaning: A check for missing values is performed using df.isnull().sum(). This ensures the dataset is clean and complete, allowing for smooth data analysis without having to handle missing or invalid data points.
  3. Data Analysis: Average Insurance Charges by Smoking Status: A key aspect of insurance premiums is smoking status. Smokers generally incur higher healthcare costs, which affects their insurance premiums. The code calculates the average insurance charges for smokers versus non-smokers using the groupby() method. Correlation Analysis (Age, BMI, and Charges): Another important factor in premium calculation is an individual’s health, typically represented by their BMI and age. The correlation matrix shows how age, BMI, and insurance charges relate to each other. Correlation coefficients help indicate whether there’s a positive or negative relationship between variables, and this matrix is generated using Pandas' corr() function.
  4. Data Visualization: Histogram of Insurance Charges: A histogram shows the distribution of the charges column. This visualization helps understand the spread of insurance premiums across the population, highlighting any potential outliers or skewness in the data. Bar Plot for Average Charges by Smoking Status: A bar chart is generated using Seaborn's barplot() function to visualize the difference in average insurance charges between smokers and non-smokers. Heatmap for Correlation: A heatmap visualizes the correlation between age, BMI, and charges. This provides a clearer understanding of the strength of relationships between these features.

Results and Insights:

  1. Average Insurance Charges by Smoking Status: The bar plot showing average insurance charges for smokers vs non-smokers is expected to demonstrate a significant difference. Smokers typically face much higher premiums due to the increased health risks associated with smoking. Based on the dataset: Smokers will likely have a higher average insurance charge compared to non-smokers. This aligns with common industry practices, where insurance companies charge higher premiums to individuals who engage in risky behaviors, such as smoking.
  2. Correlation Between Age, BMI, and Charges: The correlation matrix is likely to reveal positive correlations between age, BMI, and insurance charges. Here’s what the matrix may indicate: Age and Charges: Older individuals often face higher health insurance premiums because they are statistically more likely to incur medical expenses. This should result in a positive correlation between age and charges. BMI and Charges: Higher BMI can be an indicator of health risks like obesity, which also drives up healthcare costs. Therefore, a positive correlation between BMI and insurance charges is expected. Age and BMI: The relationship between age and BMI is not as strongly correlated as charges, but any correlation could provide insights into general health trends among older individuals.
  3. Distribution of Insurance Charges: The histogram of insurance charges is likely to show a skewed distribution, with a majority of individuals paying moderate premiums and a smaller number paying significantly higher amounts. The higher premiums may correspond to individuals with multiple risk factors (e.g., smokers, higher BMI, or older age).

About

This Python script analyzes a publicly available insurance dataset to predict the average insurance premium based on different customer features such as age, BMI, and smoking status. It demonstrates data cleaning, aggregation, analysis, and visualization techniques using Python libraries.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages