Categories: Data Analytics

Health Insurance Records – Complete Python EDA Tutorial Using Google Colab

Step-by-Step Data Analysis with Seaborn

By Ankit Srivastava

In this tutorial, I will walk you step by step through Exploratory Data Analysis (EDA) in Python, using Google Colab and the Health Insurance Records dataset.

In our Power BI project, we analyzed Insurance_Cost as the target variable.
Now we will do the same analysis, this time using Python, Pandas, and Seaborn.

This is how real data analysts work:

Load data
Clean data
Create KPIs
Visualize patterns
Extract business insights

Let’s begin.

Get the Dataset here: https://www.kaggle.com/datasets/gdeepakreddy/insurance

🚀 Step 1: Open Google Colab

Go to https://colab.research.google.com
Click New Notebook
Rename it → Health_Insurance_EDA_Python.ipynb

📦 Step 2: Import Required Libraries

In the first cell, write:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10,6)

Why these libraries?

pandas → data handling
numpy → numerical operations
seaborn → statistical visualization
matplotlib → base plotting library

📂 Step 3: Upload Dataset in Colab

from google.colab import files
uploaded = files.upload()

Upload your Data.csv file.

Now load it:

df = pd.read_csv("Data.csv")
df.head()

🔎 Step 4: Initial Data Exploration

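df.info()                    # column types and non-null counts
display(df.describe())       # numeric summary
display(df.isnull().sum())   # missing values per column

A notebook cell shows only the value of its last expression, so display() is used here to make sure every result appears.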

Check:

Data types
Missing values
Numeric summary
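If you prefer one table instead of three separate outputs, the checks above can be combined. This is only a sketch: column_overview is a helper name I made up, and the toy DataFrame holds invented values just to show the result. In your notebook you would pass your own df.

```python
import pandas as pd


def column_overview(df):
    """Summarize dtype, missing values and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
        "unique": df.nunique(),
    })


# Toy data with invented values, only to demonstrate the helper.
toy = pd.DataFrame({
    "age": [25, 40, None, 31],
    "Gender": ["M", "F", "F", None],
})

overview = column_overview(toy)
print(overview)
```

Columns with a high missing_pct are the ones to look at before plotting anything.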

If needed:

df.columns = df.columns.str.strip()
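Column names in a downloaded CSV do not always match the names used in a tutorial (case and spacing can differ), so it may help to check them before plotting. Below is a minimal sketch: missing_columns is a hypothetical helper, and the toy DataFrame and the expected list are illustrative assumptions, not the real dataset schema.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Toy data with invented values; note the stray spaces around "age".
toy = pd.DataFrame({
    " age ": [25, 40],
    "bmi": [22.5, 30.1],
    "Insurance_Cost": [12000, 30000],
})
toy.columns = toy.columns.str.strip()

# Names the later plots rely on (adjust this list to your own file).
expected = ["age", "bmi", "Gender", "Insurance_Cost"]
absent = missing_columns(toy, expected)
print("Missing columns:", absent)
```

If the list is not empty, compare it with df.columns and rename the columns or adjust the plotting code accordingly.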

🎯 Step 5: Define Target Variable

Our target variable:

target = "Insurance_Cost"

📊 Step 6: Create 6 KPIs (Horizontal Table)

We will calculate:

Total Insurance Cost
Average Insurance Cost
Median Insurance Cost
Standard Deviation
Minimum Cost
Maximum Cost

total_cost = df[target].sum()
avg_cost = df[target].mean()
median_cost = df[target].median()
std_cost = df[target].std()
min_cost = df[target].min()
max_cost = df[target].max()

kpi_table = pd.DataFrame({
    "Total Cost": [total_cost],
    "Average Cost": [avg_cost],
    "Median Cost": [median_cost],
    "Std Deviation": [std_cost],
    "Minimum Cost": [min_cost],
    "Maximum Cost": [max_cost]
})

kpi_table

This will display KPIs horizontally in a single-row DataFrame.
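The same six KPIs can also be computed in one call with Series.agg. This is an alternative sketch, not a replacement for the code above: kpi_row is a name I introduced, and the toy Series contains invented numbers.

```python
import pandas as pd


def kpi_row(series):
    """Build a one-row KPI table from a numeric Series."""
    stats = series.agg(["sum", "mean", "median", "std", "min", "max"])
    # to_frame() gives one column of stats; .T turns it into one row.
    return stats.to_frame().T.round(2)


# Toy data with invented values, only to demonstrate the helper.
toy_cost = pd.Series([10000, 20000, 30000, 40000], name="Insurance_Cost")

toy_kpis = kpi_row(toy_cost)
print(toy_kpis)
```

In the notebook you would call kpi_row(df[target]). The row label is the Series name, which is handy if you later stack KPI rows for several columns.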

📈 Step 7: 12 Seaborn Visualizations

Now comes the exciting part.

1️⃣ Distribution of Insurance Cost

sns.histplot(df[target], kde=True)
plt.title("Distribution of Insurance Cost")
plt.show()

Insight: Check whether the distribution is skewed. A long tail on the right means a few records have very high costs.
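To back the visual impression with a number, you can compute the skewness with Series.skew(). In the sketch below, describe_skew is a hypothetical helper and the 0.5 cut-off is only a rule of thumb that I am assuming here, not a fixed standard. The toy values are invented.

```python
import pandas as pd


def describe_skew(series, threshold=0.5):
    """Return the sample skewness and a rough verbal label."""
    skew = float(series.skew())
    if skew > threshold:
        label = "right-skewed"
    elif skew < -threshold:
        label = "left-skewed"
    else:
        label = "roughly symmetric"
    return skew, label


# Toy data with invented values: one very large cost creates a right tail.
tailed = pd.Series([1, 1, 2, 2, 3, 20])
balanced = pd.Series([1, 2, 3, 4, 5])

tailed_skew, tailed_label = describe_skew(tailed)
balanced_skew, balanced_label = describe_skew(balanced)
print(tailed_label, balanced_label)
```

In the notebook, describe_skew(df[target]) gives the value for Insurance_Cost.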

2️⃣ Boxplot – Insurance Cost

sns.boxplot(x=df[target])
plt.title("Boxplot of Insurance Cost")
plt.show()

Helps detect outliers.
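By default the boxplot draws points beyond 1.5 times the interquartile range (IQR) as outliers. If you want those rows as data rather than dots, the same rule can be applied in pandas. This is a sketch with a hypothetical helper name, iqr_outliers, and invented toy values.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy data with invented values: 100 is far from the rest.
toy_cost = pd.Series([10, 12, 11, 13, 12, 11, 100])

outliers = iqr_outliers(toy_cost)
print(outliers)
```

In the notebook, iqr_outliers(df[target]) returns the flagged costs with their original row index, so you can inspect the full records with df.loc[...].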

3️⃣ Insurance Cost by Gender

sns.boxplot(x="Gender", y=target, data=df)
plt.title("Insurance Cost by Gender")
plt.show()

4️⃣ Insurance Cost by Exercise

sns.barplot(x="exercise", y=target, data=df)
plt.title("Average Insurance Cost by Exercise")
plt.show()

5️⃣ Insurance Cost by Alcohol

sns.barplot(x="Alcohol", y=target, data=df)
plt.title("Insurance Cost by Alcohol Consumption")
plt.show()

6️⃣ Insurance Cost by Cholesterol Level

sns.barplot(x="cholesterol_level", y=target, data=df)
plt.title("Insurance Cost by Cholesterol Level")
plt.xticks(rotation=45)
plt.show()

7️⃣ Insurance Cost by Covered by Other Company

sns.barplot(x="Covered_by_Other", y=target, data=df)
plt.title("Insurance Cost by Coverage Status")
plt.show()

8️⃣ Insurance Cost by Location

sns.barplot(x="Location", y=target, data=df)
plt.title("Insurance Cost by Location")
plt.xticks(rotation=45)
plt.show()
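The bar charts in plots 4 to 8 show the mean cost per category (the mean is Seaborn's default estimator for barplot). To read the exact numbers and the group sizes, a groupby table is useful. The sketch below uses a hypothetical helper, cost_by_group, and a toy DataFrame whose category labels and costs are invented.

```python
import pandas as pd


def cost_by_group(df, group_col, target="Insurance_Cost"):
    """Count, mean and median of the target per category, highest mean first."""
    return (
        df.groupby(group_col)[target]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )


# Toy data with invented categories and costs.
toy = pd.DataFrame({
    "exercise": ["No", "No", "Moderate", "Moderate", "Extreme"],
    "Insurance_Cost": [30000, 34000, 20000, 24000, 15000],
})

summary = cost_by_group(toy, "exercise")
print(summary)
```

The count column matters: a tall bar based on a handful of rows is much weaker evidence than one based on thousands.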

9️⃣ Correlation Heatmap

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

Shows relationships between numeric features.
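When the heatmap has many cells, it can be easier to look only at the column for the target. Here is a small sketch that ranks numeric features by the strength of their correlation with the cost. corr_with_target is a name I made up, and the toy values are invented.

```python
import pandas as pd


def corr_with_target(df, target="Insurance_Cost"):
    """Pearson correlation of each numeric column with the target,
    ordered by absolute value (strongest first)."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


# Toy data with invented values; the text column is ignored by corr().
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [22, 30, 25, 28],
    "Gender": ["M", "F", "F", "M"],
    "Insurance_Cost": [10000, 20000, 30000, 40000],
})

ranked = corr_with_target(toy)
print(ranked)
```

Keep in mind that correlation only measures linear association and says nothing about cause.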

🔟 Pairplot

sns.pairplot(df, diag_kind="kde")
plt.show()

Helps visualize multi-variable relationships.

1️⃣1️⃣ BMI vs Insurance Cost (Regression)

sns.regplot(x="bmi", y=target, data=df)
plt.title("BMI vs Insurance Cost")
plt.show()

1️⃣2️⃣ Age vs Insurance Cost (Regression)

sns.regplot(x="age", y=target, data=df)
plt.title("Age vs Insurance Cost")
plt.show()
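regplot draws the fitted line but does not print its numbers. If you want the slope, intercept and correlation behind plots 11 and 12, NumPy can compute them. This is a sketch under my own assumptions: linear_fit is a hypothetical helper, it fits a plain least-squares line with np.polyfit, and the toy data is invented.

```python
import numpy as np
import pandas as pd


def linear_fit(df, x_col, y_col="Insurance_Cost"):
    """Least-squares line y = slope * x + intercept, plus Pearson r."""
    clean = df[[x_col, y_col]].dropna()  # drop rows missing either value
    slope, intercept = np.polyfit(clean[x_col], clean[y_col], deg=1)
    r = np.corrcoef(clean[x_col], clean[y_col])[0, 1]
    return float(slope), float(intercept), float(r)


# Toy data with invented values that lie exactly on a straight line.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "Insurance_Cost": [12000, 17000, 22000, 27000],
})

slope, intercept, r = linear_fit(toy, "age")
print(f"slope={slope:.1f}, intercept={intercept:.1f}, r={r:.3f}")
```

The slope reads as the average change in cost for a one-unit increase in the feature, for example per year of age.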

📌 Step 8: Extract Key Insights

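After running all visuals, summarize what the charts actually show. Treat the points below as hypotheses to confirm against your own output:

Whether the cost distribution is right-skewed
Whether age is positively correlated with cost
How strongly BMI is correlated with cost
Which cholesterol ranges show a higher average cost
Whether cost differs visibly by gender
How cost changes across exercise levels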

EDA helps us understand patterns before modeling.

🎓 Why Seaborn?

Seaborn is powerful because:

Built-in statistical plotting
Clean themes
Easy aggregation
Regression plots
Heatmaps

It is perfect for EDA.

🏁 Final Thoughts from Ankit Srivastava

In this tutorial, we transformed a simple CSV file into:

6 Key Performance Indicators
12 powerful statistical visualizations
Correlation analysis
Behavioral insights

In the real world, Python EDA is the foundation of:

Machine Learning
Risk modeling
Pricing optimization
Healthcare analytics

Remember:

EDA is not about creating many charts.
EDA is about understanding data deeply.

If you master:

Pandas
Seaborn
Statistical thinking
Insight extraction

You can analyze any dataset confidently.

🚀 What Next?

In the next tutorial, we can:

Engineer features (Age Groups, BMI Categories)
Encode categorical variables
Build a Linear Regression model
Predict Insurance Cost
Compare model performance

Let me know if you want Part 2 – Predictive Modeling in Python.

—
Ankit Srivastava
Digital Project Manager | Data Analytics Mentor | IT Trainer