Step-by-Step Data Analysis with Seaborn

By Ankit Srivastava

In this tutorial, I will walk you step by step through Exploratory Data Analysis (EDA) in Python, using Google Colab and the Health Insurance Records dataset.

In our Power BI project, we analyzed Insurance_Cost as the target variable.
Now we will do the same analysis, this time using Python, Pandas, and Seaborn.

This is how real data analysts work:

  1. Load data
  2. Clean data
  3. Create KPIs
  4. Visualize patterns
  5. Extract business insights

Let’s begin.

Get the dataset here: https://www.kaggle.com/datasets/gdeepakreddy/insurance


🚀 Step 1: Open Google Colab

  1. Go to https://colab.research.google.com
  2. Click New Notebook
  3. Rename it → Health_Insurance_EDA_Python.ipynb

📦 Step 2: Import Required Libraries

In the first cell, write:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10,6)

Why these libraries?

  • pandas → data handling
  • numpy → numerical operations
  • seaborn → statistical visualization
  • matplotlib → base plotting library

📂 Step 3: Upload Dataset in Colab

from google.colab import files
uploaded = files.upload()

Upload your Data.csv file.

Now load it:

df = pd.read_csv("Data.csv")
df.head()

🔎 Step 4: Initial Data Exploration

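# A notebook cell only shows its last expression automatically,
# so wrap the tables in display() (built into Colab/Jupyter).
df.info()                    # column types and non-null counts
display(df.describe())       # numeric summary
display(df.isnull().sum())   # missing values per column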

Check:

  • Data types
  • Missing values
  • Numeric summary

If the column names contain stray spaces, strip them:

df.columns = df.columns.str.strip()
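To make the missing-value check easier to read, one option is to express it as a percentage per column. This is a minimal sketch on a small made-up DataFrame (the name sample and its values are assumptions, not the real dataset); on your notebook you would apply the same line to df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset (values are made up)
sample = pd.DataFrame({
    "age": [25, 40, np.nan, 52],
    "bmi": [22.1, np.nan, np.nan, 30.4],
    "Insurance_Cost": [1200.0, 3400.0, 2100.0, 5600.0],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True,
# so multiplying by 100 gives the percentage of missing values per column
missing_pct = sample.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```

Columns at the top of this list are the ones to look at first when deciding whether to fill or drop missing values.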

🎯 Step 5: Define Target Variable

Our target variable:

target = "Insurance_Cost"

📊 Step 6: Create 6 KPIs (Horizontal Table)

We will calculate:

  1. Total Insurance Cost
  2. Average Insurance Cost
  3. Median Insurance Cost
  4. Standard Deviation
  5. Minimum Cost
  6. Maximum Cost

total_cost = df[target].sum()
avg_cost = df[target].mean()
median_cost = df[target].median()
std_cost = df[target].std()
min_cost = df[target].min()
max_cost = df[target].max()

kpi_table = pd.DataFrame({
    "Total Cost": [total_cost],
    "Average Cost": [avg_cost],
    "Median Cost": [median_cost],
    "Std Deviation": [std_cost],
    "Minimum Cost": [min_cost],
    "Maximum Cost": [max_cost]
})

kpi_table

This will display KPIs horizontally in a single-row DataFrame.
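As an alternative, the same six numbers can be computed in one call with Series.agg. This is only a sketch on a made-up four-row frame (sample and its values are assumptions); the column labels come out as the function names rather than the friendlier titles used above.

```python
import pandas as pd

# Hypothetical stand-in for df (values are made up)
sample = pd.DataFrame({"Insurance_Cost": [1000.0, 2000.0, 3000.0, 6000.0]})
target = "Insurance_Cost"

# agg with a list of function names returns one value per function
kpis = sample[target].agg(["sum", "mean", "median", "std", "min", "max"])

# Turn the Series into a single-row DataFrame so it displays horizontally
kpi_row = kpis.to_frame().T

print(kpi_row)
```

The explicit version with named variables is easier to read for beginners; the agg version is shorter when you want to repeat the summary for several columns.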


📈 Step 7: 12 Seaborn Visualizations

Now comes the exciting part.


1️⃣ Distribution of Insurance Cost

sns.histplot(df[target], kde=True)
plt.title("Distribution of Insurance Cost")
plt.show()

Insight: Check whether the distribution is skewed. A long tail on the right means a small number of customers have very high costs.
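To put a number on the skew instead of judging it by eye, one option is the pandas Series.skew() method. The sketch below uses a small made-up series (costs is an assumption, not the real column); a positive value, together with a mean above the median, points to a right-skewed distribution.

```python
import pandas as pd

# Hypothetical cost values with one very large claim (made up)
costs = pd.Series([1000, 1200, 1500, 1800, 2000, 9000], dtype=float)

# Sample skewness: positive means a longer right tail
skewness = costs.skew()

# In a right-skewed distribution the mean is pulled above the median
mean_cost = costs.mean()
median_cost = costs.median()

print(f"skewness={skewness:.2f}, mean={mean_cost:.0f}, median={median_cost:.0f}")
```

On the real data you would run the same calls on df[target].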


2️⃣ Boxplot – Insurance Cost

sns.boxplot(x=df[target])
plt.title("Boxplot of Insurance Cost")
plt.show()

The boxplot helps detect outliers: they appear as individual points beyond the whiskers.
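If you want the outliers as rows rather than dots on a chart, the usual rule behind the whiskers can be sketched by hand. This assumes the boxplot's default whisker length of 1.5 × IQR, and the series costs is made up for illustration.

```python
import pandas as pd

# Hypothetical cost values with one very large claim (made up)
costs = pd.Series([1000, 1200, 1500, 1800, 2000, 9000], dtype=float)

# Quartiles and interquartile range
q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = costs[(costs < lower) | (costs > upper)]
print(outliers)
```

Flagged values are not automatically errors. In insurance data, very high costs are often genuine and worth investigating rather than deleting.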


3️⃣ Insurance Cost by Gender

sns.boxplot(x="Gender", y=target, data=df)
plt.title("Insurance Cost by Gender")
plt.show()

4️⃣ Insurance Cost by Exercise

sns.barplot(x="exercise", y=target, data=df)
plt.title("Average Insurance Cost by Exercise")
plt.show()
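By default, sns.barplot draws the mean of the y variable for each category, with error bars for the uncertainty around that mean. To see the numbers behind the bars, a groupby gives the same means as a table. This is a sketch on a made-up frame; the category labels in the exercise column are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for df (labels and values are made up)
sample = pd.DataFrame({
    "exercise": ["No", "No", "Moderate", "Moderate", "Extreme"],
    "Insurance_Cost": [4000.0, 5000.0, 3000.0, 3500.0, 2000.0],
})

# Mean cost per category (the bar heights) plus the number of rows behind each bar
group_means = sample.groupby("exercise")["Insurance_Cost"].agg(["mean", "count"])

print(group_means)
```

The count column matters: a bar based on only a few rows is much less reliable than one based on thousands. The same pattern works for the other categorical columns plotted below.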

5️⃣ Insurance Cost by Alcohol

sns.barplot(x="Alcohol", y=target, data=df)
plt.title("Insurance Cost by Alcohol Consumption")
plt.show()

6️⃣ Insurance Cost by Cholesterol Level

sns.barplot(x="cholesterol_level", y=target, data=df)
plt.title("Insurance Cost by Cholesterol Level")
plt.xticks(rotation=45)
plt.show()

7️⃣ Insurance Cost by Covered by Other Company

sns.barplot(x="Covered_by_Other", y=target, data=df)
plt.title("Insurance Cost by Coverage Status")
plt.show()

8️⃣ Insurance Cost by Location

sns.barplot(x="Location", y=target, data=df)
plt.title("Insurance Cost by Location")
plt.xticks(rotation=45)
plt.show()

9️⃣ Correlation Heatmap

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

The heatmap shows the pairwise correlations between numeric features.
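When the heatmap has many cells, it can help to pull out only the correlations with the target and sort them. A minimal sketch on a made-up frame (sample and its values are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for df (values are made up)
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [22, 27, 24, 30, 26],
    "Insurance_Cost": [1000, 2000, 3000, 4000, 5000],
})
target = "Insurance_Cost"

# Take the target's column of the correlation matrix, drop its
# correlation with itself, and sort from strongest positive downward
corr_with_target = (
    sample.corr(numeric_only=True)[target]
    .drop(target)
    .sort_values(ascending=False)
)

print(corr_with_target)
```

Keep in mind that this is Pearson correlation, so it only measures linear relationships between numeric columns.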


🔟 Pairplot

sns.pairplot(df, diag_kind="kde")
plt.show()

The pairplot helps visualize pairwise relationships between the numeric columns.


1️⃣1️⃣ BMI vs Insurance Cost (Regression)

sns.regplot(x="bmi", y=target, data=df)
plt.title("BMI vs Insurance Cost")
plt.show()

1️⃣2️⃣ Age vs Insurance Cost (Regression)

sns.regplot(x="age", y=target, data=df)
plt.title("Age vs Insurance Cost")
plt.show()
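The regplot line is a fitted straight line, but the chart does not print its slope. One way to get the numbers is np.polyfit with degree 1. This is a sketch on made-up arrays (age and cost here are assumptions, not the real columns), and it swaps in NumPy for the fitting step rather than reading values out of Seaborn.

```python
import numpy as np

# Hypothetical ages and costs (values are made up)
age = np.array([20, 30, 40, 50, 60], dtype=float)
cost = np.array([1100, 1900, 3100, 3900, 5000], dtype=float)

# Least-squares straight line: cost ≈ slope * age + intercept
slope, intercept = np.polyfit(age, cost, deg=1)

print(f"Each extra year of age adds about {slope:.0f} to the cost")
```

On the real data you would pass df["age"] and df[target] after dropping rows with missing values in either column.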

📌 Step 8: Extract Key Insights

After running all the visuals, summarize what you see:

  • The distribution is likely right-skewed
  • Age is positively correlated with cost
  • BMI is moderately correlated with cost
  • Certain cholesterol ranges show higher cost
  • Gender differences are visible
  • Exercise level shows a pattern worth examining

EDA helps us understand patterns before modeling.


🎓 Why Seaborn?

Seaborn is powerful because:

  • Built-in statistical plotting
  • Clean themes
  • Easy aggregation
  • Regression plots
  • Heatmaps

It is perfect for EDA.


🏁 Final Thoughts from Ankit Srivastava

In this tutorial, we transformed a simple CSV file into:

  • 6 Key Performance Indicators
  • 12 powerful statistical visualizations
  • Correlation analysis
  • Behavioral insights

In the real world, Python EDA is the foundation of:

  • Machine Learning
  • Risk modeling
  • Pricing optimization
  • Healthcare analytics

Remember:

EDA is not about creating many charts.
EDA is about understanding data deeply.

If you master:

  • Pandas
  • Seaborn
  • Statistical thinking
  • Insight extraction

you can analyze any dataset confidently.


🚀 What Next?

In the next tutorial, we can:

  • Engineer features (age groups, BMI categories)
  • Encode categorical variables
  • Build a linear regression model
  • Predict insurance cost
  • Compare model performance

Let me know if you want Part 2 – Predictive Modeling in Python.


Ankit Srivastava
Digital Project Manager | Data Analytics Mentor | IT Trainer