Step-by-Step Data Analysis with Seaborn
By Ankit Srivastava
In this tutorial, I will walk you through Exploratory Data Analysis (EDA) in Python, step by step, using Google Colab and the Health Insurance Records dataset.
In our Power BI project, we analyzed Insurance_Cost as the target variable.
Now we will do the same analysis using Python, pandas, and Seaborn.
This is how real data analysts work:
- Load data
- Clean data
- Create KPIs
- Visualize patterns
- Extract business insights
Let’s begin.
Get the Dataset here: https://www.kaggle.com/datasets/gdeepakreddy/insurance
🚀 Step 1: Open Google Colab
- Go to https://colab.research.google.com
- Click New Notebook
- Rename it → Health_Insurance_EDA_Python.ipynb
📦 Step 2: Import Required Libraries
In the first cell, write:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10,6)
Why these libraries?
- pandas → data handling
- numpy → numerical operations
- seaborn → statistical visualization
- matplotlib → base plotting library
📂 Step 3: Upload Dataset in Colab
from google.colab import files
uploaded = files.upload()
Upload your Data.csv file.
Now load it:
df = pd.read_csv("Data.csv")
df.head()
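If the file reaches Colab under a different name (a re-uploaded file may be stored as something like "Data (1).csv"), the hard-coded `pd.read_csv("Data.csv")` will not pick it up. A minimal sketch of a workaround, assuming `uploaded` is the dictionary returned by `files.upload()` (file name mapped to raw bytes). The helper name `read_uploaded_csv` and the sample bytes are made up for illustration:

```python
import io
import pandas as pd

def read_uploaded_csv(uploaded):
    """Load the first .csv found in a {filename: bytes} mapping."""
    for name, content in uploaded.items():
        if name.lower().endswith(".csv"):
            return pd.read_csv(io.BytesIO(content))
    raise ValueError("No .csv file found in the upload")

# Stand-in for what files.upload() returns; the bytes are made-up sample data.
fake_upload = {"Data (1).csv": b"age,Insurance_Cost\n30,1200\n45,2500\n"}
demo = read_uploaded_csv(fake_upload)
print(demo)
```

In the notebook you would call `df = read_uploaded_csv(uploaded)` instead of passing the fake dictionary.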
🔎 Step 4: Initial Data Exploration
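df.info()                 # dtypes and non-null counts (prints on its own)
print(df.describe())      # numeric summary; print() so it shows mid-cell
print(df.isnull().sum())  # missing values per column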
Check:
- Data types
- Missing values
- Numeric summary
If the column names contain stray leading or trailing spaces, strip them:
df.columns = df.columns.str.strip()
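The plots in Step 7 refer to columns by exact name, and pandas column lookups are case-sensitive, so `Insurance_Cost` and `insurance_cost` are different columns. A quick check before plotting, assuming the column names used in this tutorial; adjust the list to whatever `df.columns` shows for your download. `missing_columns` is a helper name invented for this sketch:

```python
import pandas as pd

# Column names as used in this tutorial; your CSV may spell them differently.
EXPECTED = ["Insurance_Cost", "Gender", "exercise", "Alcohol",
            "cholesterol_level", "Covered_by_Other", "Location", "bmi", "age"]

def missing_columns(df, expected=EXPECTED):
    """Return the expected column names that are absent from df."""
    present = set(df.columns.str.strip())
    return [col for col in expected if col not in present]

# Tiny made-up frame: note the trailing space and the lowercase target name.
demo = pd.DataFrame({"age ": [30], "bmi": [22.5], "insurance_cost": [1200]})
print(missing_columns(demo))
```

An empty list from `missing_columns(df)` means every plot below will find its column. Any name it returns needs to be renamed in `df` or changed in the plotting code.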
🎯 Step 5: Define Target Variable
Our target variable:
target = "Insurance_Cost"
📊 Step 6: Create 6 KPIs (Horizontal Table)
We will calculate:
- Total Insurance Cost
- Average Insurance Cost
- Median Insurance Cost
- Standard Deviation
- Minimum Cost
- Maximum Cost
total_cost = df[target].sum()
avg_cost = df[target].mean()
median_cost = df[target].median()
std_cost = df[target].std()
min_cost = df[target].min()
max_cost = df[target].max()
kpi_table = pd.DataFrame({
"Total Cost": [total_cost],
"Average Cost": [avg_cost],
"Median Cost": [median_cost],
"Std Deviation": [std_cost],
"Minimum Cost": [min_cost],
"Maximum Cost": [max_cost]
})
kpi_table
This will display KPIs horizontally in a single-row DataFrame.
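The same six numbers can also be computed in one call with `Series.agg`. This is an alternative sketch, not a replacement for the step above. The helper name `kpi_row` and the sample costs are made up, and the column labels come out as the aggregation names (`sum`, `mean`, ...) rather than the friendlier ones used in `kpi_table`:

```python
import pandas as pd

def kpi_row(series):
    """Return sum, mean, median, std, min and max as a one-row DataFrame."""
    stats = series.agg(["sum", "mean", "median", "std", "min", "max"])
    return stats.to_frame().T.round(2)

# Made-up costs, only to show the shape of the result.
demo_costs = pd.Series([1000, 2000, 3000, 6000], name="Insurance_Cost")
demo_kpis = kpi_row(demo_costs)
print(demo_kpis)
```

On the real data this would be `kpi_row(df[target])`.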
📈 Step 7: 12 Seaborn Visualizations
Now comes the exciting part.
1️⃣ Distribution of Insurance Cost
sns.histplot(df[target], kde=True)
plt.title("Distribution of Insurance Cost")
plt.show()
Insight: Check whether the distribution is skewed, for example a long right tail caused by a few very expensive policies.
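Skewness can also be read as a number rather than judged by eye. A small sketch using `Series.skew()`; the 0.5 cut-off is only a common rule of thumb, and the helper name `describe_skew` and the sample values are assumptions made for this example:

```python
import pandas as pd

def describe_skew(series, threshold=0.5):
    """Return the sample skewness and a rough verbal label."""
    skew = series.skew()
    if skew > threshold:
        label = "right-skewed"
    elif skew < -threshold:
        label = "left-skewed"
    else:
        label = "roughly symmetric"
    return skew, label

# Made-up costs with one very large value pulling the tail to the right.
demo_costs = pd.Series([1000, 1200, 1300, 1500, 9000])
skew_value, skew_label = describe_skew(demo_costs)
print(round(skew_value, 2), skew_label)
```

On the real data: `describe_skew(df[target])`.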
2️⃣ Boxplot – Insurance Cost
sns.boxplot(x=df[target])
plt.title("Boxplot of Insurance Cost")
plt.show()
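The boxplot helps detect outliers: with Seaborn's default settings, points drawn beyond the whiskers lie more than 1.5 × IQR from the box.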
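To count the outliers instead of eyeballing them, the 1.5 × IQR rule that boxplot whiskers use by default can be applied directly. A minimal sketch; `iqr_outliers` is a hypothetical helper name and the sample values are made up:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up costs: quartiles are 1200 and 1500, so the upper fence is 1950.
demo_costs = pd.Series([1000, 1200, 1300, 1500, 9000])
outliers = iqr_outliers(demo_costs)
print(outliers.tolist())
```

On the real data, `len(iqr_outliers(df[target]))` gives the number of flagged records. Whether to keep or remove them is a judgment call that depends on the business question.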
3️⃣ Insurance Cost by Gender
sns.boxplot(x="Gender", y=target, data=df)
plt.title("Insurance Cost by Gender")
plt.show()
4️⃣ Insurance Cost by Exercise
sns.barplot(x="exercise", y=target, data=df)
plt.title("Average Insurance Cost by Exercise")
plt.show()
5️⃣ Insurance Cost by Alcohol
sns.barplot(x="Alcohol", y=target, data=df)
plt.title("Insurance Cost by Alcohol Consumption")
plt.show()
6️⃣ Insurance Cost by Cholesterol Level
sns.barplot(x="cholesterol_level", y=target, data=df)
plt.title("Insurance Cost by Cholesterol Level")
plt.xticks(rotation=45)
plt.show()
7️⃣ Insurance Cost by Covered by Other Company
sns.barplot(x="Covered_by_Other", y=target, data=df)
plt.title("Insurance Cost by Coverage Status")
plt.show()
8️⃣ Insurance Cost by Location
sns.barplot(x="Location", y=target, data=df)
plt.title("Insurance Cost by Location")
plt.xticks(rotation=45)
plt.show()
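By default, `sns.barplot` draws the mean of the y variable for each category, with an error bar around it. The numbers behind any of the bar charts above can be checked with a plain `groupby`. A sketch, assuming the tutorial's column names; `group_summary` and the category values in the demo frame are invented for illustration:

```python
import pandas as pd

def group_summary(df, by, target):
    """Count, mean and median of target per category, highest mean first."""
    return (df.groupby(by)[target]
              .agg(["count", "mean", "median"])
              .sort_values("mean", ascending=False))

# Made-up rows; the real exercise categories may be named differently.
demo = pd.DataFrame({
    "exercise": ["No", "No", "Moderate", "Moderate", "Extreme"],
    "Insurance_Cost": [3000, 5000, 2000, 4000, 1000],
})
summary = group_summary(demo, "exercise", "Insurance_Cost")
print(summary)
```

The `count` column matters: a tall bar built from a handful of rows is much weaker evidence than one built from thousands.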
9️⃣ Correlation Heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
The heatmap shows pairwise linear (Pearson) correlations between the numeric features; values near +1 or -1 indicate a strong relationship.
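Since the target is what we care about, it often helps to pull its column out of the correlation matrix and rank the features by absolute correlation. A minimal sketch; the helper name `target_correlations` and the demo values are assumptions, not part of the dataset:

```python
import pandas as pd

def target_correlations(df, target):
    """Correlation of each numeric column with target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)

# Made-up rows: age rises in step with cost, bmi does not.
demo = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [30, 22, 27, 25],
    "Gender": ["F", "M", "F", "M"],
    "Insurance_Cost": [1000, 2000, 3000, 4000],
})
ranked = target_correlations(demo, "Insurance_Cost")
print(ranked)
```

On the real data: `target_correlations(df, target)`. Text columns such as Gender are skipped because of `numeric_only=True`.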
🔟 Pairplot
sns.pairplot(df, diag_kind="kde")
plt.show()
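The pairplot draws a scatter plot for every pair of numeric columns, which helps visualize multi-variable relationships but can be slow on wide or large datasets.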
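If the full pairplot takes too long or is too dense to read, one option is to plot only the few numeric columns most correlated with the target, on a random sample of rows. The sketch below prepares such a frame with pandas only; `pairplot_frame`, its default limits and the random demo data are all assumptions made for this example:

```python
import numpy as np
import pandas as pd

def pairplot_frame(df, target, top_k=4, max_rows=2000, seed=0):
    """Keep the top_k numeric columns most correlated with target (plus
    target itself) and randomly sample at most max_rows rows."""
    numeric = df.select_dtypes("number")
    strength = numeric.corr()[target].drop(target).abs()
    keep = strength.nlargest(top_k).index.tolist() + [target]
    subset = numeric[keep]
    if len(subset) > max_rows:
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset

# Made-up data: "a" drives the target, "b" and "c" are pure noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo["Insurance_Cost"] = 3 * demo["a"] + rng.normal(scale=0.1, size=50)
small = pairplot_frame(demo, "Insurance_Cost", top_k=2, max_rows=20)
print(small.shape)
```

In the notebook the result can be passed straight to Seaborn: `sns.pairplot(pairplot_frame(df, target), diag_kind="kde")`.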
1️⃣1️⃣ BMI vs Insurance Cost (Regression)
sns.regplot(x="bmi", y=target, data=df)
plt.title("BMI vs Insurance Cost")
plt.show()
1️⃣2️⃣ Age vs Insurance Cost (Regression)
sns.regplot(x="age", y=target, data=df)
plt.title("Age vs Insurance Cost")
plt.show()
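`sns.regplot` draws a straight-line fit by default but does not print its slope. To put a number on the trend, the same kind of least-squares line can be fitted with `np.polyfit`. A sketch under the tutorial's column names; `linear_trend` is a hypothetical helper and the demo values are made up:

```python
import numpy as np
import pandas as pd

def linear_trend(df, x, y):
    """Least-squares slope and intercept of y on x, plus Pearson r."""
    clean = df[[x, y]].dropna()
    slope, intercept = np.polyfit(clean[x], clean[y], deg=1)
    r = clean[x].corr(clean[y])
    return slope, intercept, r

# Made-up rows lying exactly on cost = 50 * age + 500.
demo = pd.DataFrame({"age": [20, 30, 40, 50],
                     "Insurance_Cost": [1500, 2000, 2500, 3000]})
slope, intercept, r = linear_trend(demo, "age", "Insurance_Cost")
print(round(slope, 2), round(intercept, 2), round(r, 2))
```

On the real data, `linear_trend(df, "age", target)` returns a slope that reads as "extra cost per additional year of age", which is easier to communicate than a scatter plot alone.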
📌 Step 8: Extract Key Insights
After running all visuals, summarize what you found, for example:
- The cost distribution is likely right-skewed
- Age is positively correlated with cost
- BMI is moderately correlated with cost
- Certain cholesterol ranges show higher cost
- Differences between genders are visible
- Cost varies with exercise level in a way worth a closer look
EDA helps us understand patterns before modeling.
🎓 Why Seaborn?
Seaborn is powerful because:
- Built-in statistical plotting
- Clean themes
- Easy aggregation
- Regression plots
- Heatmaps
It is perfect for EDA.
🏁 Final Thoughts from Ankit Srivastava
In this tutorial, we transformed a simple CSV file into:
- 6 Key Performance Indicators
- 12 powerful statistical visualizations
- Correlation analysis
- Behavioral insights
In the real world, Python EDA is the foundation of:
- Machine Learning
- Risk modeling
- Pricing optimization
- Healthcare analytics
Remember:
EDA is not about creating many charts.
EDA is about understanding data deeply.
If you master:
- Pandas
- Seaborn
- Statistical thinking
- Insight extraction
You can analyze any dataset confidently.
🚀 What Next?
In the next tutorial, we can:
- Engineer new features (Age Groups, BMI Categories)
- Encode categorical variables
- Build a Linear Regression Model
- Predict Insurance Cost
- Compare Model Performance
Let me know if you want Part 2 – Predictive Modeling in Python.
—
Ankit Srivastava
Digital Project Manager | Data Analytics Mentor | IT Trainer
