Purpose of EDA
The purpose of EDA on this dataset is to:
- Understand demographics (age, gender) of patients
- Identify cost drivers (treatment type, department, doctor)
- Evaluate hospital stay durations and their effect on cost/recovery
- Analyze recovery score patterns across departments and treatments
- Detect outliers or anomalies in treatment costs and stay duration
Dataset Link: https://colorstech.net/practice-datasets/hospital-patient-treatment-dataset-for-analysis/
✅ Steps for EDA using Pandas, Seaborn & Plotly
Here’s a structured EDA process:
1. Initial Setup
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# Load dataset
df = pd.read_csv('hospital_patient_treatment_dataset.csv')
# Check basic info
df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.describe())
print(df.head())
2. Check for Missing & Duplicate Data
# Missing values
print(df.isnull().sum())
# Duplicates
print(df.duplicated().sum())
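Raw counts are easier to act on when shown as a share of the rows. A minimal sketch, assuming a hypothetical helper named `quality_report` (it is not part of pandas), that tabulates missing values per column and returns the duplicate-row count alongside:

```python
import pandas as pd


def quality_report(df: pd.DataFrame):
    """Hypothetical helper: per-column missing-value table plus duplicate-row count."""
    report = pd.DataFrame({
        "missing": df.isnull().sum(),
        # Mean of a boolean mask is the fraction of missing entries per column.
        "missing_pct": (df.isnull().mean() * 100).round(2),
    })
    # Show the columns with the most gaps first.
    report = report.sort_values("missing", ascending=False)
    duplicate_rows = int(df.duplicated().sum())
    return report, duplicate_rows
```

Calling `report, duplicate_rows = quality_report(df)` on the frame loaded in step 1 gives one table to scan before deciding whether to drop or impute anything.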
3. Univariate Analysis
a. Categorical Variables
sns.countplot(x='Gender', data=df)
plt.title("Gender Distribution")
plt.show()
sns.countplot(y='Department', data=df, order=df['Department'].value_counts().index)
plt.title("Patients per Department")
plt.show()
b. Numerical Variables
sns.histplot(df['Age'], kde=True)
plt.title("Age Distribution")
plt.show()
sns.boxplot(y='Treatment Cost', data=df)
plt.title("Treatment Cost Distribution")
plt.show()
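The box plot shows outliers visually; to list the actual rows, one option is the same 1.5 × IQR rule that box plot whiskers use by default. The sketch below assumes a hypothetical helper named `iqr_outliers`; the multiplier `k=1.5` is a common convention, not a value taken from the dataset.

```python
import pandas as pd


def iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical helper: rows whose `column` value falls outside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only rows below the lower fence or above the upper fence.
    mask = (df[column] < lower) | (df[column] > upper)
    return df[mask]
```

For example, `iqr_outliers(df, 'Treatment Cost')` and `iqr_outliers(df, 'Hospital Stay (Days)')` would return the candidate anomalies for the two columns named in the purpose list. Whether a flagged row is an error or a genuinely expensive case still needs manual review.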
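# Interactive Plotly histogram of recovery scores
fig = px.histogram(df, x='Recovery Score', nbins=20, title='Recovery Score Distribution')
fig.show()  # needed when running as a script; notebooks render the figure automatically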
4. Bivariate Analysis
a. Cost vs Stay
sns.scatterplot(x='Hospital Stay (Days)', y='Treatment Cost', hue='Gender', data=df)
plt.title("Cost vs Stay Duration")
plt.show()
b. Recovery by Department
plt.figure(figsize=(12,6))
sns.boxplot(x='Department', y='Recovery Score', data=df)
plt.xticks(rotation=45)
plt.title("Recovery Score by Department")
plt.show()
c. Cost by Treatment Type (Plotly)
fig = px.box(df, x='Treatment Type', y='Treatment Cost', color='Treatment Type',
             title='Treatment Cost by Type')
fig.show()
5. Multivariate Analysis
a. Heatmap of correlations
numeric_cols = ['Age', 'Treatment Cost', 'Hospital Stay (Days)', 'Recovery Score']
sns.heatmap(df[numeric_cols].corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()
b. Age vs Recovery Score by Gender
# trendline='ols' requires the statsmodels package to be installed
fig = px.scatter(df, x='Age', y='Recovery Score', color='Gender', trendline='ols',
                 title='Age vs Recovery Score (by Gender)')
fig.show()
6. Group-wise Aggregations
# Average cost per department
print(df.groupby('Department')['Treatment Cost'].mean().sort_values(ascending=False))
# Average recovery score by doctor
print(df.groupby('Doctor Name')['Recovery Score'].mean().sort_values(ascending=False))
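A mean on its own can be misleading when a doctor has treated only a handful of patients. As a sketch, the hypothetical helper `recovery_by_doctor` below adds the patient count and standard deviation next to the mean; the `min_patients=5` cutoff is an arbitrary assumption to adjust for the dataset's size.

```python
import pandas as pd


def recovery_by_doctor(df: pd.DataFrame, min_patients: int = 5) -> pd.DataFrame:
    """Hypothetical helper: mean, spread, and patient count of recovery scores per doctor."""
    stats = df.groupby("Doctor Name")["Recovery Score"].agg(
        mean_score="mean",
        std_score="std",
        patients="count",
    )
    # Drop doctors with too few patients for the mean to be meaningful.
    stats = stats[stats["patients"] >= min_patients]
    return stats.sort_values("mean_score", ascending=False)
```

A high `mean_score` combined with a low `std_score` is one reasonable reading of "consistently high" recovery scores.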
7. Insights & Summary
After visualizing and analyzing the data, summarize the answers to questions such as:
- Which departments are most costly?
- Do older patients recover more slowly?
- Does any doctor have consistently high recovery scores?
- Are longer stays always more expensive?
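Some of these questions can be given a first numeric answer before writing the summary. The sketch below assumes a hypothetical helper named `summary_checks` and uses Pearson correlation (the pandas default), which only captures linear relationships, so treat the numbers as a starting point rather than a conclusion.

```python
import pandas as pd


def summary_checks(df: pd.DataFrame) -> dict:
    """Hypothetical helper: quick numeric checks for the summary questions."""
    cost_by_dept = df.groupby("Department")["Treatment Cost"].mean()
    return {
        # Department with the highest average treatment cost.
        "most_costly_department": cost_by_dept.idxmax(),
        # Negative values suggest recovery scores fall as age rises.
        "age_vs_recovery_corr": df["Age"].corr(df["Recovery Score"]),
        # Values near 1 suggest cost rises steadily with stay length.
        "stay_vs_cost_corr": df["Hospital Stay (Days)"].corr(df["Treatment Cost"]),
    }
```

A strong positive `stay_vs_cost_corr` still does not show that longer stays are *always* more expensive; the scatter plot in step 4 is the place to look for exceptions.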
