Purpose of EDA

The purpose of exploratory data analysis (EDA) on this dataset is to:

  • Understand demographics (age, gender) of patients
  • Identify cost drivers (treatment type, department, doctor)
  • Evaluate hospital stay durations and their effect on cost/recovery
  • Analyze recovery score patterns across departments and treatments
  • Detect outliers or anomalies in treatment costs and stay duration

Dataset Link: https://colorstech.net/practice-datasets/hospital-patient-treatment-dataset-for-analysis/


✅ Steps for EDA using Pandas, Seaborn & Plotly

Here’s a structured EDA process:


1. Initial Setup

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Load dataset
df = pd.read_csv('hospital_patient_treatment_dataset.csv')

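# Check basic info
df.info()             # prints column dtypes and non-null counts itself (returns None)
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # first five rows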

2. Check for Missing & Duplicate Data

# Missing values
print(df.isnull().sum())

# Duplicates
print(df.duplicated().sum())
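
Counting missing values and duplicates is only the first half of this step; the usual follow-up is to decide how to handle them. The sketch below shows one possible approach on a small made-up frame: report the share of missing values per column, drop exact duplicate rows, and fill numeric gaps with the column median. The values and the `Patient ID` column are assumptions for illustration, not taken from the real dataset, and median imputation is just one option among several.

```python
import pandas as pd

# Toy frame with made-up values; "Patient ID" is a hypothetical column.
toy = pd.DataFrame({
    "Patient ID": [1, 2, 2, 3, 4],
    "Age": [34, 45, 45, None, 61],
    "Treatment Cost": [1200.0, 800.0, 800.0, 950.0, None],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Drop exact duplicate rows (keeps the first occurrence)
cleaned = toy.drop_duplicates().copy()

# Fill numeric gaps with the column median (one simple choice, not the only one)
for col in ["Age", "Treatment Cost"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```

On the real data, the same calls would be applied to `df` after checking which columns actually contain gaps.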

3. Univariate Analysis

a. Categorical Variables

sns.countplot(x='Gender', data=df)
plt.title("Gender Distribution")
plt.show()

sns.countplot(y='Department', data=df, order=df['Department'].value_counts().index)
plt.title("Patients per Department")
plt.show()

b. Numerical Variables

sns.histplot(df['Age'], kde=True)
plt.title("Age Distribution")
plt.show()

sns.boxplot(y='Treatment Cost', data=df)
plt.title("Treatment Cost Distribution")
plt.show()

# Plotly for interactive recovery score
fig = px.histogram(df, x='Recovery Score', nbins=20, title='Recovery Score Distribution')
fig.show()  # needed when running as a script; notebooks render the figure automatically
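
The box plot of Treatment Cost shows outliers visually, but it does not tell you which rows they are. A minimal sketch of the 1.5 × IQR rule (the same rule seaborn's box plot uses for its whiskers by default) is given below. The helper name `iqr_outliers` and the sample costs are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up costs; the last value is deliberately extreme
costs = pd.Series([900, 1000, 1100, 1200, 1300, 9000], name="Treatment Cost")
mask = iqr_outliers(costs)
print(costs[mask])
```

On the real data, something like `df[iqr_outliers(df['Treatment Cost'])]` would list the flagged rows, and the same helper could be applied to `Hospital Stay (Days)`.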

4. Bivariate Analysis

a. Cost vs Stay

sns.scatterplot(x='Hospital Stay (Days)', y='Treatment Cost', hue='Gender', data=df)
plt.title("Cost vs Stay Duration")
plt.show()

b. Recovery by Department

plt.figure(figsize=(12,6))
sns.boxplot(x='Department', y='Recovery Score', data=df)
plt.xticks(rotation=45)
plt.title("Recovery Score by Department")
plt.show()

c. Cost by Treatment Type (Plotly)

fig = px.box(df, x='Treatment Type', y='Treatment Cost', color='Treatment Type',
             title='Treatment Cost by Type')
fig.show()

5. Multivariate Analysis

a. Heatmap of correlations

sns.heatmap(df[['Age', 'Treatment Cost', 'Hospital Stay (Days)', 'Recovery Score']].corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()

b. Age vs Recovery Score by Gender

# Note: trendline='ols' requires the statsmodels package to be installed
fig = px.scatter(df, x='Age', y='Recovery Score', color='Gender', trendline='ols',
                 title='Age vs Recovery Score (by Gender)')
fig.show()

6. Group-wise Aggregations

# Average cost per department, highest first
print(df.groupby('Department')['Treatment Cost'].mean().sort_values(ascending=False))

# Average recovery score by doctor, highest first
print(df.groupby('Doctor Name')['Recovery Score'].mean().sort_values(ascending=False))
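
A mean on its own can mislead when a group has very few rows, for example a doctor with a single patient. One way to guard against that, sketched here on made-up data, is to aggregate the mean, median and count together and only compare groups above a minimum size. The doctor names, scores and the `min_patients` threshold are assumptions for illustration.

```python
import pandas as pd

# Made-up data; column names follow the ones used in this post
toy = pd.DataFrame({
    "Doctor Name": ["Dr. A", "Dr. A", "Dr. A", "Dr. B"],
    "Recovery Score": [7, 8, 9, 10],
})

# Mean, median and patient count per doctor, sorted by mean
summary = (
    toy.groupby("Doctor Name")["Recovery Score"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)

# Keep only doctors with enough patients for the average to be meaningful
min_patients = 3  # hypothetical threshold; pick one that suits the data
reliable = summary[summary["count"] >= min_patients]
print(reliable)
```

Here "Dr. B" tops the raw ranking on the strength of a single patient, and drops out once the minimum count is applied.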

7. Insights & Summary

After visualizing and analyzing the data, summarize your findings by answering questions such as:

  • Which departments are most costly?
  • Are older patients recovering slower?
  • Does any doctor have consistently high recovery scores?
  • Are longer stays always more expensive?
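
Some of these questions can be backed with a number as well as a chart. The following is a rough sketch, on made-up values, of two such checks: the mean recovery score per age band, and the Pearson correlation between stay length and cost. The age-band edges and all data values are assumptions chosen for illustration. Note that a high correlation suggests longer stays tend to cost more, but it does not by itself show that they always do.

```python
import pandas as pd

# Made-up values; column names follow the ones used in this post
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65, 75],
    "Recovery Score": [9, 8, 8, 7, 6, 5],
    "Hospital Stay (Days)": [2, 3, 5, 4, 8, 10],
    "Treatment Cost": [500, 700, 1500, 900, 2000, 2600],
})

# Are older patients recovering slower? Compare mean score per age band.
bands = pd.cut(toy["Age"], bins=[0, 40, 60, 120], labels=["<=40", "41-60", "61+"])
recovery_by_band = toy.groupby(bands, observed=True)["Recovery Score"].mean()
print(recovery_by_band)

# Are longer stays more expensive? Pearson correlation between stay and cost.
stay_cost_corr = toy["Hospital Stay (Days)"].corr(toy["Treatment Cost"])
print(round(stay_cost_corr, 2))
```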