Bank Customer Dataset EDA with Python Libraries
Introduction
Exploratory Data Analysis (EDA) is a critical first step in understanding any dataset, especially when dealing with real-world customer and financial data. In this project, we perform a structured EDA on a bank customer dataset to gain insights into customer characteristics, behavioral patterns, and key factors that influence banking-related outcomes. Before building predictive models or making strategic decisions, it is essential to explore the data thoroughly to ensure accuracy, consistency, and meaningful interpretation.
The dataset consists of multiple customer-level records representing demographic attributes, financial indicators, and banking-related information. Such datasets often contain a mix of numerical and categorical variables, making them ideal for applying descriptive statistics, data cleaning techniques, and visualization-driven analysis. Through EDA, we aim to understand the overall structure of the dataset, identify missing or inconsistent values, analyze distributions, and uncover relationships between important variables.
Python is used as the primary tool for this analysis due to its strong ecosystem for data handling and visualization. Libraries such as Pandas and NumPy allow efficient data manipulation and statistical exploration, while visualization libraries like Matplotlib and Seaborn help reveal trends, correlations, and anomalies that are not immediately visible in tabular form. These visual insights play a crucial role in simplifying complex data patterns and supporting data-driven conclusions.
The objective of this EDA is to transform raw banking data into actionable insights. By systematically exploring customer demographics, financial behavior, and key attributes, this analysis lays a strong foundation for further tasks such as customer segmentation, risk assessment, churn analysis, or predictive modeling. Ultimately, this process helps bridge the gap between raw data and informed business decision-making in the banking and financial services domain.
Dataset Explanation (Bank Customer Dataset)
Dataset Overview
This dataset contains bank customer information and marketing campaign data used to analyze customer behavior and predict whether a customer will subscribe to a term deposit. Each row represents a single customer record, combining demographic details, financial status, and past marketing interactions. The dataset is commonly used for EDA, classification problems, and customer segmentation analysis in the banking domain.
The primary objective of analyzing this dataset is to understand which customer attributes and campaign factors influence deposit subscription decisions.
Column-wise Explanation
| Column Name | Meaning |
|---|---|
| age | Age of the customer (numeric). |
| job | Type of job the customer has (e.g., admin, technician, services). |
| marital | Marital status of the customer (married, single, divorced). |
| education | Education level of the customer (primary, secondary, tertiary, unknown). |
| default | Whether the customer has credit in default (yes/no). |
| balance | Average yearly account balance in euros. |
| housing | Whether the customer has a housing loan (yes/no). |
| loan | Whether the customer has a personal loan (yes/no). |
| contact | Communication type used to contact the customer (cellular, telephone, unknown). |
| day | Day of the month when the customer was last contacted. |
| month | Month of the year when the customer was last contacted. |
| duration | Duration of the last contact in seconds (important campaign indicator). |
| campaign | Number of contacts performed during the current campaign for this customer. |
| pdays | Number of days passed since the customer was last contacted in a previous campaign (-1 means not previously contacted). |
| previous | Number of contacts performed before this campaign. |
| poutcome | Outcome of the previous marketing campaign (success, failure, other, unknown). |
| deposit | Target variable indicating whether the customer subscribed to a term deposit (yes/no). |
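The `-1` sentinel in `pdays` is worth handling before computing statistics, since treating it as a real number of days distorts averages. Here is a minimal sketch, assuming a tiny made-up DataFrame (the values are illustrative, and the `was_contacted_before` and `pdays_clean` column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"pdays": [-1, 92, -1, 180, 7]})

# Flag customers who were contacted in a previous campaign.
toy["was_contacted_before"] = toy["pdays"] != -1

# Replace the sentinel with NaN so mean(), median(), etc. skip it.
toy["pdays_clean"] = toy["pdays"].replace(-1, np.nan)

print(toy)
print("Mean with sentinel:   ", toy["pdays"].mean())
print("Mean without sentinel:", toy["pdays_clean"].mean())
```

Keeping the flag as a separate column preserves the "never contacted" information even after the sentinel is replaced.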
Why This Dataset Is Suitable for EDA
- Mix of numerical and categorical features
- Clear business objective (deposit subscription)
- Enables behavioral, demographic, and campaign-level analysis
- Ideal for an EDA → feature engineering → modeling pipeline
Let's Start the EDA Step by Step
Open Google Colab and rename your notebook to Bank Customer Dataset EDA.
🔹 Import Required Libraries
import pandas as pd
This line imports Pandas to handle tabular data operations such as loading, filtering, and aggregation.
import numpy as np
This line imports NumPy to support numerical calculations and statistical analysis.
import matplotlib.pyplot as plt
This line enables plotting functionality for visual data exploration.
import seaborn as sns
This line allows us to create cleaner and more insightful statistical visualizations.
🔹 Load the Dataset
df = pd.read_csv("/content/bank.csv")
This line loads the bank customer dataset into a DataFrame so it can be explored and analyzed.
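Before computing statistics, it usually helps to confirm the structure of the loaded data. The sketch below uses a tiny hand-made DataFrame in place of `bank.csv` so that it runs on its own; the values are illustrative, and `summarize_structure` is a hypothetical helper name, not a Pandas function. Several columns in the table above use the string `unknown`, so the helper counts those alongside true `NaN` values.

```python
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, NaN count, 'unknown' count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unknown": (df == "unknown").sum(),
        "n_unique": df.nunique(),
    })

# Toy stand-in for the real file; values are illustrative only.
toy = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "education": ["tertiary", "unknown", "secondary", "primary"],
    "contact": ["cellular", "unknown", "unknown", "telephone"],
    "deposit": ["yes", "no", "no", "yes"],
})

print("Rows, columns:", toy.shape)
print("Duplicate rows:", toy.duplicated().sum())
print(summarize_structure(toy))
```

On the real data, the same call would be `summarize_structure(df)` right after `pd.read_csv`.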
🔹 True Exploratory Data Analysis
df['age'].describe()
This line summarizes customer age to help understand the dominant age range in the dataset.
df['age'].skew()
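This line measures the asymmetry of the age distribution: a positive value indicates a longer tail toward older customers, while a negative value indicates a longer tail toward younger ones.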
df['balance'].describe()
This line analyzes how customer account balances are distributed financially.
df['balance'].skew()
This line measures the asymmetry of the balance distribution; a large positive value suggests that a small number of customers hold disproportionately high balances.
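Skewness alone does not say how many customers sit in the tail. One common complement is the 1.5 × IQR rule, sketched below on made-up balances. `iqr_outlier_mask` is a hypothetical helper, and the 1.5 multiplier is a convention, not something the dataset prescribes.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative balances in euros; not taken from the real dataset.
balance = pd.Series([120, 300, 450, 500, 620, 800, 950, 15000], name="balance")

mask = iqr_outlier_mask(balance)
print("Skewness:", round(balance.skew(), 2))
print("Mean vs median:", balance.mean(), balance.median())
print("Outliers flagged:", int(mask.sum()))
print(balance[mask])
```

With the real data, `iqr_outlier_mask(df['balance'])` would return a boolean mask that can be used to inspect or count the extreme balances.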
df['job'].value_counts(normalize=True) * 100
This line calculates the percentage share of each profession in the customer base.
df.groupby('deposit')['balance'].mean()
This line compares average balances between customers who subscribed to a deposit and those who did not.
pd.crosstab(df['education'], df['deposit'], normalize='index') * 100
This line shows, for each education level, the percentage of customers who did and did not subscribe to a deposit.
pd.crosstab(df['housing'], df['deposit'], normalize='index') * 100
This line examines whether having a housing loan affects the likelihood of deposit acceptance.
df.groupby('deposit')['duration'].mean()
This line compares the average duration of the last contact between customers who subscribed and those who did not.
pd.crosstab(df['poutcome'], df['deposit'], normalize='index') * 100
This line assesses how previous marketing campaign outcomes impact current customer behavior.
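The three crosstabs above follow the same pattern, so they can be wrapped in a small helper. This is a sketch that assumes `deposit` holds the strings `yes` and `no`, as described in the column table; `subscription_rate` is a hypothetical function name, and the toy rows are illustrative. Returning the group size alongside the rate helps avoid over-reading percentages from small groups.

```python
import pandas as pd

def subscription_rate(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Percentage of 'yes' deposits and group size for each category."""
    is_yes = df["deposit"].eq("yes").astype(int)
    out = is_yes.groupby(df[column]).agg(rate_pct="mean", n_customers="count")
    out["rate_pct"] = (out["rate_pct"] * 100).round(1)
    return out.sort_values("rate_pct", ascending=False)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "poutcome": ["success", "success", "failure", "failure", "unknown", "unknown"],
    "deposit":  ["yes",     "yes",     "no",      "yes",     "no",      "no"],
})

print(subscription_rate(toy, "poutcome"))
```

The same helper could then be called with `'education'`, `'housing'`, or any other categorical column.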
df.groupby('deposit')['campaign'].mean()
This line compares the average number of campaign contacts between customers who subscribed and those who did not.
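A single mean per group can hide a non-linear pattern. One way to look closer, sketched here on illustrative rows, is to bin the number of contacts with `pd.cut` and compute the subscription rate within each bin. The bin edges and labels are arbitrary choices for demonstration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 3, 4, 6, 9],
    "deposit":  ["yes", "yes", "yes", "no", "no", "yes", "no", "no"],
})

# Bins are right-inclusive: (0, 1], (1, 3], (3, inf).
bins = [0, 1, 3, float("inf")]
labels = ["1 contact", "2-3 contacts", "4+ contacts"]
toy["campaign_bin"] = pd.cut(toy["campaign"], bins=bins, labels=labels)

# Share of subscribers (in percent) within each bin.
rate = (
    toy.assign(subscribed=toy["deposit"].eq("yes").astype(int))
       .groupby("campaign_bin", observed=True)["subscribed"]
       .mean()
       .mul(100)
       .round(1)
)
print(rate)
```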
Visual Exploratory Data Analysis
🔹 Age Distribution
plt.figure(figsize=(8,5))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()
This plot shows how customer ages are distributed and highlights the most common age groups.
🔹 Balance Distribution
plt.figure(figsize=(8,5))
sns.histplot(df['balance'], bins=40, kde=True)
plt.title("Account Balance Distribution")
plt.show()
This plot shows how customer balances are spread and helps identify skewness and extreme values.
🔹 Job Category vs Deposit Subscription
plt.figure(figsize=(10,5))
sns.countplot(data=df, x='job', hue='deposit')
plt.xticks(rotation=45)
plt.title("Job Type vs Deposit Subscription")
plt.show()
This plot compares deposit subscription counts across job categories.
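Raw counts make it hard to compare job categories of very different sizes. A possible complement, sketched below with made-up rows, is a stacked bar chart of within-category percentages built from `pd.crosstab`. The non-interactive `Agg` backend is selected only so the sketch runs outside a notebook; in Colab that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; optional in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "job": ["admin.", "admin.", "admin.", "student", "student", "retired"],
    "deposit": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Row-normalized percentages: each job category sums to 100.
share = pd.crosstab(toy["job"], toy["deposit"], normalize="index") * 100

ax = share.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_ylabel("Share of customers (%)")
ax.set_title("Deposit Subscription Share by Job Type")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Because every bar has the same height, differences in the `yes` share between categories are easier to see than in the count plot.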
🔹 Education Level vs Deposit Subscription
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='education', hue='deposit')
plt.title("Education vs Deposit Subscription")
plt.show()
This plot compares deposit subscription counts across education levels.
🔹 Housing Loan Impact
plt.figure(figsize=(5,4))
sns.countplot(data=df, x='housing', hue='deposit')
plt.title("Housing Loan vs Deposit Subscription")
plt.show()
This plot shows whether customers with housing loans behave differently in deposit acceptance.
🔹 Loan Status Impact
plt.figure(figsize=(5,4))
sns.countplot(data=df, x='loan', hue='deposit')
plt.title("Personal Loan vs Deposit Subscription")
plt.show()
This plot compares subscription outcomes for customers with and without a personal loan.
🔹 Contact Duration vs Deposit
plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='duration')
plt.title("Contact Duration vs Deposit Outcome")
plt.show()
This plot highlights how call duration differs between successful and unsuccessful conversions.
🔹 Campaign Frequency vs Deposit
plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='campaign')
plt.title("Campaign Contacts vs Deposit Outcome")
plt.show()
This plot examines whether the number of campaign contacts differs between customers who subscribed and those who did not.
🔹 Previous Campaign Outcome Impact
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='poutcome', hue='deposit')
plt.title("Previous Campaign Outcome vs Deposit")
plt.show()
This plot shows how current subscription outcomes vary with the result of the previous campaign.
🔹 Balance vs Age Relationship
plt.figure(figsize=(7,5))
sns.scatterplot(data=df, x='age', y='balance', hue='deposit')
plt.title("Age vs Balance by Deposit Outcome")
plt.show()
This plot shows how account balance relates to age, with points colored by deposit outcome.
