Bank Customer Dataset EDA with Python Libraries
Introduction

Exploratory Data Analysis (EDA) is a critical first step in understanding any dataset, especially when dealing with real-world customer and financial data. In this project, we perform a structured EDA on a bank customer dataset to gain insights into customer characteristics, behavioral patterns, and key factors that influence banking-related outcomes. Before building predictive models or making strategic decisions, it is essential to explore the data thoroughly to ensure accuracy, consistency, and meaningful interpretation.

The dataset consists of multiple customer-level records representing demographic attributes, financial indicators, and banking-related information. Such datasets often contain a mix of numerical and categorical variables, making them ideal for applying descriptive statistics, data cleaning techniques, and visualization-driven analysis. Through EDA, we aim to understand the overall structure of the dataset, identify missing or inconsistent values, analyze distributions, and uncover relationships between important variables.

Python is used as the primary tool for this analysis due to its strong ecosystem for data handling and visualization. Libraries such as Pandas and NumPy allow efficient data manipulation and statistical exploration, while visualization libraries like Matplotlib and Seaborn help reveal trends, correlations, and anomalies that are not immediately visible in tabular form. These visual insights play a crucial role in simplifying complex data patterns and supporting data-driven conclusions.

The objective of this EDA is to transform raw banking data into actionable insights. By systematically exploring customer demographics, financial behavior, and key attributes, this analysis lays a strong foundation for further tasks such as customer segmentation, risk assessment, churn analysis, or predictive modeling. Ultimately, this process helps bridge the gap between raw data and informed business decision-making in the banking and financial services domain.

Dataset Explanation (Bank Customer Dataset)

Dataset Overview

This dataset contains bank customer information and marketing campaign data used to analyze customer behavior and predict whether a customer will subscribe to a term deposit. Each row represents a single customer record, combining demographic details, financial status, and past marketing interactions. The dataset is commonly used for EDA, classification problems, and customer segmentation analysis in the banking domain.

The primary objective of analyzing this dataset is to understand which customer attributes and campaign factors influence deposit subscription decisions.


Column-wise Explanation

| Column Name | Meaning |
|---|---|
| age | Age of the customer (numeric). |
| job | Type of job the customer has (e.g., admin, technician, services). |
| marital | Marital status of the customer (married, single, divorced). |
| education | Education level of the customer (primary, secondary, tertiary, unknown). |
| default | Whether the customer has credit in default (yes/no). |
| balance | Average yearly account balance in euros. |
| housing | Whether the customer has a housing loan (yes/no). |
| loan | Whether the customer has a personal loan (yes/no). |
| contact | Communication type used to contact the customer (cellular, telephone, unknown). |
| day | Day of the month when the customer was last contacted. |
| month | Month of the year when the customer was last contacted. |
| duration | Duration of the last contact in seconds (important campaign indicator). |
| campaign | Number of contacts performed during the current campaign for this customer. |
| pdays | Number of days passed since the customer was last contacted in a previous campaign (-1 means not previously contacted). |
| previous | Number of contacts performed before this campaign. |
| poutcome | Outcome of the previous marketing campaign (success, failure, unknown). |
| deposit | Target variable indicating whether the customer subscribed to a term deposit (yes/no). |
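
The -1 sentinel in pdays deserves care, because treating it as a real number of days would distort averages. The sketch below shows one possible way to handle it. The three-row sample and the column names previously_contacted and pdays_clean are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical three-row sample mirroring a few of the columns described above;
# the values are made up for illustration only.
sample = pd.DataFrame({
    "age": [34, 51, 28],
    "balance": [1200, -50, 300],
    "pdays": [-1, 92, -1],
    "previous": [0, 2, 0],
    "deposit": ["no", "yes", "no"],
})

# Flag customers who were contacted in an earlier campaign (pdays is not -1).
sample["previously_contacted"] = sample["pdays"] != -1

# Replace the -1 sentinel with a missing value so that means and medians
# are computed only over customers who were actually contacted before.
sample["pdays_clean"] = sample["pdays"].where(sample["pdays"] != -1)

print(sample[["pdays", "previously_contacted", "pdays_clean"]])
print("Mean days since last contact:", sample["pdays_clean"].mean())
```

Keeping both a boolean flag and a cleaned numeric column is one design choice among several: the flag preserves the "never contacted" information, while the cleaned column keeps summary statistics meaningful.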

Why This Dataset Is Suitable for EDA

  • Mix of numerical and categorical features
  • Clear business objective (deposit subscription)
  • Enables behavioral, demographic, and campaign-level analysis
  • Ideal for EDA → feature engineering → modeling pipeline

Let's Start the EDA Step by Step

Open Google Colab and rename your notebook to "Bank Customer Dataset EDA".

🔹 Import Required Libraries

import pandas as pd

This line imports Pandas to handle tabular data operations such as loading, filtering, and aggregation.


import numpy as np

This line imports NumPy to support numerical calculations and statistical analysis.


import matplotlib.pyplot as plt

This line enables plotting functionality for visual data exploration.


import seaborn as sns

This line allows us to create cleaner and more insightful statistical visualizations.


🔹 Load the Dataset

df = pd.read_csv("/content/bank.csv")

This line loads the bank customer dataset into a DataFrame so it can be explored and analyzed.
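
Right after loading, it usually helps to check the shape, column types, missing values, and duplicate rows before any analysis. The snippet below is a minimal sketch of those checks. It uses a tiny made-up CSV held in memory (the name demo and its values are assumptions) so that it runs without the real file; in the notebook the same calls would be applied to df.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for bank.csv so the snippet runs on its own.
csv_text = """age,job,balance,deposit
34,technician,1200,no
51,admin.,-50,yes
28,services,,no
34,technician,1200,no
"""
demo = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns.
print(demo.shape)

# Data type pandas inferred for each column.
print(demo.dtypes)

# Count of missing values per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
duplicate_rows = demo.duplicated().sum()
print("Duplicate rows:", duplicate_rows)
```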


🔹 True Exploratory Data Analysis

df['age'].describe()

This line summarizes customer age to help understand the dominant age range in the dataset.


df['age'].skew()

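This line measures the skewness of the age distribution: a positive value indicates a longer tail toward older customers, while a negative value indicates a longer tail toward younger ones.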


df['balance'].describe()

This line analyzes how customer account balances are distributed financially.


df['balance'].skew()

This line measures the skewness of account balances; a large positive value suggests that a small number of customers hold disproportionately high balances.
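
To make the skewness value easier to interpret, it can be read alongside the mean, the median, and a simple outlier rule. The following is a sketch on made-up balances (the series and its values are hypothetical); the 1.5 × IQR fence is a common convention and not something prescribed by the dataset.

```python
import pandas as pd

# Made-up balances: most are small and one is very large (right-skewed).
balances = pd.Series([100, 150, 200, 250, 300, 350, 400, 20000], name="balance")

skewness = balances.skew()
mean_balance = balances.mean()
median_balance = balances.median()

# In a right-skewed distribution the mean is pulled above the median.
print("Skewness:", skewness)
print("Mean:", mean_balance, "Median:", median_balance)

# Flag unusually high balances with the 1.5 * IQR rule.
q1, q3 = balances.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = balances[balances > upper_fence]
print("Upper fence:", upper_fence)
print("Outliers:", outliers.tolist())
```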


df['job'].value_counts(normalize=True) * 100

This line calculates the percentage share of each profession in the customer base.


df.groupby('deposit')['balance'].mean()

This line compares average balances between customers who subscribed to a deposit and those who did not.


pd.crosstab(df['education'], df['deposit'], normalize='index') * 100

This line shows, for each education level, the percentage of customers who did and did not subscribe to a deposit, so subscription rates can be compared across levels.
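
The normalize='index' argument is what turns raw counts into row percentages. The small sketch below, built on a made-up six-row frame (demo and its values are hypothetical), contrasts the raw counts with the normalized version; the same pattern applies to the housing and poutcome crosstabs.

```python
import pandas as pd

# Made-up sample: 2 customers with primary education, 4 with tertiary.
demo = pd.DataFrame({
    "education": ["primary", "primary", "tertiary", "tertiary", "tertiary", "tertiary"],
    "deposit":   ["no", "yes", "yes", "yes", "yes", "no"],
})

# Raw counts per (education, deposit) combination.
raw_counts = pd.crosstab(demo["education"], demo["deposit"])
print(raw_counts)

# normalize="index" divides each row by its own total,
# so every row sums to 100 after multiplying by 100.
row_pct = pd.crosstab(demo["education"], demo["deposit"], normalize="index") * 100
print(row_pct)
```

Because each row is scaled by its own total, groups of very different sizes become directly comparable.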


pd.crosstab(df['housing'], df['deposit'], normalize='index') * 100

This line examines whether having a housing loan affects the likelihood of deposit acceptance.


df.groupby('deposit')['duration'].mean()

This line compares the average duration of the last contact between customers who subscribed and those who did not.


pd.crosstab(df['poutcome'], df['deposit'], normalize='index') * 100

This line assesses how previous marketing campaign outcomes impact current customer behavior.


df.groupby('deposit')['campaign'].mean()

This line compares the average number of campaign contacts between subscribers and non-subscribers, which hints at whether repeated contacts are associated with higher or lower subscription rates.
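
Group means can be misleading when a column has a few extreme values, so it may be worth looking at the median and group size as well. The following is a minimal sketch on a made-up frame (demo and its values are hypothetical) that computes several statistics per deposit group in one call.

```python
import pandas as pd

# Made-up sample in which one very long call inflates the "no" group's mean.
demo = pd.DataFrame({
    "deposit":  ["no", "no", "no", "yes", "yes", "yes"],
    "duration": [60, 90, 1200, 300, 400, 500],
    "campaign": [1, 2, 9, 1, 1, 2],
})

# Mean, median, and row count for each numeric column, per deposit group.
summary = demo.groupby("deposit")[["duration", "campaign"]].agg(["mean", "median", "count"])
print(summary)
```

In this toy sample the mean duration is higher for the "no" group while the median is much lower, which illustrates why the two statistics are best read together.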

Visual Exploratory Data Analysis

🔹 Age Distribution

plt.figure(figsize=(8,5))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()

This plot visualizes how customer ages are distributed and highlights the most common age groups.


🔹 Balance Distribution

plt.figure(figsize=(8,5))
sns.histplot(df['balance'], bins=40, kde=True)
plt.title("Account Balance Distribution")
plt.show()

This plot shows how customer balances are spread and helps identify skewness and extreme values.
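
When a few very large balances dominate the histogram, a transformed copy of the column can make the bulk of the distribution easier to see. The sketch below uses a signed log transform, which is an assumption on my part rather than a step from the original analysis; it is chosen because it also handles zero and negative balances, which a plain logarithm cannot. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up balances including a negative value, a zero, and one extreme value.
balances = pd.Series([-500, 0, 120, 800, 2500, 81000], name="balance")

# Signed log transform: keep the sign and compress the magnitude with log1p.
signed_log = np.sign(balances) * np.log1p(balances.abs())

print(signed_log.round(3).tolist())
print("Skewness before:", balances.skew())
print("Skewness after:", signed_log.skew())
```

The transformed series keeps the original ordering of customers, so it could be passed to sns.histplot in place of the raw column when only the shape of the distribution matters.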


🔹 Job Category vs Deposit Subscription

plt.figure(figsize=(10,5))
sns.countplot(data=df, x='job', hue='deposit')
plt.xticks(rotation=45)
plt.title("Job Type vs Deposit Subscription")
plt.show()

This plot compares the number of subscribers and non-subscribers within each job category.
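
A count plot reflects group size as much as behavior, so a large job category can show many subscribers even when its subscription rate is low. One way to complement the plot is to compute the subscription rate per job, sketched here on a made-up frame (demo and its values are hypothetical).

```python
import pandas as pd

# Made-up sample: a large group with a low rate and a small group with a high rate.
demo = pd.DataFrame({
    "job": ["management"] * 8 + ["student"] * 2,
    "deposit": ["yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "no"],
})

# Share of "yes" within each job category, as a percentage, highest first.
rate_by_job = (
    demo["deposit"].eq("yes")
    .groupby(demo["job"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(rate_by_job)
```

Here management has more subscribers in absolute terms (2 versus 1) but a lower rate (25% versus 50%). Calling rate_by_job.plot(kind="bar") would chart the rates directly.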


🔹 Education Level vs Deposit Subscription

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='education', hue='deposit')
plt.title("Education vs Deposit Subscription")
plt.show()

This plot visualizes how subscription counts differ across education levels.


🔹 Housing Loan Impact

plt.figure(figsize=(5,4))
sns.countplot(data=df, x='housing', hue='deposit')
plt.title("Housing Loan vs Deposit Subscription")
plt.show()

This plot shows whether customers with housing loans behave differently in deposit acceptance.


🔹 Loan Status Impact

plt.figure(figsize=(5,4))
sns.countplot(data=df, x='loan', hue='deposit')
plt.title("Personal Loan vs Deposit Subscription")
plt.show()

This plot shows how subscription outcomes differ between customers with and without an existing personal loan.


🔹 Contact Duration vs Deposit

plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='duration')
plt.title("Contact Duration vs Deposit Outcome")
plt.show()

This plot highlights how call duration differs between successful and unsuccessful conversions.


🔹 Campaign Frequency vs Deposit

plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='campaign')
plt.title("Campaign Contacts vs Deposit Outcome")
plt.show()

This plot examines whether the number of campaign contacts differs between customers who subscribed and those who did not.
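
Because most customers receive only a few contacts, the box plot can be hard to read at the high end. A possible complement is to group the contact counts into bands and compute the subscription rate per band. The band edges and labels below are arbitrary choices for illustration, and the sample data is made up.

```python
import pandas as pd

# Made-up sample of contact counts and outcomes.
demo = pd.DataFrame({
    "campaign": [1, 1, 2, 3, 4, 6, 8, 12],
    "deposit":  ["yes", "yes", "no", "yes", "no", "no", "no", "no"],
})

# Hypothetical bands: 1 contact, 2-3 contacts, 4-5 contacts, 6 or more.
bins = [0, 1, 3, 5, float("inf")]
labels = ["1", "2-3", "4-5", "6+"]
demo["campaign_band"] = pd.cut(demo["campaign"], bins=bins, labels=labels)

# Percentage of subscribers within each band.
rate_by_band = (
    demo["deposit"].eq("yes")
    .groupby(demo["campaign_band"], observed=False)
    .mean()
    .mul(100)
)
print(rate_by_band)
```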


🔹 Previous Campaign Outcome Impact

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='poutcome', hue='deposit')
plt.title("Previous Campaign Outcome vs Deposit")
plt.show()

This plot visualizes how subscription counts vary with the outcome of the previous campaign.


🔹 Balance vs Age Relationship

plt.figure(figsize=(7,5))
sns.scatterplot(data=df, x='age', y='balance', hue='deposit')
plt.title("Age vs Balance by Deposit Outcome")
plt.show()
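
The scatter plot shows the relationship between age and balance visually; a correlation matrix can summarize the same relationship numerically. The sketch below computes both a Pearson correlation and a rank-based (Spearman-style) correlation, the latter obtained by ranking the columns first. The frame demo and its values are made up for illustration.

```python
import pandas as pd

# Made-up numeric sample.
demo = pd.DataFrame({
    "age": [25, 32, 41, 47, 58, 63],
    "balance": [200, 450, 900, 700, 1500, 2100],
    "duration": [300, 120, 240, 90, 180, 60],
})

# Pearson correlation measures linear association.
pearson_corr = demo.corr()
print(pearson_corr.round(3))

# Ranking each column and then correlating the ranks gives a Spearman-style
# measure that is less sensitive to extreme values such as very high balances.
rank_corr = demo.rank().corr()
print(rank_corr.round(3))
```

In the notebook, either matrix could be passed to sns.heatmap with annot=True to display the values as a color-coded grid.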