Bank Customer Dataset EDA - Python Libraries

Categories: Python Pandas Tutorial

Introduction

Exploratory Data Analysis (EDA) is a critical first step in understanding any dataset, especially when dealing with real-world customer and financial data. In this project, we perform a structured EDA on a bank customer dataset to gain insights into customer characteristics, behavioral patterns, and key factors that influence banking-related outcomes. Before building predictive models or making strategic decisions, it is essential to explore the data thoroughly to ensure accuracy, consistency, and meaningful interpretation.

The dataset consists of multiple customer-level records representing demographic attributes, financial indicators, and banking-related information. Such datasets often contain a mix of numerical and categorical variables, making them ideal for applying descriptive statistics, data cleaning techniques, and visualization-driven analysis. Through EDA, we aim to understand the overall structure of the dataset, identify missing or inconsistent values, analyze distributions, and uncover relationships between important variables.

Python is used as the primary tool for this analysis due to its strong ecosystem for data handling and visualization. Libraries such as Pandas and NumPy allow efficient data manipulation and statistical exploration, while visualization libraries like Matplotlib and Seaborn help reveal trends, correlations, and anomalies that are not immediately visible in tabular form. These visual insights play a crucial role in simplifying complex data patterns and supporting data-driven conclusions.

The objective of this EDA is to transform raw banking data into actionable insights. By systematically exploring customer demographics, financial behavior, and key attributes, this analysis lays a strong foundation for further tasks such as customer segmentation, risk assessment, churn analysis, or predictive modeling. Ultimately, this process helps bridge the gap between raw data and informed business decision-making in the banking and financial services domain.

Dataset Explanation (Bank Customer Dataset)

Dataset Overview


This dataset contains bank customer information and marketing campaign data used to analyze customer behavior and predict whether a customer will subscribe to a term deposit. Each row represents a single customer record, combining demographic details, financial status, and past marketing interactions. The dataset is commonly used for EDA, classification problems, and customer segmentation analysis in the banking domain.

The primary objective of analyzing this dataset is to understand which customer attributes and campaign factors influence deposit subscription decisions.

Column-wise Explanation

Column Name	Meaning
age	Age of the customer (numeric).
job	Type of job the customer has (e.g., admin, technician, services).
marital	Marital status of the customer (married, single, divorced).
education	Education level of the customer (primary, secondary, tertiary, unknown).
default	Whether the customer has credit in default (yes/no).
balance	Average yearly account balance in euros.
housing	Whether the customer has a housing loan (yes/no).
loan	Whether the customer has a personal loan (yes/no).
contact	Communication type used to contact the customer (cellular, telephone, unknown).
day	Day of the month when the customer was last contacted.
month	Month of the year when the customer was last contacted.
duration	Duration of the last contact in seconds. It is known only after the call ends, so it is useful for analysis but should be treated with care in any predictive model.
campaign	Number of contacts performed during the current campaign for this customer.
pdays	Number of days passed since the customer was last contacted in a previous campaign (-1 means not previously contacted).
previous	Number of contacts performed before this campaign.
poutcome	Outcome of the previous marketing campaign (success, failure, unknown).
deposit	Target variable indicating whether the customer subscribed to a term deposit (yes/no).

Why This Dataset Is Suitable for EDA

Mix of numerical and categorical features
Clear business objective (deposit subscription)
Enables behavioral, demographic, and campaign-level analysis
Ideal for EDA → feature engineering → modeling pipeline

Let's Start the EDA Step by Step

Open Google Colab and rename your notebook to Bank Customer Dataset EDA.

🔹 Import Required Libraries

import pandas as pd

This line imports Pandas to handle tabular data operations such as loading, filtering, and aggregation.

import numpy as np

This line imports NumPy to support numerical calculations and statistical analysis.

import matplotlib.pyplot as plt

This line enables plotting functionality for visual data exploration.

import seaborn as sns

This line allows us to create cleaner and more insightful statistical visualizations.

🔹 Load the Dataset

df = pd.read_csv("/content/bank.csv")

This line loads the bank customer dataset into a DataFrame so it can be explored and analyzed.
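Before computing any statistics, it usually helps to confirm the shape of the data, the column types, and whether any values are missing. The sketch below uses a small made-up sample in place of bank.csv (the rows and the name `sample` are invented for illustration). It also counts the "unknown" placeholder separately, because it is stored as ordinary text and is not reported by `isna()`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for bank.csv (values are invented).
sample_csv = io.StringIO(
    "age,job,education,balance,deposit\n"
    "59,admin.,secondary,2343,yes\n"
    "41,technician,unknown,1270,no\n"
    "35,services,tertiary,-120,no\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # which columns are numeric and which are text

# Count of true missing values (NaN) in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# "unknown" is a regular string, so it has to be counted explicitly
unknown_per_column = (sample == "unknown").sum()
print(unknown_per_column)
```

On the real data, the same calls can be applied to df directly.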

🔹 Numerical Exploratory Data Analysis

df['age'].describe()

This line reports the count, mean, standard deviation, minimum, quartiles, and maximum of customer age, which shows the typical age range in the dataset.

df['age'].skew()

This line measures the asymmetry of the age distribution. A positive value means most customers are younger with a long tail of older customers; a negative value means the opposite.

df['balance'].describe()

This line analyzes how customer account balances are distributed financially.

df['balance'].skew()

This line identifies whether a small number of customers hold disproportionately high balances.
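To make the skewness value easier to interpret, it can be read alongside the mean, the median, and a simple outlier rule. The following is a minimal sketch on invented balances (not values from the dataset); the 1.5 × IQR fence is one common convention, used here as an assumption rather than as the only valid threshold.

```python
import pandas as pd

# Hypothetical balances in euros: most are small, one is very large.
balance = pd.Series([120, 250, 310, 400, 520, 640, 800, 45000])

skewness = balance.skew()          # positive: long tail of high values
mean_balance = balance.mean()      # pulled upward by the large balance
median_balance = balance.median()  # robust to the large balance

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers
q1, q3 = balance.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
high_outliers = balance[balance > upper_fence]

print(f"skew={skewness:.2f}, mean={mean_balance:.0f}, median={median_balance:.0f}")
print(high_outliers)
```

When the mean sits far above the median, as it does here, a few large balances are dominating the average.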

df['job'].value_counts(normalize=True) * 100

This line calculates the percentage share of each profession in the customer base.

df.groupby('deposit')['balance'].mean()

This line compares average balances between customers who subscribed to a deposit and those who did not.
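Because balances tend to be skewed, a group mean can be driven by a handful of customers. One option is to report the median and the group size next to the mean. The sketch below uses an invented six-row frame named `toy` to show the idea.

```python
import pandas as pd

# Hypothetical data: one subscriber has a very large balance.
toy = pd.DataFrame({
    "deposit": ["yes", "yes", "yes", "no", "no", "no"],
    "balance": [500, 700, 30000, 300, 400, 500],
})

# Group size, mean, and median in a single table
balance_by_deposit = toy.groupby("deposit")["balance"].agg(["count", "mean", "median"])
print(balance_by_deposit)
```

In this toy example the "yes" group has a mean of 10400 but a median of only 700, so the median gives a better picture of a typical subscriber.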

pd.crosstab(df['education'], df['deposit'], normalize='index') * 100

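This line shows, for each education level, the percentage of customers who subscribed and who did not. With normalize='index', each row sums to 100, so the levels can be compared even when their group sizes differ.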

pd.crosstab(df['housing'], df['deposit'], normalize='index') * 100

This line compares deposit subscription rates between customers with and without a housing loan.
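The effect of normalize='index' is easiest to see next to the raw counts. The following sketch builds both tables from an invented six-row frame (`toy` and its values are assumptions made for illustration).

```python
import pandas as pd

# Hypothetical data: four customers with a housing loan, two without.
toy = pd.DataFrame({
    "housing": ["yes", "yes", "yes", "yes", "no", "no"],
    "deposit": ["no", "no", "no", "yes", "yes", "no"],
})

# Raw counts per (housing, deposit) combination
counts = pd.crosstab(toy["housing"], toy["deposit"])

# Row percentages: each housing group sums to 100
row_pct = pd.crosstab(toy["housing"], toy["deposit"], normalize="index") * 100

print(counts)
print(row_pct)
```

Here the "yes" housing row reads 75% no and 25% yes, even though that group is twice as large as the other.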

df.groupby('deposit')['duration'].mean()

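This line compares the average duration of the last call between subscribers and non-subscribers. A higher average among subscribers indicates an association; it does not show that longer calls cause subscriptions.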

pd.crosstab(df['poutcome'], df['deposit'], normalize='index') * 100

This line assesses how previous marketing campaign outcomes impact current customer behavior.
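When looking at previous-campaign columns, keep in mind that pdays uses -1 as a marker for "not previously contacted", as noted in the column descriptions. Averaging the column as-is mixes that marker with real day counts. A minimal sketch of one way to handle it, on invented rows (the column names `previously_contacted` and `pdays_clean` are hypothetical):

```python
import pandas as pd

# Hypothetical data: three customers were never contacted before (pdays == -1).
toy = pd.DataFrame({
    "pdays": [-1, -1, 92, 180, -1],
    "poutcome": ["unknown", "unknown", "failure", "success", "unknown"],
})

# Flag customers who were reached in an earlier campaign
toy["previously_contacted"] = toy["pdays"] != -1

# Keep real day counts and turn the -1 marker into NaN
toy["pdays_clean"] = toy["pdays"].where(toy["previously_contacted"])

naive_mean = toy["pdays"].mean()        # distorted by the -1 marker
mean_gap = toy["pdays_clean"].mean()    # NaN values are skipped

print(naive_mean, mean_gap)
```

In this toy frame the naive mean is 53.8 days, while the mean over customers who were actually contacted is 136 days.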

df.groupby('deposit')['campaign'].mean()

This line compares the average number of campaign contacts between customers who subscribed and those who did not.
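A complementary view is to turn the question around and compute the subscription rate for each level of contact frequency. The sketch below groups contacts into bands with pd.cut; the band edges, the labels, and the invented rows are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical data: number of contacts and the outcome for 11 customers.
toy = pd.DataFrame({
    "campaign": [1, 1, 1, 1, 2, 2, 3, 4, 6, 9, 12],
    "deposit": ["yes", "yes", "no", "no", "yes", "no", "no",
                "yes", "no", "no", "no"],
})

# 1 if the customer subscribed, 0 otherwise, so the mean is a rate
toy["subscribed"] = (toy["deposit"] == "yes").astype(int)

# Bands: exactly 1 contact, 2 to 3 contacts, 4 or more contacts
bins = [0, 1, 3, float("inf")]
labels = ["1", "2-3", "4+"]
toy["contact_band"] = pd.cut(toy["campaign"], bins=bins, labels=labels)

# Subscription rate (%) within each band
rate_by_band = toy.groupby("contact_band", observed=True)["subscribed"].mean() * 100
print(rate_by_band)
```

On the real data, the same steps can be applied to df to see how the rate changes as the number of contacts grows.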

Visual Exploratory Data Analysis

🔹 Age Distribution

plt.figure(figsize=(8,5))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()

This block plots a histogram of customer ages with a density curve and highlights the most common age groups.

🔹 Balance Distribution

plt.figure(figsize=(8,5))
sns.histplot(df['balance'], bins=40, kde=True)
plt.title("Account Balance Distribution")
plt.show()

This block shows how customer balances are spread and helps identify skewness and extreme values.

🔹 Job Category vs Deposit Subscription

plt.figure(figsize=(10,5))
sns.countplot(data=df, x='job', hue='deposit')
plt.xticks(rotation=45)
plt.title("Job Type vs Deposit Subscription")
plt.show()

This block compares the number of subscribers and non-subscribers across job categories.

🔹 Education Level vs Deposit Subscription

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='education', hue='deposit')
plt.title("Education vs Deposit Subscription")
plt.show()

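This block shows the number of subscribers and non-subscribers at each education level. Because the bars are raw counts, larger groups have taller bars regardless of their subscription rate.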
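When group sizes differ, a chart of row percentages can be easier to compare than a chart of counts. The following sketch draws a stacked percentage bar chart with pandas' built-in plotting (which uses Matplotlib) instead of Seaborn; the frame `toy` and its values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: the secondary group is twice as large as the others.
toy = pd.DataFrame({
    "education": ["primary", "primary",
                  "secondary", "secondary", "secondary", "secondary",
                  "tertiary", "tertiary"],
    "deposit": ["no", "yes", "no", "no", "no", "yes", "yes", "yes"],
})

# Percentage of each deposit outcome within every education level
share = pd.crosstab(toy["education"], toy["deposit"], normalize="index") * 100

# Stacked bars: every bar has the same total height (100%)
ax = share.plot(kind="bar", stacked=True, figsize=(6, 4))
ax.set_ylabel("Share of customers (%)")
ax.set_title("Education vs Deposit Subscription (row percentages)")
plt.tight_layout()
plt.show()
```

Each bar now reaches 100, so the split inside a bar reflects the subscription rate rather than the size of the group.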

🔹 Housing Loan Impact

plt.figure(figsize=(5,4))
sns.countplot(data=df, x='housing', hue='deposit')
plt.title("Housing Loan vs Deposit Subscription")
plt.show()

This block shows whether customers with housing loans differ from other customers in deposit acceptance.

🔹 Loan Status Impact

plt.figure(figsize=(5,4))
sns.countplot(data=df, x='loan', hue='deposit')
plt.title("Personal Loan vs Deposit Subscription")
plt.show()

This block compares subscription outcomes between customers with and without a personal loan.

🔹 Contact Duration vs Deposit

plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='duration')
plt.title("Contact Duration vs Deposit Outcome")
plt.show()

This block highlights how call duration differs between customers who subscribed and those who did not.
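The box in a boxplot is drawn from the quartiles of each group, so the same comparison can be read as numbers. A minimal sketch on invented durations (the frame `toy` and the column names `q1`, `median`, `q3`, and `iqr` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical call durations in seconds for two outcome groups.
toy = pd.DataFrame({
    "deposit": ["no"] * 5 + ["yes"] * 5,
    "duration": [60, 90, 120, 150, 600, 200, 350, 500, 650, 900],
})

# Quartiles per group; unstack() turns the quantile level into columns
quartiles = (
    toy.groupby("deposit")["duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box in the boxplot
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

In this toy example the median is 120 seconds for "no" and 500 seconds for "yes", which is the gap between the center lines of the two boxes.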

🔹 Campaign Frequency vs Deposit

plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='deposit', y='campaign')
plt.title("Campaign Contacts vs Deposit Outcome")
plt.show()

This block compares the distribution of campaign contacts between customers who subscribed and those who did not.

🔹 Previous Campaign Outcome Impact

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='poutcome', hue='deposit')
plt.title("Previous Campaign Outcome vs Deposit")
plt.show()

This block shows how deposit outcomes are split for each result of the previous campaign.

🔹 Balance vs Age Relationship

plt.figure(figsize=(7,5))
sns.scatterplot(data=df, x='age', y='balance', hue='deposit')
plt.title("Age vs Balance by Deposit Outcome")
plt.show()
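The scatter plot shows how balance varies with age and whether the two deposit groups occupy different regions. To put a number on the age and balance relationship, a correlation coefficient can be computed. The sketch below uses invented values and reports both Pearson (linear) and Spearman (rank-based) correlation; Spearman is included on the assumption that balances contain extreme values, which it is less sensitive to.

```python
import pandas as pd

# Hypothetical data: the oldest customer has a very large balance.
toy = pd.DataFrame({
    "age": [25, 32, 41, 47, 58, 63],
    "balance": [300, 1200, 800, 2500, 1500, 40000],
})

# Pearson: strength of the linear relationship
pearson = toy.corr(method="pearson").loc["age", "balance"]

# Spearman: strength of the monotonic (rank-order) relationship
spearman = toy.corr(method="spearman").loc["age", "balance"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

On the real data, the same calls can be run on df[['age', 'balance']]. A large gap between the two coefficients is a hint that a few extreme points are shaping the linear estimate.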