Project & Dataset Explanation
Project Introduction
In this project, I will teach you how to perform Exploratory Data Analysis (EDA) on a real-world retail sales dataset using Python. The goal of this project is not just to analyze numbers, but to think like a data analyst working in a retail or e-commerce company.
Megamart is a multi-category retail business selling products across regions such as Electronics, Furniture, and Groceries, serving different customer segments like Retail, Corporate, and Wholesale. Every order generates sales, profit, and operational data, and this dataset captures exactly that.
Through this project, you will learn how to:
- Understand a raw business dataset before writing any analysis code
- Identify key business metrics such as sales, profit, quantity sold, and margins
- Prepare the dataset mentally for EDA and visualization
- Ask the right business questions from data (not just write Python code)
This project is designed especially for:
- Beginners learning Python EDA
- Students preparing for Data Analyst / Business Analyst roles
- Anyone creating EDA tutorial videos or practicing real datasets
Before we touch Python in DSP2, DSP1 focuses on dataset understanding, because bad understanding = bad analysis, no matter how good your code is.
Dataset Overview
The dataset is titled “Megamart Sales Performance Dataset”.
Each row represents one customer order placed at Megamart.
The dataset combines sales, customer, product, region, and profit information, making it ideal for end-to-end EDA.
The dataset contains the following columns:
Column-wise Explanation
Dataset : https://colorstech.net/wp-content/uploads/2025/03/Megamart_Sales_Performance_Dataset.xlsx
- Order ID
A unique identifier for each order.
This helps track individual transactions and ensures no duplication during analysis. - Order Date
The date on which the order was placed.
This column is crucial for time-based analysis, such as monthly sales trends or seasonal performance. - Customer ID
A unique code assigned to each customer.
Useful for identifying repeat customers and customer-level analysis. - Customer Name
The name of the customer who placed the order.
Mostly informational, but helpful for reports and dashboards. - Region
The geographical region where the order was placed, such as North, South, East, or West.
This column enables regional sales and profit comparison. - Product Category
High-level classification of products like Electronics, Furniture, or Groceries.
This helps analyze category-wise performance. - Product Name
The specific product sold, such as Laptop, Smartphone, Table, or Eggs.
Useful for product-level sales insights. - Quantity Sold
Number of units sold in that order.
This is important for understanding demand and volume-based performance. - Unit Price
Price of one unit of the product.
Combined with quantity, this helps calculate revenue. - Total Sales
Total revenue generated from the order.
Usually calculated as Quantity Sold × Unit Price. - Profit Margin (%)
Percentage profit earned on the sale.
This shows how profitable a product or category is. - Profit Amount
Actual profit earned from the order in monetary terms.
This is one of the most critical business metrics. - Payment Method
Mode of payment used, such as Credit Card, PayPal, or Bank Transfer.
Useful for payment behavior analysis. - Order Status
Status of the order, such as Completed, Pending, or Canceled.
Important for operational and fulfillment analysis. - Customer Segment
Type of customer: Retail, Corporate, or Wholesale.
Helps analyze which segment contributes most to revenue and profit.
Megamart Sales Performance Dataset | Python EDA
Step 1: Import Google-Colab-Friendly Libraries
import pandas as pd
👉 This line imports pandas, which we use to load, view, and analyze datasets in table format.
import numpy as np
👉 This line imports NumPy, which helps in numerical calculations and handling missing values.
import matplotlib.pyplot as plt
👉 This line imports Matplotlib, which we use for basic data visualizations.
import seaborn as sns
👉 This line imports Seaborn, which makes charts look cleaner and more professional.
Step 2: Load the Dataset
df = pd.read_excel('/mnt/data/Megamart_Sales_Performance_Dataset.xlsx')
👉 This line loads the Megamart Excel dataset into a DataFrame called df.
Step 3: View First Few Rows
df.head()
👉 This line shows the first 5 rows of the dataset so we can understand the structure.
Step 4: Check Dataset Shape
df.shape
👉 This line tells us how many rows and columns are present in the dataset.
Step 5: Check Column Names
df.columns
👉 This line displays all column names so we know what data is available.
Step 6: Dataset Information
df.info()
👉 This line shows data types, non-null values, and memory usage of each column.
Step 7: Check Missing Values
df.isnull().sum()
👉 This line checks how many missing values are present in each column.
Step 8: Statistical Summary
df.describe()
👉 This line gives statistical details like mean, min, max, and quartiles for numerical columns.
Step 9: Total Sales of Megamart
df['Total Sales'].sum()
👉 This line calculates total revenue generated by Megamart.
Step 10: Total Profit
df['Profit Amount'].sum()
👉 This line calculates total profit earned across all orders.
Step 11: Sales by Product Category
df.groupby('Product Category')['Total Sales'].sum()
👉 This line groups data by product category and calculates total sales for each category.
Step 12: Profit by Product Category
df.groupby('Product Category')['Profit Amount'].sum()
👉 This line shows which product category is generating the highest profit.
Step 13: Sales by Region
df.groupby('Region')['Total Sales'].sum()
👉 This line calculates total sales region-wise.
Step 14: Profit by Region
df.groupby('Region')['Profit Amount'].sum()
👉 This line helps identify the most profitable region.
Step 15: Quantity Sold by Category
df.groupby('Product Category')['Quantity Sold'].sum()
👉 This line shows demand volume for each product category.
Step 16: Average Profit Margin by Category
df.groupby('Product Category')['Profit Margin (%)'].mean()
👉 This line calculates the average profit margin for each category.
Step 17: Sales by Customer Segment
df.groupby('Customer Segment')['Total Sales'].sum()
👉 This line identifies which customer segment contributes most to revenue.
Step 18: Profit by Customer Segment
df.groupby('Customer Segment')['Profit Amount'].sum()
👉 This line shows which customer segment is most profitable.
Step 19: Payment Method Usage Count
df['Payment Method'].value_counts()
👉 This line shows how frequently each payment method is used.
Step 20: Order Status Distribution
df['Order Status'].value_counts()
👉 This line counts completed, pending, and canceled orders.
Step 21: Monthly Sales Trend
df['Order Date'] = pd.to_datetime(df['Order Date'])
👉 This line converts Order Date into proper date format.
df['Month'] = df['Order Date'].dt.month
👉 This line extracts month from order date.
df.groupby('Month')['Total Sales'].sum()
👉 This line calculates total sales month-wise to identify trends.
Step 22: Top 5 Selling Products
df.groupby('Product Name')['Quantity Sold'].sum().sort_values(ascending=False).head()
👉 This line finds the top 5 products sold based on quantity.
Step 23: Top 5 Profitable Products
df.groupby('Product Name')['Profit Amount'].sum().sort_values(ascending=False).head()
👉 This line identifies the top 5 products generating highest profit.
Megamart Sales Performance | Visualization EDA
Step 1: Set Visualization Style
sns.set(style="whitegrid")
👉 This line sets a clean white grid background for all charts.
Step 2: Sales by Product Category (Bar Chart)
category_sales = df.groupby('Product Category')['Total Sales'].sum()
👉 This line calculates total sales for each product category.
category_sales.plot(kind='bar')
👉 This line creates a bar chart showing sales by product category.
plt.title('Total Sales by Product Category')
👉 This line adds a title to the chart.
plt.xlabel('Product Category')
👉 This line labels the X-axis as product category.
plt.ylabel('Total Sales')
👉 This line labels the Y-axis as total sales amount.
plt.show()
👉 This line displays the chart.
Step 3: Profit by Product Category (Bar Chart)
category_profit = df.groupby('Product Category')['Profit Amount'].sum()
👉 This line calculates total profit for each product category.
category_profit.plot(kind='bar')
👉 This line creates a bar chart for profit by category.
plt.title('Total Profit by Product Category')
👉 This line adds chart title.
plt.xlabel('Product Category')
👉 This line labels X-axis.
plt.ylabel('Profit Amount')
👉 This line labels Y-axis.
plt.show()
👉 This line shows the chart.
Step 4: Sales by Region (Bar Chart)
region_sales = df.groupby('Region')['Total Sales'].sum()
👉 This line calculates region-wise sales.
region_sales.plot(kind='bar')
👉 This line creates a bar chart for regional sales.
plt.title('Total Sales by Region')
👉 This line adds a title.
plt.xlabel('Region')
👉 This line labels the X-axis.
plt.ylabel('Total Sales')
👉 This line labels the Y-axis.
plt.show()
👉 This line displays the chart.
Step 5: Profit by Region (Bar Chart)
region_profit = df.groupby('Region')['Profit Amount'].sum()
👉 This line calculates region-wise profit.
region_profit.plot(kind='bar')
👉 This line plots profit by region.
plt.title('Total Profit by Region')
👉 This line adds chart title.
plt.xlabel('Region')
👉 This line labels X-axis.
plt.ylabel('Profit Amount')
👉 This line labels Y-axis.
plt.show()
👉 This line shows the chart.
Step 6: Sales by Customer Segment (Bar Chart)
segment_sales = df.groupby('Customer Segment')['Total Sales'].sum()
👉 This line calculates total sales by customer segment.
segment_sales.plot(kind='bar')
👉 This line creates a bar chart for segment sales.
plt.title('Total Sales by Customer Segment')
👉 This line adds title.
plt.xlabel('Customer Segment')
👉 This line labels X-axis.
plt.ylabel('Total Sales')
👉 This line labels Y-axis.
plt.show()
👉 This line displays the chart.
Step 7: Order Status Distribution (Pie Chart)
order_status = df['Order Status'].value_counts()
👉 This line counts each order status.
order_status.plot(kind='pie', autopct='%1.1f%%')
👉 This line creates a pie chart showing order status distribution.
plt.title('Order Status Distribution')
👉 This line adds title.
plt.ylabel('')
👉 This line removes Y-axis label for clean pie chart.
plt.show()
👉 This line shows the chart.
Step 8: Monthly Sales Trend (Line Chart)
monthly_sales = df.groupby('Month')['Total Sales'].sum()
👉 This line calculates total sales for each month.
monthly_sales.plot(kind='line', marker='o')
👉 This line creates a line chart to show monthly trend.
plt.title('Monthly Sales Trend')
👉 This line adds chart title.
plt.xlabel('Month')
👉 This line labels X-axis.
plt.ylabel('Total Sales')
👉 This line labels Y-axis.
plt.show()
👉 This line displays the chart.
Step 9: Top 5 Products by Quantity Sold (Bar Chart)
top_products = df.groupby('Product Name')['Quantity Sold'].sum().sort_values(ascending=False).head()
👉 This line finds top 5 products by quantity sold.
top_products.plot(kind='bar')
👉 This line creates bar chart for top products.
plt.title('Top 5 Products by Quantity Sold')
👉 This line adds title.
plt.xlabel('Product Name')
👉 This line labels X-axis.
plt.ylabel('Quantity Sold')
👉 This line labels Y-axis.
plt.show()
👉 This line shows the chart.
Conclusion – Megamart Sales Performance EDA
This Exploratory Data Analysis of the Megamart Sales Performance Dataset clearly demonstrates how raw retail data can be transformed into actionable business insights using Python. By systematically analyzing sales, profit, customer segments, regions, and product categories, we were able to understand not just what is happening in the business, but why it is happening.
The analysis highlights that product categories contribute differently to revenue and profit, proving that high sales do not always mean high profitability. Region-wise evaluation shows how geographical performance varies, which can directly influence inventory planning, marketing budgets, and regional expansion strategies. Customer segment analysis further reveals which type of customers drive the most value, helping businesses focus on retention and targeted campaigns.
The visualization-driven EDA makes trends instantly understandable. Monthly sales trends uncover seasonality, while top-product analysis identifies demand leaders that deserve priority in stock management and promotions. Order status and payment method analysis provide operational insights that can improve customer experience and reduce revenue leakage.
Most importantly, this project reinforces a critical data analytics lesson: EDA is not about charts or code alone—it is about asking the right business questions. Before advanced modeling or forecasting, a strong EDA ensures data quality, clarity, and direction.
