Tags:

Project & Dataset Introduction

In this project, I will teach you how to perform Exploratory Data Analysis (EDA) using Python on a real-world sales dataset. The objective of this project is to help you understand how raw business data can be explored, cleaned, analyzed, and interpreted using Python in a structured and logical way. This project is designed especially for beginners and intermediate learners who want to build strong fundamentals in data analysis and confidently explain insights during interviews, presentations, or video content creation.

EDA is one of the most important stages in any data-driven project. Before building dashboards, applying machine learning models, or making business decisions, it is crucial to understand the data deeply. In this project, I will walk you through how to inspect data, understand patterns, detect anomalies, and derive meaningful insights using Python libraries that work smoothly in Google Colab.

Get the Dataset here:

πŸ“₯ Download Candy Sales Dataset (Excel)

Dataset Introduction

The dataset used in this project is Candy_Sales_Dataset_200_Rows_Corrected.xlsx. This dataset represents sales transactions of a candy manufacturing and distribution business over three different years. Each row corresponds to an individual sales order placed by a customer, and the dataset captures important operational, financial, and product-level information.

This dataset is ideal for EDA because it includes:

  • Time-based data (order dates, years, months)
  • Categorical variables (country, ship mode, factory, division)
  • Numerical variables (sales, units sold, cost, profit)
  • Business metrics useful for profitability analysis

The dataset contains 200 rows, making it large enough to identify trends while still being easy to analyze and explain step by step.


Column-wise Explanation

Now, let’s understand each column and what it represents:

  1. Order Date
    This column shows the exact date on which the customer placed the order. It is useful for time-based analysis such as trends, seasonality, and year-over-year comparison.
  2. Year
    This column represents the year extracted from the order date. It helps in grouping and comparing sales performance across different years.
  3. Month
    This column indicates the month of the order. It is helpful for identifying monthly trends, seasonal spikes, or slow periods in sales.
  4. Customer ID
    This is a unique identifier assigned to each customer. It allows us to analyze customer behavior, repeat purchases, and customer-level performance.
  5. Country
    This column shows the country where the order was shipped. It helps in geographical analysis and understanding market performance by region.
  6. Ship Mode
    This represents the shipping method used for delivering the order, such as Standard Class, First Class, or Second Class. It can be analyzed to understand logistics preferences and cost implications.
  7. Factory
    This column indicates the manufacturing unit or factory where the product was produced. It is useful for operational and factory-level performance analysis.
  8. Division
    This represents the product category or division, such as Chocolate, Sugar, or Other. It helps in understanding which product categories drive the most sales and profit.
  9. Units Sold
    This column shows the number of units sold in a particular order. It is a key metric for volume-based analysis.
  10. Sales
    This represents the total revenue generated from the order. It is one of the most important financial metrics in the dataset.
  11. Profit Margin
    This column shows the profit margin percentage for the order. It helps in understanding pricing efficiency and profitability.
  12. COGS (Cost of Goods Sold)
    This represents the direct cost incurred to produce the goods sold. It is critical for profit and cost analysis.
  13. Gross Profit
    This column shows the actual profit earned from the order after subtracting COGS from sales. It is a key indicator of business performance.

Python EDA Code with Line-by-Line Explanation (Google Colab Friendly)

Get codes here: https://colab.research.google.com/drive/1knfrBVePwjqzs6nbSgQAOkzAx4ZjpUfo?usp=sharing


Step 1: Import Required Libraries

import pandas as pd

This line imports the pandas library, which is used for loading, cleaning, and analyzing structured datasets like Excel files.

import numpy as np

This line imports NumPy, which helps in numerical operations and handling missing or mathematical calculations.

import matplotlib.pyplot as plt

This line imports matplotlib, which is used to create basic data visualizations such as bar charts and line graphs.

import seaborn as sns

This line imports seaborn, a visualization library built on top of matplotlib that helps create more attractive and informative charts.


Step 2: Load the Dataset

df = pd.read_excel('/content/Candy_Sales_Dataset_200_Rows_Corrected.xlsx')

This line reads the Excel file and loads it into a pandas DataFrame called df, which is the main object we will use for analysis.

df.head()

This line displays the first five rows of the dataset so we can visually inspect the structure and values.


Step 3: Understand Dataset Shape and Structure

df.shape

This line shows the total number of rows and columns in the dataset, helping us understand the dataset size.

df.columns

This line displays all column names, which helps us verify the available fields for analysis.

df.info()

This line provides detailed information about each column, including data type and non-null counts.


Step 4: Check for Missing Values

df.isnull().sum()

This line checks each column for missing values and shows how many null values are present.


Step 5: Statistical Summary of Numerical Columns

df.describe()

This line generates descriptive statistics such as mean, minimum, maximum, and standard deviation for numerical columns.


Step 6: Convert Date Column to Date Format

df['Order Date'] = pd.to_datetime(df['Order Date'])

This line converts the Order Date column into a proper datetime format so time-based analysis can be performed.


Step 7: Sales Trend by Year

yearly_sales = df.groupby('Year')['Sales'].sum()

This line groups the dataset by year and calculates total sales for each year.

yearly_sales

This line displays the total sales values for each year.

yearly_sales.plot(kind='bar')

This line creates a bar chart showing total sales for each year.

πŸ‘‰ Graph Explanation:
This bar chart represents how total sales vary across different years, making it easy to identify growth or decline trends.

plt.title('Total Sales by Year')
plt.xlabel('Year')
plt.ylabel('Total Sales')
plt.show()

These lines add a title and labels to the chart and display it clearly.


Step 8: Monthly Sales Analysis

monthly_sales = df.groupby('Month')['Sales'].sum()

This line calculates total sales for each month across all years.

monthly_sales.plot(kind='line', marker='o')

This line creates a line chart to show how sales change month by month.

πŸ‘‰ Graph Explanation:
This line graph helps identify seasonal trends and months with high or low sales performance.

plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

These lines label and display the graph properly.


Step 9: Sales by Product Division

division_sales = df.groupby('Division')['Sales'].sum()

This line calculates total sales for each candy division.

division_sales.plot(kind='bar')

This line creates a bar chart comparing sales across product divisions.

πŸ‘‰ Graph Explanation:
This chart shows which candy division contributes the most to overall revenue.

plt.title('Sales by Division')
plt.xlabel('Division')
plt.ylabel('Sales')
plt.show()

Step 10: Profit Analysis

profit_by_division = df.groupby('Division')['Gross Profit'].sum()

This line calculates total gross profit generated by each division.

profit_by_division.plot(kind='bar')

This line creates a bar chart to compare profitability across divisions.

πŸ‘‰ Graph Explanation:
This graph highlights which product categories are the most profitable.

plt.title('Gross Profit by Division')
plt.xlabel('Division')
plt.ylabel('Gross Profit')
plt.show()

Step 11: Country-wise Sales Distribution

country_sales = df.groupby('Country')['Sales'].sum().sort_values(ascending=False)

This line calculates total sales per country and sorts them from highest to lowest.

country_sales.plot(kind='bar')

This line creates a bar chart showing sales contribution by country.

πŸ‘‰ Graph Explanation:
This chart helps identify the strongest and weakest markets geographically.

plt.title('Sales by Country')
plt.xlabel('Country')
plt.ylabel('Sales')
plt.show()

Step 12: Relationship Between Sales and Profit

sns.scatterplot(data=df, x='Sales', y='Gross Profit')

This line creates a scatter plot to analyze the relationship between sales and gross profit.

πŸ‘‰ Graph Explanation:
This scatter plot shows whether higher sales generally result in higher profit and helps detect anomalies.

plt.title('Sales vs Gross Profit')
plt.show()

Project Conclusion

In this project, we successfully performed a complete Exploratory Data Analysis (EDA) using Python on a real-world candy sales dataset. The primary objective of this analysis was to understand the structure of the data, explore key business metrics, identify trends, and extract meaningful insights that can support data-driven decision-making. By working through this dataset step by step, we demonstrated how EDA forms the foundation of any analytics or data science project.

We began by loading the dataset into Python using Google Colab–friendly libraries such as pandas, NumPy, matplotlib, and seaborn. Understanding the dataset structure using functions like head(), info(), and describe() allowed us to quickly assess data quality, column data types, and the overall health of the dataset. This initial inspection is crucial because it helps identify missing values, incorrect data types, or anomalies before moving into deeper analysis.

Through time-based analysis, we examined sales performance across different years and months. The yearly sales visualization helped us clearly see how business performance evolved over time, making it easy to spot growth patterns or fluctuations. Similarly, the monthly sales trend analysis revealed seasonality, highlighting periods where sales were consistently higher or lower. These insights are extremely valuable for planning marketing campaigns, inventory management, and demand forecasting.

We also analyzed sales and profitability across product divisions and countries. By grouping and visualizing sales by division, we identified which candy categories contributed the most to revenue and which ones drove higher profits. This kind of analysis is essential for product strategy, as it helps businesses focus on high-performing categories while re-evaluating underperforming ones. Country-wise sales analysis further provided a geographical perspective, showing which markets were strongest and where expansion or improvement opportunities might exist.

Finally, by analyzing the relationship between sales and gross profit using a scatter plot, we gained insight into how revenue translates into profitability. This step helped highlight pricing efficiency and identify any unusual patterns that might require further investigation.

Overall, this project demonstrated how Python-based EDA transforms raw data into actionable insights. It reinforced the importance of asking the right questions, visualizing data effectively, and interpreting results from a business perspective. With this structured approach, learners can confidently apply EDA techniques to any dataset, making this project a strong foundation for advanced analytics, dashboarding, and machine learning workflows.