The Student Performance dataset in the UCI Machine Learning Repository records factors that may influence students’ academic performance. It is often used for predictive modeling, classification, and exploratory data analysis projects.
Key Features of the Dataset:
- Source:
  - The data was collected from secondary education students in Portugal.
  - It originated from a study aimed at predicting student performance based on various personal, social, and school-related factors.
- Dataset Details:
  - The dataset contains two subsets:
    - Mathematics: Focused on math performance.
    - Portuguese: Focused on Portuguese language performance.
  - Each subset contains the same features.
- Features: The dataset includes student attributes such as:
  - Demographic information:
    - `sex`: Student’s gender.
    - `age`: Student’s age.
    - `address`: Urban or rural address (`U` or `R`).
  - Family-related attributes:
    - `famsize`: Family size.
    - `Pstatus`: Parents’ cohabitation status (living together or apart).
    - `Medu`, `Fedu`: Education level of mother and father.
    - `Mjob`, `Fjob`: Mother’s and father’s job.
    - `guardian`: Guardian of the student.
  - Social/School factors:
    - `schoolsup`: Extra educational support.
    - `famsup`: Family support for study.
    - `paid`: Extra paid classes.
    - `activities`: Participation in extracurricular activities.
    - `internet`: Internet access at home.
    - `romantic`: In a romantic relationship.
  - Academic information:
    - `absences`: Number of school absences.
    - `G1`, `G2`, `G3`: Grades for the first, second, and final periods.
- Target Variable:
  - The primary target variable is `G3` (final grade), which can be used for:
    - Regression (predicting numerical grades).
    - Classification (e.g., pass/fail or grade categories).
- Applications:
  - Analyzing factors that affect academic performance.
  - Predicting student success.
  - Identifying students who need additional support.
- Size:
  - Number of records: 395 students in the Mathematics subset and 649 in the Portuguese subset.
  - Number of features: 33 attributes (including target).
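The two uses of `G3` listed under Target Variable can be sketched as follows. This is a minimal illustration on a made-up sample of grades, assuming the 0 to 20 grading scale and the pass mark of 10 used later in this post; the grade bands and their labels are hypothetical choices for the example, not part of the dataset.

```python
import pandas as pd

# Made-up sample of final grades on a 0-20 scale (not real dataset rows).
grades = pd.DataFrame({"G3": [0, 8, 10, 13, 15, 19]})

# Binary target: 1 = pass (G3 >= 10), 0 = fail.
grades["passed"] = (grades["G3"] >= 10).astype(int)

# Multi-class target: hypothetical bands chosen only for this sketch.
# With pd.cut's default right=True, each bin includes its right edge.
bins = [-1, 9, 13, 15, 20]
labels = ["fail", "sufficient", "good", "very good"]
grades["band"] = pd.cut(grades["G3"], bins=bins, labels=labels)

print(grades)
```

For regression, `G3` would simply be used as the numeric target without any of these transformations.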
Accessing the Dataset
The dataset is available for free on the UCI Machine Learning Repository.
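The CSV files in this dataset are semicolon-separated, so pandas needs an explicit `sep` argument when reading them. The sketch below shows this on a two-row, made-up sample held in memory; the four columns are a small subset chosen for illustration, and a local file path could be passed to `pd.read_csv` in place of the `StringIO` object.

```python
import io
import pandas as pd

# Made-up two-row sample in a semicolon-separated layout.
# The real files contain 33 columns; only four are imitated here.
sample = io.StringIO(
    'school;sex;age;G3\n'
    '"GP";"F";18;6\n'
    '"MS";"M";17;12\n'
)

# sep=';' tells pandas to split on semicolons instead of commas.
df = pd.read_csv(sample, sep=";")

print(df.shape)
print(df.dtypes)
```

Without `sep=";"`, pandas would treat each line as a single comma-separated field and produce a one-column frame.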
Below is an exploratory data analysis (EDA) and machine learning implementation for this dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load the datasets
url_mat = "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student-mat.csv"
url_por = "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student-por.csv"
student_mat = pd.read_csv(url_mat, sep=';')
student_por = pd.read_csv(url_por, sep=';')
# Combine the datasets if needed (students who appear in both files are counted twice)
students = pd.concat([student_mat, student_por], axis=0).reset_index(drop=True)
# Quick overview of the data
print("Dataset Shape:", students.shape)
print("\nColumns:\n", students.columns)
print("\nMissing Values:\n", students.isnull().sum())
print("\nFirst Few Rows:\n", students.head())
# EDA: Check basic statistics
print("\nDescriptive Statistics:\n", students.describe())
# EDA: Visualizations
sns.countplot(data=students, x='sex', palette='coolwarm')
plt.title("Gender Distribution")
plt.show()
sns.histplot(data=students, x='age', kde=True, bins=10, color='skyblue')
plt.title("Age Distribution")
plt.show()
sns.boxplot(data=students, x='sex', y='G3', palette='coolwarm')
plt.title("Final Grade (G3) Distribution by Gender")
plt.show()
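# Correlations are only defined for numeric columns, so select those first
sns.heatmap(students.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt=".2f")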
plt.title("Correlation Matrix")
plt.show()
# Data Preprocessing
students['target'] = np.where(students['G3'] >= 10, 1, 0) # Binary classification: Pass (1) or Fail (0)
X = students.drop(columns=['G1', 'G2', 'G3', 'target']) # Drop the target and the period grades it is derived from or closely tied to
y = students['target']
# One-hot encoding for categorical variables
X = pd.get_dummies(X, drop_first=True)
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# ML: Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
# Predictions
y_pred = rf_model.predict(X_test)
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
# Feature Importance
importances = rf_model.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances}).sort_values(by='Importance', ascending=False)
# Plot Feature Importance
sns.barplot(data=importance_df, x='Importance', y='Feature', palette='viridis')
plt.title("Feature Importance")
plt.show()
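One caveat about the combining step in the script above: `pd.concat` stacks the two files without recording which subject each row came from. The following is a minimal sketch of one way to keep that information, using small made-up frames in place of `student_mat` and `student_por`. The `subject` column name is a hypothetical choice, not a field in the dataset.

```python
import pandas as pd

# Made-up stand-ins for student_mat and student_por (not real dataset rows).
mat = pd.DataFrame({"sex": ["F", "M"], "G3": [6, 12]})
por = pd.DataFrame({"sex": ["F", "M", "F"], "G3": [11, 13, 9]})

# Tag each row with its source file before stacking.
# "subject" is a hypothetical column added for this sketch.
combined = pd.concat(
    [mat.assign(subject="math"), por.assign(subject="portuguese")],
    ignore_index=True,
)

# Share of passing grades (G3 >= 10) within each subject.
pass_rate = (combined["G3"] >= 10).groupby(combined["subject"]).mean()

print(combined)
print(pass_rate)
```

With the source recorded, the subject can be used as a feature or the two subsets can be modeled separately.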