Steps to Perform EDA and ML on Spambase dataset

Categories: Machine Learning

Here’s a step-by-step guide to performing **Exploratory Data Analysis (EDA)** and building a **Machine Learning (ML)** model with the Spambase dataset from the UCI repository (https://archive.ics.uci.edu/dataset/94/spambase):

---

### **Step 1: Load the Dataset**
1. Download the dataset from the [UCI Repository](https://archive.ics.uci.edu/dataset/94/spambase) and extract `spambase.data`.
2. Load the dataset into a pandas DataFrame:
   ```python
   import pandas as pd
   
   # Load dataset
   column_names = [f'Feature_{i}' for i in range(1, 58)] + ['Spam']
   spambase_data = pd.read_csv('spambase.data', header=None, names=column_names)
   ```
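
The generic `Feature_i` names above are convenient but hard to interpret. As a sketch, the block below builds descriptive names and maps the generic ones to them. The word and character order is written from memory of the `spambase.names` file shipped with the dataset, so treat it as an assumption and check it against that file before relying on it.

```python
# Attribute names in the order assumed from spambase.names (verify against the file).
words = (
    "make address all 3d our over remove internet order mail receive will "
    "people report addresses free business email you credit your font 000 "
    "money hp hpl george 650 lab labs telnet 857 data 415 85 technology "
    "1999 parts pm direct cs meeting original project re edu table conference"
).split()
chars = [";", "(", "[", "!", "$", "#"]

descriptive_names = (
    [f"word_freq_{w}" for w in words]
    + [f"char_freq_{c}" for c in chars]
    + [
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
        "Spam",
    ]
)

# Map the generic Feature_i names used in this guide to the descriptive ones.
generic_names = [f"Feature_{i}" for i in range(1, 58)] + ["Spam"]
name_map = dict(zip(generic_names, descriptive_names))

print(len(descriptive_names))   # 58 columns: 57 features plus the target
print(name_map["Feature_16"])   # word_freq_free
```

If the order checks out, `spambase_data.rename(columns=name_map)` returns a copy of the DataFrame with readable column names.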

---

### **Step 2: Initial Exploration**
1. **Basic Information**:
   - Check the shape of the data with `spambase_data.shape`.
   - Inspect the first few rows using `spambase_data.head()`.
   - Check data types with `spambase_data.info()`.
   - Check for missing values with `spambase_data.isnull().sum()`.

2. **Class Distribution**:
   - Analyze the target class (`Spam`) to see the proportion of spam and non-spam emails.
     ```python
     print(spambase_data['Spam'].value_counts(normalize=True))
     ```

3. **Summary Statistics**:
   - Use `spambase_data.describe()` to examine the distribution of features.
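
The checks in this step can be collected into one helper. This is a minimal sketch: `quick_overview` is a hypothetical function name, and the `demo` frame is synthetic random data standing in for `spambase_data` so that the block runs without the downloaded file.

```python
import numpy as np
import pandas as pd

def quick_overview(df: pd.DataFrame, target: str = "Spam") -> dict:
    """Collect the basic checks from this step into one dictionary."""
    return {
        "shape": df.shape,
        "missing_total": int(df.isnull().sum().sum()),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }

# Synthetic stand-in for spambase_data (random values, not real emails).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 3)), columns=["Feature_1", "Feature_2", "Feature_3"])
demo["Spam"] = rng.integers(0, 2, size=100)

overview = quick_overview(demo)
print(overview["shape"])          # (100, 4)
print(overview["missing_total"])  # 0
print(demo.describe().loc[["mean", "std", "min", "max"]])
```

On the real data, call `quick_overview(spambase_data)` instead of using the synthetic frame.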

---

### **Step 3: Data Visualization (EDA)**
1. **Distribution of Features**:
   - Plot histograms or density plots for a few selected features.
     ```python
     import matplotlib.pyplot as plt
     import seaborn as sns
     
     sns.histplot(spambase_data['Feature_1'], kde=True)
     plt.show()
     ```

2. **Correlation Analysis**:
   - Compute the correlation matrix to find relationships between features and the target variable.
     ```python
     correlation_matrix = spambase_data.corr()
     sns.heatmap(correlation_matrix, cmap='coolwarm', annot=False)
     plt.show()
     ```

3. **Target vs. Feature Analysis**:
   - Compare the distribution of features for spam vs. non-spam emails.
     ```python
     sns.boxplot(x='Spam', y='Feature_1', data=spambase_data)
     plt.show()
     ```

4. **Outlier Detection**:
   - Use boxplots or scatterplots to detect outliers in features.
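
Boxplots flag outliers visually. The same rule can be applied numerically to rank features by outlier count. The sketch below uses the common 1.5 × IQR rule, which is one possible choice rather than something this guide prescribes. `iqr_outlier_counts` is a hypothetical helper and `demo` holds made-up values.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Made-up values: Feature_1 contains one extreme entry, Feature_2 contains none.
demo = pd.DataFrame({
    "Feature_1": [0.0, 0.1, 0.2, 0.1, 0.0, 0.2, 0.1, 9.0],
    "Feature_2": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

One caveat: when a feature is zero for more than three quarters of the rows, its IQR is 0 and this rule flags every nonzero value, so read the counts alongside the histograms from earlier in this step.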

---

### **Step 4: Preprocessing**
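1. **Train-Test Split**:
   - Split the dataset into training and testing sets before scaling, so that no test-set statistics leak into preprocessing.
     ```python
     from sklearn.model_selection import train_test_split
     
     X = spambase_data.iloc[:, :-1]  # All features except the target
     y = spambase_data['Spam']
     # stratify=y keeps the spam/non-spam ratio the same in both splits
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.3, random_state=42, stratify=y
     )
     ```

2. **Feature Scaling**:
   - Normalize or standardize features using `StandardScaler` or `MinMaxScaler`. Fit the scaler on the training set only, then apply it to both sets.
     ```python
     from sklearn.preprocessing import StandardScaler
     
     scaler = StandardScaler()
     X_train = scaler.fit_transform(X_train)  # learn mean/std from the training data only
     X_test = scaler.transform(X_test)        # reuse the training statistics
     ```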

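3. **Handle Imbalanced Data** (Optional):
   - Spambase is only mildly imbalanced (roughly 39% spam), so resampling is often unnecessary. If you do resample, use techniques like oversampling (`SMOTE`) or undersampling, and apply them to the training set only.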
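
As a dependency-light illustration of oversampling, the sketch below duplicates minority-class rows at random with `sklearn.utils.resample`. This is plain random oversampling, not SMOTE (SMOTE synthesizes new points and lives in the separate `imbalanced-learn` package). `oversample_minority` is a hypothetical helper, and the demo frame is made up.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(X: pd.DataFrame, y: pd.Series, random_state: int = 42):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    counts = y.value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    n_extra = int(counts.loc[majority] - counts.loc[minority])
    if n_extra == 0:
        return X, y  # already balanced
    mask = y == minority
    X_extra, y_extra = resample(
        X[mask], y[mask], replace=True, n_samples=n_extra, random_state=random_state
    )
    return pd.concat([X, X_extra]), pd.concat([y, y_extra])

# Made-up training data: 7 non-spam rows and 3 spam rows.
X_demo = pd.DataFrame({"Feature_1": range(10), "Feature_2": range(10, 20)})
y_demo = pd.Series([0] * 7 + [1] * 3, name="Spam")

X_bal, y_bal = oversample_minority(X_demo, y_demo)
print(y_bal.value_counts().to_dict())
```

Call it on the training split only. Resampling before the split would put copies of the same row in both training and test data and inflate the test scores.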

---

### **Step 5: Build Machine Learning Models**
1. **Train Models**:
   - Start with simple models like Logistic Regression and Decision Trees, and then try more advanced models (SVM, Random Forest, Gradient Boosting, etc.).
     ```python
     from sklearn.ensemble import RandomForestClassifier
     
     model = RandomForestClassifier(random_state=42)
     model.fit(X_train, y_train)
     ```

2. **Evaluate Models**:
   - Use metrics like **accuracy**, **precision**, **recall**, **F1-score**, and **ROC-AUC**.
     ```python
     from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
     
     y_pred = model.predict(X_test)
     print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
     print(classification_report(y_test, y_pred))
     print('ROC-AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
     ```

3. **Hyperparameter Tuning**:
   - Use `GridSearchCV` or `RandomizedSearchCV` to optimize hyperparameters.
     ```python
     from sklearn.model_selection import GridSearchCV
     
     param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
     grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
     grid_search.fit(X_train, y_train)
     ```
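
The step above mentions `RandomizedSearchCV` without showing it. The following sketch samples a fixed number of parameter combinations instead of trying all of them, then reads back the best result. The data comes from `make_classification` as a synthetic stand-in for `X_train` and `y_train`, and the parameter ranges and `scoring="f1"` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train so the block runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # evaluate 5 sampled combinations rather than all 27
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
best_model = search.best_estimator_  # refit on the full search data with the best settings
```

The same three attributes (`best_params_`, `best_score_`, `best_estimator_`) are available on a fitted `GridSearchCV`.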

---

### **Step 6: Interpret Results**
1. **Feature Importance**:
   - Identify the most influential features using model-specific attributes like `feature_importances_` for tree-based models.
     ```python
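     importances = pd.Series(model.feature_importances_, index=column_names[:-1])
     top = importances.sort_values(ascending=False).head(15)  # plot only the 15 most important
     sns.barplot(x=top.values, y=top.index)
     plt.show()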
     ```

2. **Performance Summary**:
   - Summarize results and compare the performance of different models.
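
One way to compare models is to fit each candidate on the same split and collect the metrics in a single table. The sketch below does this for three of the model families named in Step 5. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about Spambase.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled Spambase split.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "roc_auc": roc_auc_score(y_te, proba),
    })

summary = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(summary.round(3))
```

On the real data, replace the synthetic split with `X_train`, `X_test`, `y_train`, and `y_test` from Step 4.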

---

### **Step 7: Deployment and Further Analysis**
1. Save the model using `joblib` or `pickle` for deployment.
   ```python
   import joblib
   joblib.dump(model, 'spam_classifier.pkl')
   ```

2. Test the model on unseen data to validate its real-world performance.
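
A model trained on scaled features needs the same scaler at prediction time. One option, sketched below, is to wrap the scaler and classifier in a scikit-learn pipeline and save that single object. The data is synthetic, the file name `spam_pipeline.pkl` is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Scaler and classifier travel together, so new data is scaled consistently.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=25, random_state=42),
)
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

In a deployment, `restored.predict(new_rows)` accepts raw feature rows in the training column order and scales them internally.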

---

The Spambase dataset includes 57 continuous features derived from the content of emails. These features represent statistical and structural properties of the emails. Here’s a detailed explanation:

### **Categories of Features**

The features can be broadly categorized into the following types:

**Word Frequency Features (Features 1–48)**

These features represent the percentage occurrence of specific words in the email text. Each value is calculated as:

$$
\text{Word Frequency} = \frac{\text{Number of times the word appears in the email}}{\text{Total number of words in the email}} \times 100
$$

- Example words: "make", "address", "free", "money", "business", "credit", "you", "your", etc.
- High frequencies of words like "free" or "money" are often indicative of spam.
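
The word-frequency formula can be reproduced on raw text as a sanity check. This is a rough sketch: the tokenizer (runs of letters and digits, lowercased) is an assumption, and the dataset's original preprocessing may have split words differently. `word_frequency` is a hypothetical helper.

```python
import re

def word_frequency(text: str, word: str) -> float:
    """Return 100 * (occurrences of `word`) / (total words), case-insensitive."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)

email = "Free money! Claim your free prize now"
# "free" appears 2 times among 7 words, so the value is 2 / 7 * 100.
print(round(word_frequency(email, "free"), 2))   # 28.57
print(round(word_frequency(email, "money"), 2))  # 14.29
```
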

**Character Frequency Features (Features 49–54)**

These features represent the percentage of characters in the email that match a specific character, calculated as 100 × (occurrences of the character) / (total number of characters in the email).
- The six characters are: ";", "(", "[", "!", "$", and "#".
- For example, spammers may use a high frequency of "!" or "$" to emphasize promotions or financial terms.

**Capital Run Length Features (Features 55–57)**

These features measure patterns related to the use of uppercase (capitalized) letters, which are often used in spam emails for emphasis.
- Feature 55: Average length of capital letter runs.
  - Average number of consecutive uppercase letters in a sequence.
- Feature 56: Longest capital letter run.
  - Maximum number of consecutive uppercase letters in any sequence.
- Feature 57: Total capital letter occurrences.
  - Total number of uppercase letters in the email.
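
The three capital-run features can be computed from raw text with a single regular expression. The sketch below treats a run as one or more consecutive ASCII uppercase letters. Returning zeros for text with no capitals is an assumption of this example, and `capital_run_features` is a hypothetical helper.

```python
import re

def capital_run_features(text: str) -> dict:
    """Compute average, longest, and total length of uppercase-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption for this sketch: no capitals means all three values are zero.
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }

# Runs are "WIN" (3), "FREE" (4), and "NOW" (3).
features = capital_run_features("WIN a FREE trip NOW")
print(features["longest"])             # 4
print(features["total"])               # 10
print(round(features["average"], 2))   # 3.33
```
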

**Class Label (Column 58)**

The target variable (`Spam`) is binary:
- 1: The email is spam.
- 0: The email is not spam.

### **Examples of Feature Names**

Here are some sample features and their typical relevance:

| Feature Name | Meaning | Example Usage |
| --- | --- | --- |
| Word Frequency (`make`) | % occurrence of the word “make” in the email | Higher in promotional emails (e.g., “Make money fast”). |
| Word Frequency (`free`) | % occurrence of the word “free” | Common in spam emails offering free items. |
| Word Frequency (`money`) | % occurrence of the word “money” | Used in spam about financial offers. |
| Character Frequency (`!`) | % occurrence of the “!” character | Often used for emphasis in spam emails. |
| Capital Run Length (Longest) | Longest sequence of uppercase letters | Spammers may capitalize entire words, as in “WIN A FREE VACATION!” |

### **General Observations**

**Spam Characteristics**:
- Spam emails tend to have a high frequency of promotional words like "free", "money", and "credit".
- Symbols like "!" and "$" are more common in spam emails.
- Spam emails often use capitalized words to grab attention.

**Non-Spam Characteristics**:
- Non-spam (ham) emails generally have more balanced and natural word distributions.
- They are less likely to include excessive capitalization or symbols.
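
These observations can be checked on the data by comparing per-class feature means. The sketch below shows the pattern on a tiny hand-made frame. The values are invented for illustration and are not Spambase rows, and the descriptive column names assume the columns were renamed from `Feature_i`.

```python
import pandas as pd

# Invented illustrative values, NOT real Spambase rows.
demo = pd.DataFrame({
    "word_freq_free": [0.1, 0.2, 0.1, 1.2, 0.9, 2.0],
    "char_freq_!": [0.1, 0.1, 0.2, 0.8, 0.5, 1.1],
    "capital_run_length_longest": [3, 5, 4, 40, 25, 60],
    "Spam": [0, 0, 0, 1, 1, 1],
})

# Mean of each feature within the non-spam (0) and spam (1) groups.
class_means = demo.groupby("Spam").mean()

# Ratio of spam mean to non-spam mean; larger values separate the classes more.
ratio = (class_means.loc[1] / class_means.loc[0]).sort_values(ascending=False)

print(class_means)
print(ratio)
```

On the real dataset, replace `demo` with `spambase_data`. Features whose non-spam mean is 0 produce an infinite ratio, so it is worth inspecting `class_means` directly as well.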
