Here’s a step-by-step guide to performing **Exploratory Data Analysis (EDA)** and building a **Machine Learning (ML)** model on the Spambase dataset from https://archive.ics.uci.edu/dataset/94/spambase:
---
### **Step 1: Load the Dataset**
1. Download the dataset from the [UCI Repository](https://archive.ics.uci.edu/ml/datasets/Spambase).
2. Load the dataset into a pandas DataFrame:
```python
import pandas as pd
# Load dataset
column_names = [f'Feature_{i}' for i in range(1, 58)] + ['Spam']
spambase_data = pd.read_csv('spambase.data', header=None, names=column_names)
```
---
### **Step 2: Initial Exploration**
1. **Basic Information**:
- Check the shape of the data with `spambase_data.shape`.
- Inspect the first few rows using `spambase_data.head()`.
- Check data types with `spambase_data.info()`.
- Check for missing values with `spambase_data.isnull().sum()`.
2. **Class Distribution**:
- Analyze the target class (`Spam`) to see the proportion of spam and non-spam emails.
```python
print(spambase_data['Spam'].value_counts(normalize=True))
```
3. **Summary Statistics**:
- Use `spambase_data.describe()` to examine the distribution of features.
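The checks in this step can be run together. The sketch below uses a small synthetic stand-in DataFrame (`demo` and its values are made up for illustration) so that it runs without the data file; with the real data, substitute `spambase_data` for `demo`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spambase_data so the snippet runs without the file.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 3)),
                    columns=['Feature_1', 'Feature_2', 'Feature_3'])
demo['Spam'] = rng.integers(0, 2, size=100)

print(demo.shape)              # (rows, columns)
print(demo.head())             # first five rows
demo.info()                    # dtypes and non-null counts (prints directly)

missing = demo.isnull().sum()  # missing values per column
print(missing)

summary = demo.describe()      # count, mean, std, min, quartiles, max
print(summary)
```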
---
### **Step 3: Data Visualization (EDA)**
1. **Distribution of Features**:
- Plot histograms or density plots for a few selected features.
```python
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(spambase_data['Feature_1'], kde=True)
plt.show()
```
2. **Correlation Analysis**:
- Compute the correlation matrix to find relationships between features and the target variable.
```python
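correlation_matrix = spambase_data.corr()
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=False)
plt.show()
print(correlation_matrix['Spam'].drop('Spam').sort_values(ascending=False).head(10))  # strongest positive correlations with the target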
```
3. **Target vs. Feature Analysis**:
- Compare the distribution of features for spam vs. non-spam emails.
```python
sns.boxplot(x='Spam', y='Feature_1', data=spambase_data)
plt.show()
```
4. **Outlier Detection**:
- Use boxplots or scatterplots to detect outliers in features.
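Outliers can also be counted numerically. The following is a minimal sketch of the common 1.5 × IQR rule (the same rule boxplot whiskers use by default); the helper `iqr_outlier_mask` and the synthetic `demo` column are hypothetical and only there to make the example runnable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical skewed feature standing in for a Spambase column.
demo = pd.DataFrame({'Feature_1': rng.exponential(scale=1.0, size=500)})

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outlier_mask(demo['Feature_1'])
print(f"{mask.sum()} of {len(demo)} values flagged as outliers")
```

For features that are mostly zero, the IQR may itself be 0, in which case this rule flags every non-zero value; a log transform or a percentile cutoff may be more informative there.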
---
### **Step 4: Preprocessing**
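1. **Train-Test Split**:
- Split the dataset into training and testing sets before scaling, so that no information from the test set leaks into preprocessing.
```python
from sklearn.model_selection import train_test_split
X = spambase_data.iloc[:, :-1] # All features except the target
y = spambase_data['Spam']
# stratify=y keeps the spam/non-spam ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
```
2. **Feature Scaling**:
- Normalize or standardize features using `StandardScaler` or `MinMaxScaler`. Fit the scaler on the training set only, then apply it to both sets.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Learn mean and std from the training data only
X_test = scaler.transform(X_test) # Reuse the training statistics on the test data
```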
3. **Handle Imbalanced Data** (Optional):
- If the dataset is imbalanced, use techniques like oversampling (`SMOTE`) or undersampling.
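Spambase is only mildly imbalanced (roughly 39% spam), so this step is often unnecessary. `SMOTE` comes from the separate `imbalanced-learn` package; as a simpler stand-in, the sketch below does plain random oversampling with scikit-learn's `resample`. The `train` frame is synthetic and hypothetical, and any resampling should be applied to the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical imbalanced training set: 90 non-spam rows, 10 spam rows.
train = pd.DataFrame({'Feature_1': rng.random(100),
                      'Spam': [0] * 90 + [1] * 10})

majority = train[train['Spam'] == 0]
minority = train[train['Spam'] == 1]

# Sample the minority class with replacement until it matches the majority.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

# Recombine and shuffle the rows.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced['Spam'].value_counts())
```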
---
### **Step 5: Build Machine Learning Models**
1. **Train Models**:
- Start with simple models like Logistic Regression and Decision Trees, and then try more advanced models (SVM, Random Forest, Gradient Boosting, etc.).
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```
2. **Evaluate Models**:
- Use metrics like **accuracy**, **precision**, **recall**, **F1-score**, and **ROC-AUC**.
```python
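from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred)) # Rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))
# ROC-AUC needs the predicted probability of the positive (spam) class
print('ROC-AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))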
```
3. **Hyperparameter Tuning**:
- Use `GridSearchCV` or `RandomizedSearchCV` to optimize hyperparameters.
```python
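from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_) # Best combination found by cross-validation
best_model = grid_search.best_estimator_ # Refit on the full training set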
```
---
### **Step 6: Interpret Results**
1. **Feature Importance**:
- Identify the most influential features using model-specific attributes like `feature_importances_` for tree-based models.
```python
importances = model.feature_importances_
sns.barplot(x=importances, y=column_names[:-1])
```
2. **Performance Summary**:
- Summarize results and compare the performance of different models.
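One way to build such a summary is to cross-validate each candidate model and collect the scores in a table. This is a sketch on synthetic data from `make_classification` (a stand-in for the Spambase `X` and `y`); the choice of models, the F1 metric, and the 5 folds are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the Spambase features and target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, estimator in models.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring='f1')
    rows.append({'model': name, 'mean_f1': scores.mean(), 'std_f1': scores.std()})

results = pd.DataFrame(rows).sort_values('mean_f1', ascending=False)
print(results)
```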
---
### **Step 7: Deployment and Further Analysis**
1. Save the model using `joblib` or `pickle` for deployment.
```python
import joblib
joblib.dump(model, 'spam_classifier.pkl')
```
2. Test the model on unseen data to validate its real-world performance.
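A saved model can later be reloaded and applied to new rows. The sketch below is a round trip on synthetic data (57 features to mirror Spambase); the temporary directory and the held-out slice are assumptions made so the example is self-contained.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 57 features, like Spambase.
X, y = make_classification(n_samples=200, n_features=57, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X[:150], y[:150])

# Save to a temporary location, then load it back as a deployment would.
path = os.path.join(tempfile.mkdtemp(), 'spam_classifier.pkl')
joblib.dump(model, path)
loaded = joblib.load(path)

# Predict on rows the model has not seen during training.
predictions = loaded.predict(X[150:])
print(predictions[:10])
```

If the features were scaled before training, the fitted scaler has to be saved and reapplied as well; wrapping the scaler and the model in a single scikit-learn `Pipeline` is one way to keep them together.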
---
The Spambase dataset includes 57 continuous features derived from the content of emails. These features represent statistical and structural properties of the emails. Here’s a detailed explanation:
### **Categories of Features**
The features can be broadly categorized into the following types:
- **Word Frequency Features (Features 1–48)**
  These features represent the percentage occurrence of specific words in the email text. Each value is calculated as:
  $$\text{Word Frequency} = \frac{\text{Number of times the word appears in the email}}{\text{Total number of words in the email}} \times 100$$
  - Example words: `"make"`, `"address"`, `"free"`, `"money"`, `"business"`, `"credit"`, `"you"`, `"your"`, etc.
  - High frequencies of words like `"free"` or `"money"` are often indicative of spam.
- **Character Frequency Features (Features 49–54)**
  These features represent the percentage of specific characters in the email text, calculated like the word frequencies but relative to the total number of characters.
  - Examples: `";"`, `"("`, `"["`, `"!"`, `"$"`, and `"#"`.
  - For example, spammers may use a high frequency of `"!"` or `"$"` to emphasize promotions or financial terms.
- **Capital Run Length Features (Features 55–57)**
  These features measure patterns related to the use of uppercase (capitalized) letters, which are often used in spam emails for emphasis.
  - Feature 55: Average length of capital letter runs.
    - Average number of consecutive uppercase letters in a sequence.
  - Feature 56: Longest capital letter run.
    - Maximum number of consecutive uppercase letters in any sequence.
  - Feature 57: Total capital letter occurrences.
    - Total number of uppercase letters in the email.
- **Class Label (Feature 58)**
  The target variable (`Spam`) is binary:
  - `1`: The email is spam.
  - `0`: The email is not spam.
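To make these definitions concrete, here is a minimal sketch that computes the three kinds of features from a raw string. The function names, the tokenization (runs of alphanumeric characters count as words), and the sample email are assumptions for illustration, not the code used to build the dataset.

```python
import re

def word_frequency(text, word):
    """Percentage of words in `text` equal to `word` (case-insensitive)."""
    words = re.findall(r'[A-Za-z0-9]+', text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)

def char_frequency(text, char):
    """Percentage of characters in `text` equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_run_lengths(text):
    """Average, longest, and total length of uninterrupted uppercase runs."""
    runs = [len(run) for run in re.findall(r'[A-Z]+', text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

email = "WIN a FREE prize! Claim your free money now!"
print(word_frequency(email, 'free'))   # 2 of 9 words
print(char_frequency(email, '!'))      # 2 of 44 characters
print(capital_run_lengths(email))      # runs: "WIN", "FREE", "C"
```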
### **Examples of Feature Names**
Here are some sample features and their typical relevance:
| Feature Name | Meaning | Example Usage |
|---|---|---|
| Word Frequency (`make`) | % occurrence of the word “make” in the email | Higher in promotional emails (e.g., “Make money fast”). |
| Word Frequency (`free`) | % occurrence of the word “free” | Common in spam emails offering free items. |
| Word Frequency (`money`) | % occurrence of the word “money” | Used in spam about financial offers. |
| Character Frequency (`!`) | % occurrence of the “!” character | Often used for emphasis in spam emails. |
| Capital Run Length (Longest) | Longest sequence of uppercase letters | Spammers may capitalize entire words like “WIN A FREE VACATION!” |
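Generic names like `Feature_16` are hard to interpret, so it can help to load the data with descriptive column names instead. The sketch below builds such a list; the word and character order is as I recall it from the dataset's `spambase.names` documentation file and should be verified against your copy before relying on it.

```python
# Word and character lists as recalled from spambase.names (verify before use).
words = [
    'make', 'address', 'all', '3d', 'our', 'over', 'remove', 'internet',
    'order', 'mail', 'receive', 'will', 'people', 'report', 'addresses',
    'free', 'business', 'email', 'you', 'credit', 'your', 'font', '000',
    'money', 'hp', 'hpl', 'george', '650', 'lab', 'labs', 'telnet', '857',
    'data', '415', '85', 'technology', '1999', 'parts', 'pm', 'direct',
    'cs', 'meeting', 'original', 'project', 're', 'edu', 'table', 'conference',
]
chars = [';', '(', '[', '!', '$', '#']

descriptive_names = (
    [f'word_freq_{w}' for w in words]          # features 1-48
    + [f'char_freq_{c}' for c in chars]        # features 49-54
    + ['capital_run_length_average',           # feature 55
       'capital_run_length_longest',           # feature 56
       'capital_run_length_total']             # feature 57
    + ['Spam']                                 # target column
)
print(len(descriptive_names))
```

The list can then be passed as `names=descriptive_names` to `pd.read_csv('spambase.data', header=None, ...)` in place of the generic `column_names`.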
### **General Observations**
- Spam Characteristics:
  - Spam emails tend to have a high frequency of promotional words like `"free"`, `"money"`, and `"win"`.
  - Symbols like `"!"` and `"$"` are more common in spam emails.
  - Spam emails often use capitalized words to grab attention.
- Non-Spam Characteristics:
  - Non-spam (ham) emails generally have more balanced and natural word distributions.
  - They are less likely to include excessive capitalizations or symbols.
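These observations can be checked on the data by comparing per-class feature means. The sketch below uses synthetic values that were deliberately generated to be higher for spam, purely so the example runs on its own; the column names and numbers are hypothetical, and with the real data you would group `spambase_data` by `Spam` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
spam = rng.integers(0, 2, size=n)

# Synthetic features, made larger for spam rows for illustration only.
demo = pd.DataFrame({
    'word_freq_free': rng.exponential(0.1 + 0.5 * spam),
    'char_freq_!': rng.exponential(0.1 + 0.4 * spam),
    'Spam': spam,
})

# Mean of every feature within each class (index 0 = non-spam, 1 = spam).
group_means = demo.groupby('Spam').mean()
print(group_means)

# Features ranked by how much higher their mean is in spam than in non-spam.
diff = (group_means.loc[1] - group_means.loc[0]).sort_values(ascending=False)
print(diff)
```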
If you’d like, I can dive into any specific feature or provide a code example to analyze feature distributions!