SMS / Email Spam or Not Classification using Naive Bayes Classifier

Categories: Machine Learning

Tags:

This project uses the SMS Spam Collection Dataset, a popular dataset for natural language processing tasks, especially spam detection.

Key details:

Source: UCI Machine Learning Repository
Link to dataset
Description: A set of SMS messages in English, each labeled as one of:
- ham – legitimate (non-spam) messages
- spam – unsolicited (spam) messages
Size: 5,574 SMS messages in total

This dataset is commonly used for training and evaluating machine learning models in text classification, especially spam filters.

Code for the project: https://colab.research.google.com/drive/1Yo1hJ4I_GYHpBeztYKoZe9cgRbTJTpqD#scrollTo=2ETQH0THr8ne

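import pandas as pd

# Load the dataset (the path assumes the CSV was uploaded to the Colab session)
df = pd.read_csv("/content/spam.csv")

# Preview the first rows
df.head()

# Count how many messages fall in each class (ham vs. spam)
df.Category.value_counts()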

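# Encode the label as a binary target: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

from sklearn.model_selection import train_test_split
# Hold out a test set (scikit-learn's default keeps 25% of the rows for testing)
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam)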
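
The split above relies on scikit-learn's defaults, so the rows that end up in the test set change from run to run and the share of spam can differ between the two parts. A minimal sketch of a repeatable, stratified split is shown here. The toy DataFrame, `test_size=0.25`, and `random_state=42` are assumptions for illustration, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for spam.csv: 8 ham rows and 4 spam rows
toy = pd.DataFrame({
    "Message": [f"ham message {i}" for i in range(8)]
               + [f"spam offer {i}" for i in range(4)],
    "spam": [0] * 8 + [1] * 4,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.Message,
    toy.spam,
    test_size=0.25,     # assumed value; matches scikit-learn's default
    stratify=toy.spam,  # keep the ham/spam ratio the same in both parts
    random_state=42,    # assumed seed so the split is repeatable
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # spam share in each part
```

Stratifying matters more on the real dataset, where spam is the minority class and an unlucky random split could leave the test set with very few spam messages.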

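from sklearn.feature_extraction.text import CountVectorizer
# Learn the vocabulary from the training messages only and turn each message into a vector of word counts
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
# Peek at the first two rows (dense view of the sparse count matrix)
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)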
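
The output above is almost all zeros because each message uses only a handful of the words in the full vocabulary. To make the bag-of-words encoding easier to see, here is a small sketch on two made-up sentences. The sentences and variable names are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages, short enough to inspect by hand
docs = ["free prize free entry", "see you at lunch"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)
print(counts.toarray())

# Words that were not seen during fit ("tomorrow") are silently ignored
unseen = cv.transform(["free lunch tomorrow"]).toarray()
print(unseen)
```

Each column corresponds to one vocabulary word and each cell holds how often that word occurs in the message, so "free" gets a count of 2 in the first row.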

from sklearn.naive_bayes import MultinomialNB
# Multinomial Naive Bayes works directly on the word-count features
model = MultinomialNB()
model.fit(X_train_count, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
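
The notebook holds out `X_test` and `y_test` but the cells shown here never score the model on them. One way to do that, sketched on a tiny made-up dataset so the block runs on its own, is to transform the held-out messages with the already fitted vectorizer and compare predictions with the true labels. The messages, labels, and variable names below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for X_train / y_train and X_test / y_test
train_msgs = [
    "win a free prize now",
    "free cash offer",
    "are we meeting for lunch",
    "call me when you are home",
]
train_y = [1, 1, 0, 0]  # 1 = spam, 0 = ham
test_msgs = ["free prize offer", "lunch at home"]
test_y = [1, 0]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_msgs), train_y)

# Use transform (not fit_transform) so test columns match the training vocabulary
test_counts = vec.transform(test_msgs)
pred = clf.predict(test_counts)

acc = accuracy_score(test_y, pred)
# Rows are true classes, columns are predicted classes, in the order [ham, spam]
cm = confusion_matrix(test_y, pred, labels=[0, 1])
print(acc)
print(cm)
```

In the notebook itself the equivalent step would be `model.score(v.transform(X_test), y_test)`. Because spam is the minority class, the confusion matrix is usually more informative than accuracy alone.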

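# Two unseen messages: the first reads like a personal note, the second like a promotion
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
# Reuse the fitted vectorizer (transform, not fit_transform) so the columns match the training vocabulary
emails_count = v.transform(emails)
model.predict(emails_count)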
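
Keeping the vectorizer and the model as two separate objects makes it easy to forget the `transform` step at prediction time. A possible alternative, sketched here on made-up data, is to chain both in a scikit-learn `Pipeline` and to read class probabilities rather than hard labels. The training messages and variable names are hypothetical; the notebook itself does not use a pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data (1 = spam, 0 = ham)
train_msgs = [
    "exclusive discount offer just for you",
    "claim your free reward today",
    "can we watch the game tomorrow",
    "see you at the office",
]
train_y = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_msgs, train_y)

new_msgs = ["free discount reward", "watch the game at the office"]
labels = pipe.predict(new_msgs)
proba = pipe.predict_proba(new_msgs)  # columns follow pipe.classes_

print(pipe.classes_)
print(labels)
print(proba.round(3))
```

Probabilities make it possible to pick a stricter threshold than 0.5 when wrongly flagging a legitimate message is more costly than letting a spam message through.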