The SMS Spam Collection Dataset is a popular dataset for natural language processing tasks, especially spam detection.
Key details:
- Source: UCI Machine Learning Repository
- Description: A set of English SMS messages, each labeled as one of:
  - ham: legitimate (non-spam) messages
  - spam: unsolicited messages
- Size: 5,574 SMS messages in total
This dataset is commonly used for training and evaluating machine learning models in text classification, especially spam filters.
Project code: https://colab.research.google.com/drive/1Yo1hJ4I_GYHpBeztYKoZe9cgRbTJTpqD#scrollTo=2ETQH0THr8ne
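import pandas as pd
# Load the dataset; the code below expects "Category" and "Message" columns
df = pd.read_csv("/content/spam.csv")
df.head()
# Count how many messages fall under each label
df.Category.value_counts()
# Encode the label as a number: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()
from sklearn.model_selection import train_test_split
# Hold out part of the data for testing (25% by default)
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam)
from sklearn.feature_extraction.text import CountVectorizer
# Learn the vocabulary from the training messages and convert them to word counts
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
# Show the first two rows of the count matrix as a dense array
X_train_count.toarray()[:2]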
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
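The array above is mostly zeros because each row has one column per vocabulary word, and a single SMS uses only a handful of them. As a rough illustration of what the count matrix represents, here is a minimal standard-library sketch that approximates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, alphabetically ordered columns). The function names and the sample messages are hypothetical and are not part of the project code.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # approximating CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(texts):
    # Map each distinct token to a column index, in alphabetical order.
    tokens = set()
    for text in texts:
        tokens.update(tokenize(text))
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def count_vector(text, vocabulary):
    # One row of the matrix: how often each vocabulary word occurs in the text.
    # Words that were not seen when building the vocabulary are ignored.
    row = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            row[vocabulary[tok]] += 1
    return row

# Hypothetical messages, not taken from the dataset.
train_texts = ["Free prize, claim your free prize now", "See you at lunch"]
vocab = build_vocabulary(train_texts)
rows = [count_vector(t, vocab) for t in train_texts]
new_row = count_vector("Free lunch tomorrow", vocab)

print(sorted(vocab, key=vocab.get))
print(rows)
print(new_row)
```

Note that "tomorrow" contributes nothing to the last row because it was not in the training vocabulary. The same thing happens when a fitted vectorizer transforms new text.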
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
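The parameters printed above hint at how the classifier works: alpha=1.0 is additive (Laplace) smoothing on the word counts, and fit_prior=True means class priors are estimated from the training labels. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: one class label per document.
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # Class priors estimated from label frequencies (the fit_prior=True idea).
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Smoothed likelihoods: (count + alpha) / (total + alpha * vocabulary size).
    log_likelihood = {}
    for c in class_counts:
        denom = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    # Pick the class with the highest log posterior; unseen tokens are skipped.
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = spam, 0 = ham.
docs = [
    ["free", "prize", "now"],
    ["win", "free", "cash"],
    ["lunch", "at", "noon"],
    ["see", "you", "at", "lunch"],
]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_nb(docs, labels)

print(predict_nb(["free", "cash"], log_prior, log_likelihood))
print(predict_nb(["lunch", "at"], log_prior, log_likelihood))
```

Without smoothing, a word that never appeared in one class would get probability zero there and wipe out that class's score. Adding alpha to every count avoids this.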
emails = [
'Hey mohan, can we get together to watch footbal game tomorrow?',
'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]
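# Use transform (not fit_transform) so the columns match the training vocabulary
emails_count = v.transform(emails)
# Returns one label per message: 1 for spam, 0 for ham
model.predict(emails_count)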
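The split earlier holds out X_test and y_test, which can be used to check how well the model generalizes. Because ham greatly outnumbers spam in this dataset, accuracy alone can look good even when spam is missed, so precision and recall for the spam class are worth reporting as well. Below is a minimal sketch of those metrics in plain Python. The helper names and the label lists are hypothetical and are not results from this model.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for the chosen positive class.
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluate(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels for illustration only: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```

With the objects defined earlier, the predictions could come from model.predict(v.transform(X_test)) and the true labels from y_test.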
