Below is a Python script demonstrating a complete text preprocessing pipeline:
Input Text
text = "Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"
Step-by-Step Preprocessing Script
- Import Libraries:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
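# Download necessary NLTK data
# (newer NLTK releases load the tokenizer from 'punkt_tab' rather than 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')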
nltk.download('stopwords')
nltk.download('wordnet')
- Lowercase Conversion:
text = text.lower()
print("Lowercase Text:", text)
# Output: "hello there! this is an example sentence, showing the basics of text preprocessing. don't worry about the process—it’s simple!"
- Remove Special Characters and Numbers:
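# Keep only letters and whitespace. Characters are deleted, not replaced with a space,
# so words joined by a dash or apostrophe get merged (see 'processits' in the output).
text = re.sub(r'[^a-zA-Z\s]', '', text)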
print("Cleaned Text:", text)
# Output: "hello there this is an example sentence showing the basics of text preprocessing dont worry about the processits simple"
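Deleting every non-letter outright is what produces the merged token "processits" in the output above. A possible variation, sketched here under the assumption that punctuation should act as a word boundary, replaces each run of non-letters with a single space instead. The helper name clean_text is hypothetical and is not part of the original script.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and turn every run of non-letters into one space."""
    # Substituting a space (rather than '') keeps the words on either side
    # of a dash or apostrophe from being glued together.
    spaced = re.sub(r"[^a-z]+", " ", text.lower())
    return spaced.strip()

sample = "Don't worry about the process\u2014it\u2019s simple!"
print(clean_text(sample))
# → don t worry about the process it s simple
```

The contractions now leave short fragments such as "t" and "s" behind. These appear in NLTK's English stop word list, so the stop word step that follows should remove them.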
- Tokenization:
tokens = word_tokenize(text)
print("Tokens:", tokens)
# Output: ['hello', 'there', 'this', 'is', 'an', 'example', 'sentence', 'showing', 'the', 'basics', 'of', 'text', 'preprocessing', 'dont', 'worry', 'about', 'the', 'processits', 'simple']
- Remove Stop Words:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Filtered Tokens:", filtered_tokens)
# Output: ['hello', 'example', 'sentence', 'showing', 'basics', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
- Stemming:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)
# Output: ['hello', 'exampl', 'sentenc', 'show', 'basic', 'text', 'preprocess', 'dont', 'worri', 'processit', 'simpl']
- Lemmatization:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)
# Output: ['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
Final Output
- Original Text:
"Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"
- Preprocessed Text (lemmatized tokens):
['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
Key Notes:
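- Stop Word Removal: Drops very common words such as “is” and “the”, which carry little meaning on their own.
- Stemming vs. Lemmatization (the script applies both to filtered_tokens as alternatives; a pipeline typically uses one):
  - Stemming strips suffixes with heuristic rules, so results such as “exampl” or “worri” may not be real words.
  - Lemmatization maps each word to its dictionary form (lemma), but it needs the right part of speech to do so; “showing” stays unchanged above because WordNetLemmatizer looks words up as nouns by default.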
This pipeline can be customized based on specific NLP tasks, such as sentiment analysis, text classification, or chatbot development!
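One way to make the steps swappable per task is to pass them in as arguments. The sketch below is an illustration under stated assumptions rather than the script above: the function name preprocess, its parameters, and the small stop word set are hypothetical, and it uses str.split in place of word_tokenize so that it runs without any NLTK data.

```python
import re
from typing import Callable, Iterable, List, Optional

def preprocess(
    text: str,
    stop_words: Iterable[str] = (),
    normalize: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Lowercase, strip non-letters, tokenize, filter, then normalize."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    tokens = cleaned.split()  # stand-in for nltk's word_tokenize
    blocked = set(stop_words)
    kept = [tok for tok in tokens if tok not in blocked]
    if normalize is not None:
        # Optional final step, e.g. a stemmer or lemmatizer callable
        kept = [normalize(tok) for tok in kept]
    return kept

# Illustrative stop word set; the script above uses stopwords.words('english').
demo_stop_words = {"this", "is", "an", "the", "of"}

print(preprocess("This is an example sentence!", demo_stop_words))
# → ['example', 'sentence']
```

Passing normalize=PorterStemmer().stem or normalize=WordNetLemmatizer().lemmatize would plug in the stemming or lemmatization step from the script.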