Categories: NLP
Tags:

Below is a Python script demonstrating a complete text preprocessing pipeline:


Input Text

text = "Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"

Step-by-Step Preprocessing Script

  1. Import Libraries:
   import re
   import nltk
   from nltk.corpus import stopwords
   from nltk.tokenize import word_tokenize
   from nltk.stem import PorterStemmer, WordNetLemmatizer

   # Download necessary NLTK data
   nltk.download('punkt')
   nltk.download('punkt_tab')  # required by word_tokenize on newer NLTK releases
   nltk.download('stopwords')
   nltk.download('wordnet')
  2. Lowercase Conversion:
   text = text.lower()
   print("Lowercase Text:", text)
   # Output: "hello there! this is an example sentence, showing the basics of text preprocessing. don't worry about the process—it’s simple!"
  3. Remove Special Characters and Numbers:
   text = re.sub(r'[^a-zA-Z\s]', '', text)
   print("Cleaned Text:", text)
   # Output: "hello there this is an example sentence showing the basics of text preprocessing dont worry about the processits simple"
  4. Tokenization:
   tokens = word_tokenize(text)
   print("Tokens:", tokens)
   # Output: ['hello', 'there', 'this', 'is', 'an', 'example', 'sentence', 'showing', 'the', 'basics', 'of', 'text', 'preprocessing', 'dont', 'worry', 'about', 'the', 'processits', 'simple']
  5. Remove Stop Words:
   stop_words = set(stopwords.words('english'))
   filtered_tokens = [word for word in tokens if word not in stop_words]
   print("Filtered Tokens:", filtered_tokens)
   # Output: ['hello', 'example', 'sentence', 'showing', 'basics', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
  6. Stemming:
   stemmer = PorterStemmer()
   stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
   print("Stemmed Tokens:", stemmed_tokens)
   # Output: ['hello', 'exampl', 'sentenc', 'show', 'basic', 'text', 'preprocess', 'dont', 'worri', 'processit', 'simpl']
  7. Lemmatization:
   lemmatizer = WordNetLemmatizer()
   # lemmatize() treats every word as a noun by default, which is why
   # "showing" passes through unchanged below; pass pos='v' to get "show".
   lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
   print("Lemmatized Tokens:", lemmatized_tokens)
   # Output: ['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']

Final Output

  • Original Text:
    "Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"
  • Preprocessed Text:
    ['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
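The seven steps above can be wrapped into one reusable function. Below is a minimal, dependency-free sketch: a regex cleaner plus whitespace tokenization and a small hand-picked stop-word set stand in for NLTK's word_tokenize and its English stop-word corpus (the function name and the stop-word set are illustrative choices, not part of NLTK):

```python
import re

def preprocess(text, stop_words=None):
    """Lowercase, strip non-letters, tokenize, and drop stop words."""
    if stop_words is None:
        # Tiny illustrative stop-word set; swap in stopwords.words('english')
        # when NLTK is available.
        stop_words = {"this", "is", "an", "the", "of", "about", "there"}
    text = text.lower()                    # Step 2: lowercase
    text = re.sub(r"[^a-z\s]", "", text)   # Step 3: remove punctuation/digits
    tokens = text.split()                  # Step 4: whitespace tokenization
    return [t for t in tokens if t not in stop_words]  # Step 5: stop words

tokens = preprocess("Hello there! This is an example sentence, "
                    "showing the basics of text preprocessing.")
print(tokens)
# → ['hello', 'example', 'sentence', 'showing', 'basics', 'text', 'preprocessing']
```

From here you can feed the returned tokens into the stemming or lemmatization step as shown in the script.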

Key Notes:

  • Stop Words Removal: drops very common words like “is” and “the” that carry little meaning for most tasks.
  • Stemming vs. Lemmatization:
    • Stemming applies heuristic suffix-stripping and can produce non-words (e.g., “exampl”, “worri” above).
    • Lemmatization returns valid dictionary forms and works best when told each word’s part of speech.
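The difference is easy to see with a toy sketch (the two helpers below are illustrative stand-ins, not NLTK's PorterStemmer or WordNetLemmatizer):

```python
def toy_stem(word):
    """Blindly chop common suffixes, Porter-style: fast, but may leave non-words."""
    for suffix in ("ing", "ed", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer instead maps words to real dictionary forms via a vocabulary.
TOY_LEMMAS = {"worried": "worry", "showing": "show", "basics": "basic"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("worried"))       # worri  (not a real word)
print(toy_lemmatize("worried"))  # worry  (valid dictionary form)
```

This is why stemming is cheaper but noisier, while lemmatization needs a vocabulary (and ideally part-of-speech information) to stay accurate.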

This pipeline can be customized based on specific NLP tasks, such as sentiment analysis, text classification, or chatbot development!
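For instance, a sentiment-analysis pipeline usually keeps negation words that a stock stop-word list would discard, since they flip polarity (the small sets below are illustrative, not NLTK's full English list):

```python
stop_words = {"this", "is", "an", "the", "not", "no"}
negations = {"not", "no", "never"}

# For sentiment tasks, remove negations from the stop-word set so that
# polarity-flipping words survive filtering.
sentiment_stop_words = stop_words - negations

tokens = "this movie is not an example of good writing".split()
filtered = [t for t in tokens if t not in sentiment_stop_words]
print(filtered)  # ['movie', 'not', 'example', 'of', 'good', 'writing']
```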