Categories: NLP
Tags:

Below is a Python script demonstrating a complete text preprocessing pipeline:


Input Text

text = "Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"

Step-by-Step Preprocessing Script

  1. Import Libraries:
   import re
   import nltk
   from nltk.corpus import stopwords
   from nltk.tokenize import word_tokenize
   from nltk.stem import PorterStemmer, WordNetLemmatizer

   # Download necessary NLTK data
   nltk.download('punkt')
   nltk.download('punkt_tab')  # required by word_tokenize on newer NLTK releases
   nltk.download('stopwords')
   nltk.download('wordnet')
  2. Lowercase Conversion:
   text = text.lower()
   print("Lowercase Text:", text)
   # Output: "hello there! this is an example sentence, showing the basics of text preprocessing. don't worry about the process—it’s simple!"
  3. Remove Special Characters and Numbers:
   text = re.sub(r'[^a-zA-Z\s]', '', text)
   print("Cleaned Text:", text)
   # Output: "hello there this is an example sentence showing the basics of text preprocessing dont worry about the processits simple"
  4. Tokenization:
   tokens = word_tokenize(text)
   print("Tokens:", tokens)
   # Output: ['hello', 'there', 'this', 'is', 'an', 'example', 'sentence', 'showing', 'the', 'basics', 'of', 'text', 'preprocessing', 'dont', 'worry', 'about', 'the', 'processits', 'simple']
  5. Remove Stop Words:
   stop_words = set(stopwords.words('english'))
   filtered_tokens = [word for word in tokens if word not in stop_words]
   print("Filtered Tokens:", filtered_tokens)
   # Output: ['hello', 'example', 'sentence', 'showing', 'basics', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
  6. Stemming:
   stemmer = PorterStemmer()
   stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
   print("Stemmed Tokens:", stemmed_tokens)
   # Output: ['hello', 'exampl', 'sentenc', 'show', 'basic', 'text', 'preprocess', 'dont', 'worri', 'processit', 'simpl']
  7. Lemmatization:
   lemmatizer = WordNetLemmatizer()
   # lemmatize() treats every word as a noun by default, which is why
   # "showing" passes through unchanged below; pass pos='v' to get "show".
   lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
   print("Lemmatized Tokens:", lemmatized_tokens)
   # Output: ['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']

Final Output

  • Original Text:
    "Hello there! This is an example sentence, showing the basics of text preprocessing. Don't worry about the process—it’s simple!"
  • Preprocessed Text:
    ['hello', 'example', 'sentence', 'showing', 'basic', 'text', 'preprocessing', 'dont', 'worry', 'processits', 'simple']
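The seven steps above can be wrapped into one reusable function. Below is a minimal, dependency-free sketch: a regex cleaner plus whitespace tokenization and a small hand-picked stop-word set stand in for NLTK's word_tokenize and its English stop-word corpus (the function name and the stop-word set are illustrative choices, not part of NLTK):

```python
import re

def preprocess(text, stop_words=None):
    """Lowercase, strip non-letters, tokenize, and drop stop words."""
    if stop_words is None:
        # Tiny illustrative stop-word set; swap in stopwords.words('english')
        # when NLTK is available.
        stop_words = {"this", "is", "an", "the", "of", "about", "there"}
    text = text.lower()                    # Step 2: lowercase
    text = re.sub(r"[^a-z\s]", "", text)   # Step 3: remove punctuation/digits
    tokens = text.split()                  # Step 4: whitespace tokenization
    return [t for t in tokens if t not in stop_words]  # Step 5: stop words

tokens = preprocess("Hello there! This is an example sentence, "
                    "showing the basics of text preprocessing.")
print(tokens)
# → ['hello', 'example', 'sentence', 'showing', 'basics', 'text', 'preprocessing']
```

From here you can feed the returned tokens into the stemming or lemmatization step as shown in the script.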

Key Notes:

  • Stop Words Removal: drops very common words like “is” and “the” that carry little meaning for most tasks.
  • Stemming vs. Lemmatization:
    • Stemming applies heuristic suffix-stripping and can produce non-words (e.g., “exampl”, “worri” above).
    • Lemmatization returns valid dictionary forms and works best when told each word’s part of speech.
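The difference is easy to see with a toy sketch (the two helpers below are illustrative stand-ins, not NLTK's PorterStemmer or WordNetLemmatizer):

```python
def toy_stem(word):
    """Blindly chop common suffixes, Porter-style: fast, but may leave non-words."""
    for suffix in ("ing", "ed", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer instead maps words to real dictionary forms via a vocabulary.
TOY_LEMMAS = {"worried": "worry", "showing": "show", "basics": "basic"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(toy_stem("worried"))       # worri  (not a real word)
print(toy_lemmatize("worried"))  # worry  (valid dictionary form)
```

This is why stemming is cheaper but noisier, while lemmatization needs a vocabulary (and ideally part-of-speech information) to stay accurate.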

This pipeline can be customized based on specific NLP tasks, such as sentiment analysis, text classification, or chatbot development!
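For instance, a sentiment-analysis pipeline usually keeps negation words that a stock stop-word list would discard, since they flip polarity (the small sets below are illustrative, not NLTK's full English list):

```python
stop_words = {"this", "is", "an", "the", "not", "no"}
negations = {"not", "no", "never"}

# For sentiment tasks, remove negations from the stop-word set so that
# polarity-flipping words survive filtering.
sentiment_stop_words = stop_words - negations

tokens = "this movie is not an example of good writing".split()
filtered = [t for t in tokens if t not in sentiment_stop_words]
print(filtered)  # ['movie', 'not', 'example', 'of', 'good', 'writing']
```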