Text Preprocessing in Python | Set 2
Last Updated :
18 Mar, 2024
Text Preprocessing is one of the initial steps of Natural Language Processing (NLP) that involves cleaning and transforming raw data into a form suitable for further processing. It improves the quality of the text, makes it easier to work with and improves the performance of machine learning models.
In this article, we will look at some more advanced text preprocessing techniques.
Prerequisites
Before starting with this article, you need to go through the Text Preprocessing in Python | Set 1.
Also, refer to this article to learn more about Natural Language Processing - Introduction to NLP
We have already seen the basic preprocessing steps for textual data. The techniques below build on them to gain more insight into the data. Let's import the necessary libraries.
Python3
# import the necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import string
import re
Part of Speech Tagging
The part of speech explains how a word is used in a sentence. In a sentence, a word can have different contexts and semantic meanings. Basic natural language processing models like bag-of-words fail to identify these relations between words. Hence, we use part of speech tagging to mark each word with its part of speech tag based on its context in the data. It is also used to extract relationships between words.
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# convert text into word tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagging('You just gave me a scare')
Output:
[('You', 'PRP'),
('just', 'RB'),
('gave', 'VBD'),
('me', 'PRP'),
('a', 'DT'),
('scare', 'NN')]
In the given example, PRP stands for personal pronoun, RB for adverb, VBD for verb past tense, DT for determiner and NN for noun. We can get the details of all the part of speech tags using the Penn Treebank tagset.
Python3
# download the tagset
nltk.download('tagsets')
# extract information about the tag
nltk.help.upenn_tagset('NN')
Output:
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
Chunking
Chunking is the process of extracting phrases from unstructured text and giving it more structure. It is also known as shallow parsing. It is done on top of part of speech tagging. It groups words into "chunks", mainly noun phrases. Chunking is done using regular expressions.
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# define chunking function with text and a regular
# expression representing the grammar as parameters
def chunking(text, grammar):
    word_tokens = word_tokenize(text)
    # label words with part of speech
    word_pos = pos_tag(word_tokens)
    # create a chunk parser using the grammar
    chunkParser = nltk.RegexpParser(grammar)
    # parse the list of pos-tagged word tokens
    tree = chunkParser.parse(word_pos)
    for subtree in tree.subtrees():
        print(subtree)

sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)
Output:
(S
  (NP the/DT little/JJ yellow/JJ bird/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
In the given example, the grammar is defined using a simple regular expression rule. The rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Libraries like spaCy and TextBlob are better suited for chunking.
Named Entity Recognition
Named Entity Recognition is used to extract information from unstructured text. It classifies the entities present in a text into categories like person, organization, event, place, etc. It gives us detailed knowledge about the text and the relationships between the different entities.
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# ne_chunk needs these additional resources
nltk.download('maxent_ne_chunker')
nltk.download('words')

def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
    # print the tree of word entities
    print(ne_chunk(word_pos))

text = 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'
named_entity_recognition(text)
Output:
(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION GeeksforGeeks/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)
Conclusion
In conclusion, natural language processing (NLP) plays a pivotal role in bridging the gap between human communication and computer understanding. As this field progresses, we can anticipate further innovations that will reshape how we communicate with and leverage the capabilities of intelligent systems in our daily lives and professional endeavors.