Python - Efficient Text Data Cleaning
Last Updated: 18 Oct, 2021
Gone are the days when we had data mostly in row-column, or structured, format. Today the data being collected is more unstructured than structured: we have data in the form of text, images, audio, etc., and the ratio of structured to unstructured data has decreased over the years, with unstructured data growing at 55-65% every year.
Thus, we need to learn how to work with unstructured data so that we can extract relevant information from it and make it useful. When working with text data, it is very important to pre-process it before using it for predictions or analysis.
In this article, we will learn various text data cleaning techniques using Python.
Let's take a tweet for example:
I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN
We will be performing data cleaning on this tweet step-wise.
Steps for Data Cleaning
1) Clear out HTML characters: A lot of HTML entities like &amp;, &lt;, &gt;, etc. can be found in most of the data available on the web. We need to get rid of these from our data. You can do this in two ways:
- By using specific regular expressions (a sketch of this approach follows the code below), or
- By using modules or packages available (Python's built-in html module)
We will be using the module already available in Python.
Code:
python3
# Escaping out HTML characters
import html

tweet = "I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"

# html.unescape converts entities such as &amp; back into plain characters
# (HTMLParser().unescape was deprecated and removed in Python 3.9)
tweet = html.unescape(tweet)
print("After removing HTML characters the tweet is:-\n{}".format(tweet))
Output:
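If you prefer the regex route mentioned above, here is a minimal sketch; the entity_map below covers only a few common entities for illustration, whereas html.unescape handles the full set:
python3
# illustrative regex-based unescaping with a small, hand-picked entity map
import re

entity_map = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"'}

def unescape_basic(text):
    # build one alternation pattern from the known entities
    pattern = re.compile("|".join(re.escape(e) for e in entity_map))
    return pattern.sub(lambda m: entity_map[m.group(0)], text)

print(unescape_basic("Tom &amp; Jerry &lt;3"))  # Tom & Jerry <3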
2) Encoding & Decoding Data: This is the process of converting information from simple, understandable characters to complex symbols and vice versa. There are different forms of encoding & decoding, like "UTF-8", "ASCII", etc., available for text data. We should keep our data in a standard encoding format; the most common format is UTF-8.
The given tweet is already in the UTF-8 format, so we encode it to ASCII and then decode it back to UTF-8 to illustrate the process.
Code:
python3
# encode from UTF-8 to ASCII; 'ignore' drops characters with no ASCII form
encode_tweet = tweet.encode('ascii', 'ignore')
print("encode_tweet = \n{}".format(encode_tweet))

# decode the ASCII bytes back into a UTF-8 string
decode_tweet = encode_tweet.decode(encoding='UTF-8')
print("decode_tweet = \n{}".format(decode_tweet))
Output:
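Note that errors='ignore' silently drops any character that has no ASCII representation (emoji, accented letters and so on), so this round trip doubles as a non-ASCII filter. A quick standard-library illustration:
python3
# characters outside the ASCII range are dropped by errors='ignore'
sample = "café ☕ is open"
print(sample.encode('ascii', 'ignore').decode('UTF-8'))  # caf  is open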
3) Removing URLs, Hashtags and Styles: A text dataset can contain hyperlinks, hashtags, or styles such as the old-style retweet text "RT" in a Twitter dataset. These provide no relevant information and can be removed. For hashtags, only the hash sign '#' is removed. We will use the re library to perform these regular expression operations.
Code:
python3
# library for regular expressions
import re

# remove hyperlinks
tweet = re.sub(r'https?://\S+', '', tweet)

# remove hashtags
# only removing the hash # sign from the word
tweet = re.sub(r'#', '', tweet)

# remove old style retweet text "RT"
tweet = re.sub(r'^RT[\s]+', '', tweet)

print("After removing Hashtags,URLs and Styles the tweet is:-\n{}".format(tweet))
Output:
4) Contraction Replacement: The text data might contain apostrophes used for contractions, e.g. "didn't" for "did not". These can change the sense of a word or sentence, so we need to expand such contractions into their standard forms. To do so, we can use a dictionary that maps each contraction to the value it should be replaced with.
A few of the contractions used are:
n't --> not
'll --> will
's --> is
'd --> would
'm --> am
've --> have
're --> are
Code:
python3
# dictionary mapping each contraction to its expanded value
Apos_dict = {"'s": " is", "n't": " not", "'m": " am", "'ll": " will",
             "'d": " would", "'ve": " have", "'re": " are"}

# replace the contractions
for key, value in Apos_dict.items():
    if key in tweet:
        tweet = tweet.replace(key, value)

print("After Contraction replacement the tweet is:-\n{}".format(tweet))
Output:
5) Split attached words: Some words are joined together, for example "ForTheWin". These need to be separated to extract their meaning; after splitting, it becomes "For The Win".
Code:
python3
import re

# split on capitalized word boundaries; the capture group keeps the pieces
tweet = " ".join([s for s in re.split("([A-Z][a-z]+[^A-Z]*)", tweet) if s])
print("After splitting attached words the tweet is:-\n{}".format(tweet))
Output:
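The regex above relies on CamelCase capitalization, so an all-lowercase string like "forthewin" would pass through unchanged. For such cases, a dictionary-based splitter like the third-party wordninja package (an assumption here; it is not used elsewhere in this article) can help:
python3
# pip install wordninja
import wordninja

# probabilistic split based on English word frequencies
print(wordninja.split("forthewin"))  # ['for', 'the', 'win']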
6) Convert to lower case: Convert the text to lower case to avoid case-sensitivity issues.
Code:
python3
#convert to lower case
tweet=tweet.lower()
print("After converting to lower case the tweet is:-\n{}".format(tweet))
Output:
7) Slang lookup: Many slang words are used nowadays and can be found in text data, so we need to replace them with their meanings. We can use a dictionary of slang words, as we did for contraction replacement (a sketch of this approach follows the code below), or we can create a file consisting of the slang words and their meanings. Examples of slang words are:
asap --> as soon as possible
b4 --> before
lol --> laugh out loud
luv --> love
wtg --> way to go
Here we are using a file, slang.txt, which stores one word=meaning pair per line.
Code:
python3
# open the file slang.txt and read its contents
with open("slang.txt", "r") as file:
    slang = file.read()

# separate each line present in the file
slang = slang.split('\n')
tweet_tokens = tweet.split()
slang_word = []
meaning = []

# store the slang words and their meanings in separate lists
for line in slang:
    temp = line.split("=")
    slang_word.append(temp[0])
    meaning.append(temp[-1])

# replace each slang word with its meaning
for i, word in enumerate(tweet_tokens):
    if word in slang_word:
        idx = slang_word.index(word)
        tweet_tokens[i] = meaning[idx]

tweet = " ".join(tweet_tokens)
print("After slang replacement the tweet is:-\n{}".format(tweet))
Output:
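As mentioned above, the same lookup can be done with an in-memory dictionary instead of a file. A minimal sketch (the entries below are illustrative; extend them for your use case):
python3
# illustrative slang dictionary
slang_dict = {"luv": "love", "b4": "before", "lol": "laugh out loud",
              "bfn": "bye for now", "gn": "good night"}

# replace each token that appears in the dictionary
tweet = " ".join(slang_dict.get(word, word) for word in tweet.split())
print(tweet)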
8) Standardizing and Spell Check: There might be spelling errors in the text, or it might not be in the correct format, for example "drivng" for "driving" or "I misssss this" for "I miss this". We can correct these by using the autocorrect library for Python; there are other libraries available as well. First, install the library using the command-
#install autocorrect library
pip install autocorrect
Code:
python3
import itertools

# collapse any character repeated more than twice in a row down to two
# (e.g. "lovdddd" -> "lovdd", "itttt" -> "itt")
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
print("After standardizing the tweet is:-\n{}".format(tweet))

from autocorrect import Speller
spell = Speller(lang='en')

# spell check the tweet
tweet = spell(tweet)
print("After Spell check the tweet is:-\n{}".format(tweet))
Output:
9) Remove Stopwords: Stop words are words that occur frequently in the text but add no significant meaning to it. For this, we will use the nltk library, which consists of modules for pre-processing data and provides a list of stop words. You can also create your own stop-word list according to the use case (see the sketch after the code below).
First, make sure you have the nltk library installed. If not, install it using the command-
#install nltk library
pip install nltk
Code:
python3
import nltk

# download the stopwords corpus from nltk
nltk.download('stopwords')

# import the stopwords list
from nltk.corpus import stopwords

# English stop-word list from nltk
stopwords_eng = stopwords.words('english')

tweet_tokens = tweet.split()
tweet_list = []

# remove stopwords
for word in tweet_tokens:
    if word not in stopwords_eng:
        tweet_list.append(word)

print("tweet_list = {}".format(tweet_list))
Output:
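If the standard list removes words you care about (negations such as "not" often matter for sentiment analysis, for example), you can derive a custom list instead. A minimal sketch; which words to keep is an illustrative choice:
python3
from nltk.corpus import stopwords

# start from nltk's list but keep negations
custom_stopwords = set(stopwords.words('english')) - {"no", "not"}

tweet_list = [word for word in tweet.split() if word not in custom_stopwords]
print("tweet_list = {}".format(tweet_list))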
10) Remove Punctuations: Punctuation marks include !, <, @, #, &, $, etc.; Python's string.punctuation constant lists all of them.
Code:
python3
# for string operations
import string

clean_tweet = []

# drop tokens that are punctuation characters
for word in tweet_list:
    if word not in string.punctuation:
        clean_tweet.append(word)

print("clean_tweet = {}".format(clean_tweet))
Output:
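Note that the loop above only drops tokens made up of punctuation; punctuation glued to a word (e.g. "fun!") survives. To strip punctuation characters inside tokens as well, str.translate can be used; a minimal sketch:
python3
import string

# translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
clean_tweet = [w.translate(table) for w in tweet_list if w.translate(table)]
print("clean_tweet = {}".format(clean_tweet))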
These were some of the data cleaning techniques we usually perform on text data. You can also perform more advanced cleaning, such as grammar checking; a sketch of one option follows.
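For grammar checking, one option (not used in this article, so treat it as an assumption) is the third-party language_tool_python package, which wraps LanguageTool and requires Java to run locally:
python3
# pip install language-tool-python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')

# returns the text with LanguageTool's suggested corrections applied
print(tool.correct("he go to school every days"))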