Text Augmentation Using Corrupted-Text Python Library
Last Updated :
05 Sep, 2024
Text augmentation is an essential technique in Natural Language Processing (NLP) that helps improve model robustness by expanding the training data. One popular method is introducing corrupted or noisy text to simulate real-world scenarios where data may not always be clean.
This article explores how to implement text augmentation using the corrupted-text Python library, a tool that makes it easy to corrupt text data at different severities.
Text Augmentation in Natural Language Processing
Text augmentation involves modifying existing text data to create new variants. This can include changes in word order, synonym replacement, or introducing spelling errors. The goal is to enhance the dataset's variability, providing models with a wider range of inputs to learn from. This process helps in building more resilient and versatile NLP systems.
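To make the general idea concrete before turning to the library, a minimal augmentation step such as random synonym replacement can be sketched in plain Python. The tiny synonym table below is a made-up example for illustration only:

```python
import random

# Hypothetical, hand-written synonym table used only for this sketch
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def augment(sentence, seed=0):
    """Replace known words with a random synonym to create a new text variant."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    words = sentence.split()
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words)

print(augment("the quick fox is happy"))
```

Running the function with different seeds yields different variants of the same sentence, which is exactly how augmentation expands a training set.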
The Corrupted-Text Python Library offers a unique approach to text augmentation by applying model-independent corruptions. These corruptions mimic real-world errors, such as bad autocorrections and typos, making the augmented data more realistic. By using this library, developers can simulate out-of-distribution text scenarios and effectively test their models' robustness against such variations.
Corrupted-Text Python Library
The Corrupted-Text Python Library is designed to generate out-of-distribution text datasets through the application of common corruptions. Unlike model-specific adversarial approaches, this library focuses on realistic outliers to help researchers study model robustness. The corruptions are applied on a per-word basis, ensuring that each modification is independent and contributes to the overall variability of the text.
Corruptions implemented in this library include bad autocorrection, bad autocompletion, bad synonym replacement, and typographical errors. These alterations are based on common words, which are extracted from a base dataset. The corruptions mimic realistic text input errors found in everyday communication, such as incorrect autocorrections on mobile keyboards or dictionary-based translations without context.
The severity of corruption can be adjusted, allowing users to control the percentage of words affected. Higher severities result in more extensive corruption, which can significantly impact model accuracy. Users can also define weights for each corruption type, tailoring the augmentation process to their specific needs. The library provides insights into how such corruptions affect model performance, as demonstrated by the accuracies table included in the documentation.
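To illustrate the severity idea, here is a rough, library-independent sketch of how a per-word corruption rate might be applied. The adjacent-character swap and the sampling scheme are simplifications for illustration, not the library's actual implementation:

```python
import random

def swap_adjacent(word, rng):
    """A toy typo: swap two adjacent characters of a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def corrupt(text, severity, seed=0):
    """Corrupt roughly a `severity` fraction of words, independently per word."""
    rng = random.Random(seed)
    return " ".join(
        swap_adjacent(w, rng) if rng.random() < severity else w
        for w in text.split()
    )

print(corrupt("unions representing workers at the firm", severity=0.5))
```

With severity 0.0 the text passes through unchanged, while severity 1.0 corrupts every word, mirroring how higher severities in the library affect a larger share of the input.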
To install the Corrupted-Text Python Library, simply run this command:
pip install corrupted-text
Implementation: Text Augmentation Using Corrupted-Text Python Library
To implement the Corrupted-Text Python library, we first import the corrupted_text library and the load_dataset() function from Hugging Face's datasets library. For demonstration, we load the AG News dataset and create both the training and testing splits.
Python
from datasets import load_dataset
import corrupted_text

# Load the text column of the AG News train and test splits
train_data = load_dataset("ag_news", split="train")["text"]
test_data = load_dataset("ag_news", split="test")["text"]
Output:
Downloading readme: 100%
8.07k/8.07k [00:00<00:00, 99.7kB/s]
Downloading data: 100%
18.6M/18.6M [00:00<00:00, 20.6MB/s]
Downloading data: 100%
1.23M/1.23M [00:00<00:00, 4.62MB/s]
Generating train split: 100%
120000/120000 [00:00<00:00, 255345.38 examples/s]
Generating test split: 100%
7600/7600 [00:00<00:00, 124199.17 examples/s]
Next, we will fit the corruptor on both the training and testing datasets using the TextCorruptor class of the corrupted_text library. The base dataset is used to extract common words and compute the Levenshtein distances that drive realistic substitutions.
Python
# Fit the corruptor on the combined corpus; intermediate results are cached in .mycache
text_corruptor = corrupted_text.TextCorruptor(base_dataset=test_data + train_data, cache_dir=".mycache")
Output:
Calculating Levenshtein distances: 100%|██████████| 4000/4000 [00:07<00:00, 505.19it/s]
We will then proceed to corrupt a small sample of the test dataset with a severity of 0.5. This will generate a list of corrupted text strings, simulating realistic text errors. Finally, we will print the results in a formatted style to illustrate the differences between the original and corrupted texts, highlighting the potential impact of these corruptions on NLP models.
Python
sample_data = test_data[:10]
corruption_severity = 0.5

# Corrupt the sample; the seed makes the corruption reproducible
corrupted_texts = text_corruptor.corrupt(sample_data, severity=corruption_severity, seed=1)

print(f"==== Corruption Results with Severity {round(corruption_severity, 2)} ====")
for index in range(len(sample_data)):
    print(f"Original Text {index + 1}:")
    print(f"{sample_data[index]}\n")
    print(f"Corrupted Text {index + 1}:")
    print(f"{corrupted_texts[index]}\n")
    print("-" * 40)
Output:
Corrupting dataset: 100%|██████████| 3/3 [00:00<00:00, 252.20it/s]
==== Corruption Results with Severity 0.5 ====
Original Text 1:
Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
Corrupted Text 1:
Fekrs uor T N pension afternoon talks Unions representing worked at Turnbr Newall saying they are ' disappointing ' afterward talking with striker arena firms Federal Mogul.
----------------------------------------
Original Text 2:
The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.
Corrupted Text 2:
Tae Race is On : Second Private Team Sets Launch Date for Human Spaceflight ( SPACl. computer ) SPACE. com - TORnNTO, Canada -- A instant \ team hf rocketeers competing for the # 36 ; 10 zillion Ansari X Pruze, a contest yor \ pirates funded subscribers shape flights, has officially declared their first \ launch database for itk manned Eruca vesicaria sativ.
----------------------------------------
Original Text 3:
Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.
Corrupted Text 3:
yy. Company Wins Grant to Study Peptides ( AP ) AP - A company found away a chemistry researcher at theory University of Louisville wan a grant to develop a method of reducing better peptides, wyich are short chains of amino acids, nhe bidding blocks of proteins.
----------------------------------------
The output shows both the original and corrupted versions of the sample text after applying the augmentation process with a corruption severity of 0.5 using the corrupted-text library.
The corrupted version of the text introduces various types of errors, such as:
- Character-level changes: Swapping or altering letters (e.g., "Fears" → "Fekrs").
- Word-level substitutions: Replacing words with similar-sounding or nonsensical alternatives (e.g., "after" → "afternoon").
- Random insertion: Inserting unrelated terms or letters, such as "(SPACl. computer)".
- Grammatical and semantic changes: Altering the grammatical structure, making the text harder to understand.
The corruption severity of 0.5 introduces a moderate level of noise, making the text less readable but still somewhat understandable. This simulates real-world noisy data that can be used to improve the robustness of NLP models.
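One simple way to quantify how aggressive a given severity was is to compare the original and corrupted strings position by position. The helper below is a rough sketch (it ignores insertions and deletions that shift word positions), applied to the first example from the output above:

```python
def word_change_rate(original, corrupted):
    """Rough fraction of word positions that differ between two texts."""
    orig_words = original.split()
    corr_words = corrupted.split()
    n = min(len(orig_words), len(corr_words))
    if n == 0:
        return 0.0
    changed = sum(o != c for o, c in zip(orig_words, corr_words))
    return changed / n

original = "Fears for T N pension after talks"
corrupted = "Fekrs uor T N pension afternoon talks"
print(word_change_rate(original, corrupted))  # 3 of 7 words changed, about 0.43
```

A metric like this can be used to sanity-check that the requested severity roughly matches the fraction of words actually altered.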
Conclusion
In conclusion, the Corrupted-Text Python Library offers a robust solution for generating realistic out-of-distribution text datasets. By applying independent corruptions, it allows researchers to explore model vulnerabilities and improve robustness against common text errors. This library is a valuable tool for those looking to enhance their text datasets and build more resilient NLP models.