LSTM Based Poetry Generation Using NLP in Python
Last Updated: 17 May, 2024
Natural Language Generation (NLG), the use of models to produce natural-language text, is one of the major tasks in Conversational AI. In this article, we will get hands-on with NLG by building an LSTM-based poetry generator.
Note: Readers are expected to be familiar with LSTMs. If you are not, read an in-depth introduction to LSTM networks before continuing.
Dataset
The dataset used to build the model was obtained from Kaggle. It is a compilation of poems by numerous poets, stored as a single text file. We can easily vectorize this data and then train an LSTM model on it.
[Image: an excerpt from the dataset]
Building the Text Generator
The text generator can be built in the following simple steps:
Step 1. Import Necessary Libraries
First, we need to import the necessary libraries. We are going to use TensorFlow with Keras to build the Bidirectional LSTM.
If any of these libraries is not installed, install it by running the pip install [package-name] command in the terminal.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow.keras.utils as ku
from wordcloud import WordCloud
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
Step 2. Loading the Dataset and Exploratory Data Analysis
Now, we'll load our dataset by reading the text file with Python's built-in open(). Next, we perform some Exploratory Data Analysis (EDA) to get to know the data better. Since we are dealing with text, a word cloud is a quick way to see which words dominate the corpus.
Python3
# Reading the text data file
data = open('poem.txt', encoding="utf8").read()
# EDA: Generating WordCloud to visualize
# the text
wordcloud = WordCloud(max_font_size=50,
                      max_words=100,
                      background_color="black").generate(data)
# Plotting the WordCloud
plt.figure(figsize=(8, 4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig("WordCloud.png")
plt.show()
Output : [word cloud image, also saved as WordCloud.png]
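A word cloud sizes each word by how often it occurs, so the same information can also be read off as plain counts. The following is a minimal sketch that uses only the standard library. The helper name top_words and the sample string are illustrative assumptions, and WordCloud applies its own tokenization and filtering, so the counts will not match it exactly.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text with their counts."""
    # Treat runs of letters and apostrophes as words (a simplification)
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Illustrative sample; in the article's setting this would be `data`
sample = "stay, i said\nto the cut flowers.\nstay, i said to the spider,"
print(top_words(sample, n=5))
```

Running top_words(data, 20) on the full text gives a quick textual counterpart to the plot.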
Step 3. Creating the Corpus
Now, all our data sits in one large text file. Feeding the whole text to the model as a single sequence is impractical, because that sequence would be extremely long and hard to learn from. Instead, we split the text into lines and use each line as a separate unit from which to build training sequences for our model.
Python3
# Generating the corpus by
# splitting the text into lines
corpus = data.lower().split("\n")
print(corpus[:10])
Output :
['stay, i said',
'to the cut flowers.',
'they bowed',
'their heads lower.',
'stay, i said to the spider,',
'who fled.',
'stay, leaf.',
'it reddened,',
'embarrassed for me and itself.',
'stay, i said to my body.']
Step 4. Fitting the Tokenizer on the Corpus
Before vectorizing the text, we need to fit a Keras Tokenizer on the entire corpus so that it learns the vocabulary, i.e. a mapping from each distinct word to an integer index starting at 1.
Python3
# Fitting the Tokenizer on the Corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
# Vocabulary count of the corpus
total_words = len(tokenizer.word_index)
print("Total Words:", total_words)
Output :
Total Words: 3807
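To make the role of the vocabulary concrete, here is a plain-Python sketch of the kind of word-to-index mapping the Tokenizer learns: words are lowercased, punctuation is stripped, and more frequent words get smaller indices, starting at 1 so that 0 stays free for padding. The function build_word_index is a hypothetical stand-in written for illustration, not the Keras implementation, and its punctuation handling is a simplification.

```python
import string
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer index, most frequent word first."""
    # Replace punctuation with spaces (apostrophes are kept)
    strip = str.maketrans({c: " " for c in string.punctuation if c != "'"})
    counts = Counter()
    for line in lines:
        counts.update(line.lower().translate(strip).split())
    # Sort by descending count; ties keep first-seen order (stable sort)
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Indices start at 1; index 0 is left for padding
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


word_index = build_word_index(["stay, i said", "to the cut flowers.",
                               "stay, i said to the spider,"])
print(word_index)
print("Total Words:", len(word_index))
```

Because index 0 is never assigned to a word, the model later needs total_words + 1 output classes.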
Step 5. Generating Embeddings/Vectorization
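Now we will vectorize each line in our corpus. Since Machine/Deep Learning models cannot consume raw text, this step is essential. First, we convert each line to a sequence of integer word indices using the Tokenizer's texts_to_sequences() method. From each line we then build every prefix of two or more tokens (an n-gram sequence), so that each prefix supplies one training example. Next, we compute the length of the longest sequence and pad all sequences to that length with leading zeros so that they have equal length. Finally, we split each padded sequence into its input (all tokens but the last) and its label (the last token), and one-hot encode the label. The dense embedding vectors themselves are learned later by the model's Embedding layer.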
Python3
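# Converting the text into padded n-gram sequences
input_sequences = []
for line in corpus:
    # Map each word of the line to its integer index
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Every prefix of length >= 2 becomes one training sequence
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
# Left-pad all sequences with zeros to the longest length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences,
                                         maxlen=max_sequence_len,
                                         padding='pre'))
# All tokens but the last are the input; the last token is the label
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
# One-hot encode the labels (index 0 is reserved for padding)
label = ku.to_categorical(label, num_classes=total_words+1)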
This is what the padded input sequences look like:
array([[ 0, 0, 0, ..., 0, 0, 266],
[ 0, 0, 0, ..., 0, 266, 3],
[ 0, 0, 0, ..., 0, 0, 4],
...,
[ 0, 0, 0, ..., 8, 3807, 15],
[ 0, 0, 0, ..., 3807, 15, 4],
[ 0, 0, 0, ..., 15, 4, 203]], dtype=int32)
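The prefix-and-pad procedure is easier to follow on a tiny example. The sketch below reproduces it in plain Python for a single three-word line. The helper names ngram_prefixes and pad_pre and the token values are illustrative assumptions, not part of Keras.

```python
def ngram_prefixes(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


def pad_pre(seq, maxlen):
    """Left-pad seq with zeros (or keep only its last maxlen tokens)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


tokens = [4, 7, 2]                      # hypothetical indices for a 3-word line
prefixes = ngram_prefixes(tokens)       # [[4, 7], [4, 7, 2]]
maxlen = max(len(p) for p in prefixes)  # 3
padded = [pad_pre(p, maxlen) for p in prefixes]

# Split each padded row into input tokens and the word to predict
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(predictors)  # [[0, 4], [4, 7]]
print(labels)      # [7, 2]
```

Each row therefore asks the model one question: given the words so far, which word comes next?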
Step 6. Building the Bi-directional LSTM Model
By now, we are done with all the pre-processing steps required to feed the text to our model. It's time to start building the model. Since this is a text generation use case in which context plays an important role, we will use a Bi-directional LSTM layer, which reads each input sequence in both directions, followed by a second LSTM layer.
Python3
# Building a Bi-Directional LSTM Model
model = Sequential()
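model.add(Embedding(total_words+1, 100,
                    input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# Writing total_words+1/2 here would evaluate to total_words + 0.5
# (operator precedence), not half the vocabulary. The layer uses
# total_words units, as the summary shows; use (total_words + 1) // 2
# if a half-size layer is wanted.
model.add(Dense(total_words, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words+1, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])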
print(model.summary())
The summary of the model is as follows:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 15, 100) 380800
bidirectional (Bidirectiona (None, 15, 300) 301200
l)
dropout (Dropout) (None, 15, 300) 0
lstm_1 (LSTM) (None, 100) 160400
dense (Dense) (None, 3807) 384507
dense_1 (Dense) (None, 3808) 14500864
=================================================================
Total params: 15,727,771
Trainable params: 15,727,771
Non-trainable params: 0
_________________________________________________________________
None
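The parameter counts in the summary can be checked by hand. The sketch below assumes the standard parameter formulas for LSTM and Dense layers. The helper names lstm_params and dense_params are illustrative, and the layer sizes are taken from the summary above.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Parameters of a Dense layer: a weight matrix plus a bias vector."""
    return input_dim * units + units


vocab = 3807 + 1                      # total_words + 1 (index 0 is padding)
counts = {
    "embedding": vocab * 100,
    "bidirectional": 2 * lstm_params(100, 150),   # forward + backward LSTM
    "lstm_1": lstm_params(300, 100),              # input is 2 * 150 wide
    "dense": dense_params(100, 3807),
    "dense_1": dense_params(3807, vocab),
}
for name, n in counts.items():
    print(f"{name:15s}{n:>12,}")
print(f"{'total':15s}{sum(counts.values()):>12,}")
```

Almost all of the roughly 15.7 million parameters sit in the final Dense layer, because both its input and its output are about as wide as the vocabulary.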
The model follows a next-word-prediction approach: we input a seed text, and the model generates poetry by repeatedly predicting the next word. This is why the final layer uses a softmax activation, which is the usual choice for multi-class classification: it outputs a probability for every word in the vocabulary.
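As a small illustration of what the softmax layer does, the sketch below turns a handful of raw scores into probabilities and then picks the most likely index. The scores are made-up values for a four-entry vocabulary, not output from the trained model.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a 4-entry vocabulary (index 0 = padding)
scores = [0.1, 2.0, 0.5, 1.0]
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
print([round(p, 3) for p in probs])
print("predicted index:", predicted)
```

Taking the index with the highest probability is what np.argmax does in the generation step.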
Step 7. Model Training
Having built the model architecture, we'll now train it on our pre-processed text. Here, we have trained our model for 150 Epochs.
Python3
history = model.fit(predictors, label, epochs=150, verbose=1)
The last few training epochs are shown below:
Epoch 145/150
510/510 [==============================] - 132s 258ms/step - loss: 3.3349 - accuracy: 0.8555
Epoch 146/150
510/510 [==============================] - 130s 254ms/step - loss: 3.2653 - accuracy: 0.8561
Epoch 147/150
510/510 [==============================] - 129s 253ms/step - loss: 3.1789 - accuracy: 0.8696
Epoch 148/150
510/510 [==============================] - 127s 250ms/step - loss: 3.1063 - accuracy: 0.8727
Epoch 149/150
510/510 [==============================] - 128s 251ms/step - loss: 3.0314 - accuracy: 0.8787
Epoch 150/150
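We see that a training accuracy of about 0.88 has been reached by epoch 149. Note that this is measured on the training data itself, since no validation set was held out, so it mainly reflects how well the model has fit the training corpus.
It is recommended that you train the model on a GPU-enabled machine. If your system does not have a GPU, you can use Google Colab or Kaggle notebooks.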
Step 8. Generating Text using the Built Model
In the final step, we will generate poetry using our model. As stated earlier, the model predicts one word at a time, so we need to give it some seed text to start from. Each predicted word is appended to the text, and the extended text is fed back in to predict the next word.
Python3
seed_text = "The world"
next_words = 25
for _ in range(next_words):
    # Encode the text so far and left-pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences(
        [token_list], maxlen=max_sequence_len-1,
        padding='pre')
    # Pick the index of the most probable next word
    predicted = int(np.argmax(model.predict(token_list,
                                            verbose=0), axis=-1)[0])
    # Map the index back to its word (index 0 is padding, so no word)
    output_word = tokenizer.index_word.get(predicted, "")
    # Append the word and feed the longer text back in
    seed_text += " " + output_word
print(seed_text)
Output :
The world seems bright and gay and laid them all from your lip and the
liffey from the bar blackwater white and free scholar vicar laundry laurel
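The generation loop can be tried out without a trained network. The sketch below keeps the same structure (encode the text so far, keep the most recent tokens, pick the next index, append the word) but replaces model.predict with a hypothetical lookup table, next_index, that maps the last token to a fixed successor. The vocabulary and the table are made up for illustration.

```python
# Toy vocabulary; index 0 is reserved for padding
word_index = {"the": 1, "world": 2, "seems": 3, "bright": 4}
index_word = {i: w for w, i in word_index.items()}

# Hypothetical stand-in for the trained model: last token -> next token
next_index = {1: 2, 2: 3, 3: 4, 4: 1}


def generate(seed_text, next_words, maxlen=3):
    """Greedily extend seed_text one word at a time."""
    text = seed_text
    for _ in range(next_words):
        # Encode known words, then keep only the last maxlen tokens
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        window = tokens[-maxlen:]
        window = [0] * (maxlen - len(window)) + window  # left-pad with zeros
        # "Predict" the next index from the most recent token
        predicted = next_index.get(window[-1], 0)
        # Index 0 (padding) has no word, so nothing is appended for it
        text += " " + index_word.get(predicted, "")
    return text


print(generate("The world", 4))  # The world seems bright the world
```

As in the real loop, only the last maxlen tokens reach the predictor, so words further back in a long seed text no longer influence the next word.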
We have now built a model from scratch that generates poetry from an input seed text. The results can likely be improved by using a larger training dataset and tuning the model's hyperparameters.