
N-Gram Language Modelling with NLTK

Last Updated : 01 Aug, 2025

Language modelling involves determining the probability of a sequence of words. It is fundamental to many Natural Language Processing (NLP) applications such as speech recognition, machine translation and spam filtering, where predicting or ranking the likelihood of phrases and sentences is crucial.

N-gram

An N-gram is a contiguous sequence of n items from a given sample of text or speech, collected from a text or speech corpus. The items can be:

  • Words like “This”, “article”, “is”, “on”, “NLP” → unigrams
  • Word pairs like “This article”, “article is”, “is on”, “on NLP” → bigrams
  • Triplets (trigrams) or larger combinations, as generated in the short sketch below
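
A minimal sketch of how these can be produced with NLTK's ngrams helper (the example sentence is the one above; plain whitespace splitting is used here to keep the snippet self-contained):

Python
from nltk import ngrams

sentence = "This article is on NLP"
tokens = sentence.split()  # simple whitespace split; avoids needing the punkt tokenizer

# n = 1, 2, 3 give unigrams, bigrams and trigrams respectively
for n in (1, 2, 3):
    print(n, list(ngrams(tokens, n)))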

N-gram Language Model

N-gram models predict the probability of a word given the previous n−1 words. For example, a trigram model uses the preceding two words to predict the next word:

Goal: Calculate P(w | h), the probability that the next word is w given the context/history h.

Example: For the phrase: “This article is on…”, if we want to predict the likelihood of “NLP” as the next word:

P(\text{NLP} \mid \text{This, article, is, on})
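
In practice this probability is estimated from corpus counts. For a trigram model (the case implemented later in this article), the maximum likelihood estimate divides the count of the full trigram by the count of its two-word context:

P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}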

Chain Rule of Probability

The probability of a sequence of words is computed as:

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})
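
For example, applying the chain rule to the phrase above gives:

P(\text{This, article, is, on, NLP}) = P(\text{This}) \cdot P(\text{article} \mid \text{This}) \cdot P(\text{is} \mid \text{This, article}) \cdot P(\text{on} \mid \text{This, article, is}) \cdot P(\text{NLP} \mid \text{This, article, is, on})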

Markov Assumption

To reduce complexity, N-gram models assume the probability of a word depends only on the previous n−1 words.

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
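
For example, under this assumption the last factor of the chain-rule expansion above becomes:

P(\text{NLP} \mid \text{This, article, is, on}) \approx P(\text{NLP} \mid \text{on}) \quad \text{(bigram, } n = 2\text{)}

P(\text{NLP} \mid \text{This, article, is, on}) \approx P(\text{NLP} \mid \text{is, on}) \quad \text{(trigram, } n = 3\text{)}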

Evaluating Language Models

1. Entropy: Measures the uncertainty or information content in a distribution.

H(p) = \sum_x p(x) \cdot \left(-\log(p(x))\right)

Entropy is always non-negative.
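
As a quick illustration, the sketch below computes the entropy of a small made-up next-word distribution (the probabilities are invented purely for the example):

Python
import math

# Hypothetical next-word distribution (probabilities made up for illustration)
p = {'NLP': 0.5, 'python': 0.25, 'statistics': 0.25}

entropy = sum(prob * -math.log2(prob) for prob in p.values())
print(f"Entropy: {entropy:.3f} bits")  # 1.500 bits for this distribution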

2. Cross-Entropy: Measures how well a probability distribution predicts a sample from test data.

H(p, q) = -\sum_x p(x) \log(q(x))

Cross-entropy is always ≥ the entropy; the gap reflects how “surprised” the model is by the test data.
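
Continuing the toy example, the sketch below measures an imperfect model's estimate q against the true distribution p (both distributions are invented for illustration):

Python
import math

# True distribution p and an imperfect model q (both hypothetical)
p = {'NLP': 0.5, 'python': 0.25, 'statistics': 0.25}
q = {'NLP': 0.4, 'python': 0.4, 'statistics': 0.2}

cross_entropy = -sum(p[w] * math.log2(q[w]) for w in p)
print(f"Cross-entropy: {cross_entropy:.3f} bits")  # ≈ 1.572 bits, above the 1.500-bit entropy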

3. Perplexity: The exponential of the cross-entropy; lower values indicate a better model. For a bigram model evaluated on a test set W of N words:

\text{Perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
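
The sketch below applies this formula to a tiny made-up test sequence: given the bigram probability the model assigns to each of the N words, perplexity is the N-th root of the product of their inverses.

Python
# Hypothetical bigram probabilities P(w_i | w_{i-1}) for a 4-word test sequence
probs = [0.2, 0.1, 0.25, 0.05]

N = len(probs)
inv_product = 1.0
for p in probs:
    inv_product *= 1.0 / p

perplexity = inv_product ** (1.0 / N)
print(f"Perplexity: {perplexity:.2f}")  # ≈ 7.95 for these probabilities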

Implementing N-Gram Language Modelling in NLTK

  • words = nltk.word_tokenize(' '.join(reuters.words())): tokenizes the entire Reuters corpus into words
  • tri_grams = list(trigrams(words)): creates 3-word sequences from the tokenized words
  • model = defaultdict(lambda: defaultdict(lambda: 0)): initializes nested dictionary for trigram counts
  • model[(w1, w2)][w3] += 1: counts occurrences of third word w3 after (w1, w2)
  • model[w1_w2][w3] /= total_count: converts counts to probabilities
  • return max(next_word_probs, key=next_word_probs.get): returns the most likely next word based on highest probability
Python
import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

nltk.download('reuters')
nltk.download('punkt')

# Tokenize the Reuters corpus and build all 3-word sequences
words = nltk.word_tokenize(' '.join(reuters.words()))
tri_grams = list(trigrams(words))

# Nested dictionary: model[(w1, w2)][w3] = count of w3 following (w1, w2)
model = defaultdict(lambda: defaultdict(lambda: 0))
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Normalise counts into conditional probabilities P(w3 | w1, w2)
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count


def predict_next_word(w1, w2):
    # Return the most probable next word for the context (w1, w2)
    next_word_probs = model[w1, w2]
    if next_word_probs:
        return max(next_word_probs, key=next_word_probs.get)
    else:
        return "No prediction available"


print("Next Word:", predict_next_word('the', 'stock'))

Output:

Next Word: of
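
As a small extension (not part of the snippet above), the trained model can generate a short continuation by feeding its own predictions back in. The generate_text helper below is a hypothetical addition and assumes model and predict_next_word from the code above are already defined:

Python
def generate_text(w1, w2, num_words=5):
    # Slide the two-word context forward, appending the most likely next word each step
    output = [w1, w2]
    for _ in range(num_words):
        next_word = predict_next_word(output[-2], output[-1])
        if next_word == "No prediction available":
            break
        output.append(next_word)
    return ' '.join(output)


print(generate_text('the', 'stock'))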

Advantages

  • Simple and Fast: Easy to build and fast to run for small n.
  • Interpretable: Easy to understand and debug.
  • Good Baseline: Useful as a starting point for many NLP tasks.

Limitations

  • Limited Context: Only considers a few previous words, missing long-range dependencies.
  • Data Sparsity: Needs lots of data; rare n-grams are common as n increases.
  • High Memory: Bigger n-gram models require lots of storage.
  • Poor with Unseen Words: Struggles with new or rare words unless smoothing is applied, as sketched below.
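
As the last point notes, plain counts assign zero probability to unseen n-grams. Below is a minimal sketch of one common remedy, add-one (Laplace) smoothing; trigram_counts, context_counts and vocab_size are hypothetical names standing in for statistics gathered as in the implementation above:

Python
def smoothed_trigram_prob(w1, w2, w3, trigram_counts, context_counts, vocab_size):
    # Add-one (Laplace) smoothing: every possible trigram gets a small non-zero probability
    count_full = trigram_counts.get((w1, w2, w3), 0)
    count_context = context_counts.get((w1, w2), 0)
    return (count_full + 1) / (count_context + vocab_size)


# Toy counts for illustration
trigram_counts = {('the', 'stock', 'market'): 3}
context_counts = {('the', 'stock'): 5}

print(smoothed_trigram_prob('the', 'stock', 'market', trigram_counts, context_counts, vocab_size=10000))
print(smoothed_trigram_prob('the', 'stock', 'prices', trigram_counts, context_counts, vocab_size=10000))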
