
N-Gram Language Modelling with NLTK

Last Updated : 01 Aug, 2025

Language modelling involves determining the probability of a sequence of words. It is fundamental to many Natural Language Processing (NLP) applications such as speech recognition, machine translation and spam filtering, where predicting or ranking the likelihood of phrases and sentences is crucial.

N-gram

An N-gram is a contiguous sequence of n items from a given sample of text or speech, collected from a text or speech corpus. The items can be:

  • Words like “This”, “article”, “is”, “on”, “NLP” → unigrams
  • Word pairs like “This article”, “article is”, “is on”, “on NLP” → bigrams
  • Triplets (trigrams) or larger combinations, as generated in the short sketch below
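
A minimal sketch of how these can be produced with NLTK's ngrams helper (the example sentence is the one above; plain whitespace splitting is used here to keep the snippet self-contained):

Python
from nltk import ngrams

sentence = "This article is on NLP"
tokens = sentence.split()  # simple whitespace split; avoids needing the punkt tokenizer

# n = 1, 2, 3 give unigrams, bigrams and trigrams respectively
for n in (1, 2, 3):
    print(n, list(ngrams(tokens, n)))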

N-gram Language Model

N-gram models predict the probability of a word given the previous n−1 words. For example, a trigram model uses the preceding two words to predict the next word:

Goal: Calculate P(w | h), the probability that the next word is w given the context/history h.

Example: For the phrase: “This article is on…”, if we want to predict the likelihood of “NLP” as the next word:

P(\text{NLP} \mid \text{This, article, is, on})
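
In practice this probability is estimated from corpus counts. For a trigram model (the case implemented later in this article), the maximum likelihood estimate divides the count of the full trigram by the count of its two-word context:

P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}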

Chain Rule of Probability

The probability of a sequence of words is computed as:

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})
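
For example, applying the chain rule to the phrase above gives:

P(\text{This, article, is, on, NLP}) = P(\text{This}) \cdot P(\text{article} \mid \text{This}) \cdot P(\text{is} \mid \text{This, article}) \cdot P(\text{on} \mid \text{This, article, is}) \cdot P(\text{NLP} \mid \text{This, article, is, on})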

Markov Assumption

To reduce complexity, N-gram models assume the probability of a word depends only on the previous n−1 words.

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
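
For example, under this assumption the last factor of the chain-rule expansion above becomes:

P(\text{NLP} \mid \text{This, article, is, on}) \approx P(\text{NLP} \mid \text{on}) \quad \text{(bigram, } n = 2\text{)}

P(\text{NLP} \mid \text{This, article, is, on}) \approx P(\text{NLP} \mid \text{is, on}) \quad \text{(trigram, } n = 3\text{)}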

Evaluating Language Models

1. Entropy: Measures the uncertainty or information content in a distribution.

H(p) = \sum_x p(x) \cdot \left(-\log(p(x))\right)

Entropy is always non-negative.
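
As a quick illustration, the sketch below computes the entropy of a small made-up next-word distribution (the probabilities are invented purely for the example):

Python
import math

# Hypothetical next-word distribution (probabilities made up for illustration)
p = {'NLP': 0.5, 'python': 0.25, 'statistics': 0.25}

entropy = sum(prob * -math.log2(prob) for prob in p.values())
print(f"Entropy: {entropy:.3f} bits")  # 1.500 bits for this distribution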

2. Cross-Entropy: Measures how well a probability distribution predicts a sample from test data.

H(p, q) = -\sum_x p(x) \log(q(x))

Cross-entropy is always ≥ the entropy; the gap reflects how “surprised” the model is by the test data.
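
Continuing the toy example, the sketch below measures an imperfect model's estimate q against the true distribution p (both distributions are invented for illustration):

Python
import math

# True distribution p and an imperfect model q (both hypothetical)
p = {'NLP': 0.5, 'python': 0.25, 'statistics': 0.25}
q = {'NLP': 0.4, 'python': 0.4, 'statistics': 0.2}

cross_entropy = -sum(p[w] * math.log2(q[w]) for w in p)
print(f"Cross-entropy: {cross_entropy:.3f} bits")  # ≈ 1.572 bits, above the 1.500-bit entropy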

3. Perplexity: The exponential of the cross-entropy; lower values indicate a better model. For a bigram model evaluated on a test set W of N words:

\text{Perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
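
The sketch below applies this formula to a tiny made-up test sequence: given the bigram probability the model assigns to each of the N words, perplexity is the N-th root of the product of their inverses.

Python
# Hypothetical bigram probabilities P(w_i | w_{i-1}) for a 4-word test sequence
probs = [0.2, 0.1, 0.25, 0.05]

N = len(probs)
inv_product = 1.0
for p in probs:
    inv_product *= 1.0 / p

perplexity = inv_product ** (1.0 / N)
print(f"Perplexity: {perplexity:.2f}")  # ≈ 7.95 for these probabilities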

Implementing N-Gram Language Modelling in NLTK

  • words = nltk.word_tokenize(' '.join(reuters.words())): tokenizes the entire Reuters corpus into words
  • tri_grams = list(trigrams(words)): creates 3-word sequences from the tokenized words
  • model = defaultdict(lambda: defaultdict(lambda: 0)): initializes nested dictionary for trigram counts
  • model[(w1, w2)][w3] += 1: counts occurrences of third word w3 after (w1, w2)
  • model[w1_w2][w3] /= total_count: converts counts to probabilities
  • return max(next_word_probs, key=next_word_probs.get): returns the most likely next word based on highest probability
Python
import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

nltk.download('reuters')
nltk.download('punkt')

# Tokenize the Reuters corpus and build all 3-word sequences
words = nltk.word_tokenize(' '.join(reuters.words()))
tri_grams = list(trigrams(words))

# Nested dictionary: model[(w1, w2)][w3] = count of w3 following (w1, w2)
model = defaultdict(lambda: defaultdict(lambda: 0))
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Normalise counts into conditional probabilities P(w3 | w1, w2)
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count


def predict_next_word(w1, w2):
    # Return the most probable next word for the context (w1, w2)
    next_word_probs = model[w1, w2]
    if next_word_probs:
        return max(next_word_probs, key=next_word_probs.get)
    else:
        return "No prediction available"


print("Next Word:", predict_next_word('the', 'stock'))

Output:

Next Word: of
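
As a small extension (not part of the snippet above), the trained model can generate a short continuation by feeding its own predictions back in. The generate_text helper below is a hypothetical addition and assumes model and predict_next_word from the code above are already defined:

Python
def generate_text(w1, w2, num_words=5):
    # Slide the two-word context forward, appending the most likely next word each step
    output = [w1, w2]
    for _ in range(num_words):
        next_word = predict_next_word(output[-2], output[-1])
        if next_word == "No prediction available":
            break
        output.append(next_word)
    return ' '.join(output)


print(generate_text('the', 'stock'))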

Advantages

  • Simple and Fast: Easy to build and fast to run for small n.
  • Interpretable: Easy to understand and debug.
  • Good Baseline: Useful as a starting point for many NLP tasks.

Limitations

  • Limited Context: Only considers a few previous words, missing long-range dependencies.
  • Data Sparsity: Needs lots of data; rare n-grams are common as n increases.
  • High Memory: Bigger n-gram models require lots of storage.
  • Poor with Unseen Words: Struggles with new or rare words unless smoothing is applied, as sketched below.
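
As the last point notes, plain counts assign zero probability to unseen n-grams. Below is a minimal sketch of one common remedy, add-one (Laplace) smoothing; trigram_counts, context_counts and vocab_size are hypothetical names standing in for statistics gathered as in the implementation above:

Python
def smoothed_trigram_prob(w1, w2, w3, trigram_counts, context_counts, vocab_size):
    # Add-one (Laplace) smoothing: every possible trigram gets a small non-zero probability
    count_full = trigram_counts.get((w1, w2, w3), 0)
    count_context = context_counts.get((w1, w2), 0)
    return (count_full + 1) / (count_context + vocab_size)


# Toy counts for illustration
trigram_counts = {('the', 'stock', 'market'): 3}
context_counts = {('the', 'stock'): 5}

print(smoothed_trigram_prob('the', 'stock', 'market', trigram_counts, context_counts, vocab_size=10000))
print(smoothed_trigram_prob('the', 'stock', 'prices', trigram_counts, context_counts, vocab_size=10000))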
