
Dictionary Based Tokenization in NLP

Last Updated : 30 Jul, 2025

In Natural Language Processing (NLP), dictionary-based tokenization is the process in which the text is split into tokens using a predefined dictionary of words, phrases and expressions. This method is useful when we need to treat multi-word expressions such as names or domain-specific phrases as a single token.

For example, in Named Entity Recognition (NER) the phrase "New York" should be recognized as one token, not two separate words ("New" and "York"). Dictionary-based tokenization makes this possible by referencing a predefined list of multi-word expressions. Unlike regular word tokenization, which simply splits a sentence into individual words, it groups specific terms together based on the dictionary entries. This ensures that important phrases are treated as a single unit, which matters for many NLP tasks.

How Does Dictionary-Based Tokenization Work?

In dictionary-based tokenization, the process of splitting text into tokens is guided by a predefined dictionary of words, phrases and multi-word expressions. Let's see how the process typically works (a minimal code sketch of the matching loop follows the list):

1. Input Text: We start with a string of text that needs to be tokenized. For example:

"San Francisco is a beautiful city."

2. Dictionary Lookup: Each word or multi-word expression in the input text is checked against the dictionary. If a word or phrase matches an entry in the dictionary, it is extracted as a single token.

3. Token Matching: If the word or phrase exists in the dictionary, it is grouped as a token. For example:

"San Francisco" will be treated as a single token.

4. Handling Unmatched Words: If a word is not found in the dictionary, it is left as is or further split into smaller components. This can involve:

  • Splitting a word into subwords or characters.
  • Keeping it as a single token if no dictionary match is found.

For example, if a word like "NLP" is not in the dictionary, it may be:

  • kept as a single unknown token,
  • split into smaller subword or character units, or
  • mapped to a special token if one is predefined in the dictionary.

5. Types of Dictionaries Used:

  1. Word Lists: These are basic dictionaries containing standard words.
  2. Subword Dictionaries: These include common prefixes, suffixes and smaller units that help in handling rare words or out-of-vocabulary (OOV) terms.
  3. Special Tokens: Predefined tokens for handling specific terms like numbers, punctuation or symbols (e.g., "!", "$").
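The matching described in steps 2 and 3 is typically implemented as a greedy, longest-match-first scan over the word sequence. Below is a minimal sketch of that idea in plain Python, illustrative only and not the NLTK tokenizer used later; the function name dictionary_tokenize and the max_len parameter are just chosen for this example.

Python
def dictionary_tokenize(words, dictionary, max_len=3):
    tokens = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest candidate phrase first, then shrink the window
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            # Dictionary phrase found: join it into a single token
            tokens.append("_".join(match))
            i += len(match)
        else:
            # Unmatched word is kept as-is
            tokens.append(words[i])
            i += 1
    return tokens

dictionary = {("san", "francisco"), ("united", "nations")}
words = "san francisco hosts the united nations summit".split()
print(dictionary_tokenize(words, dictionary))

Output:

['san_francisco', 'hosts', 'the', 'united_nations', 'summit']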

Steps for Implementing Dictionary-Based Tokenization

Let’s see the steps required to implement Dictionary-Based Tokenization in NLP using Python and NLTK (Natural Language Toolkit).

1. Importing Required Libraries

First, we import NLTK along with its word_tokenize and MWETokenizer utilities. The word_tokenize function relies on NLTK's pre-trained tokenizer models, which need to be downloaded once.

Python
import nltk
from nltk.tokenize import word_tokenize, MWETokenizer

# Download the tokenizer models required by word_tokenize (recent NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')

2. Preparing the Dictionary

The key part of dictionary-based tokenization is a predefined dictionary that contains multi-word expressions. Here we create a custom dictionary of phrases that should be treated as single tokens, such as place names and organizations. Since the preprocessing step below lowercases the text, the dictionary entries are stored in lowercase so that they match.

Python
custom_dict = [('san', 'francisco'), ('new', 'york'), ('united', 'nations')]

3. Preprocessing the Text

Before tokenizing, we clean the text. Depending on the task this may involve removing punctuation marks, stop words or other irrelevant characters; here we simply lowercase the text so that it matches the lowercase dictionary entries.

Python
def preprocess_text(text):
    # Lowercase the text so tokens match the lowercase dictionary entries
    text = text.lower()
    return text

sample_text = "San Francisco is a beautiful city. The United Nations meets regularly."
cleaned_text = preprocess_text(sample_text)

4. Tokenizing the Text

Next, we split the cleaned text into individual words using a basic tokenizer. At this stage every word is a separate token; the dictionary is applied in the next step to merge multi-word expressions back into single tokens.

Python
tokens = word_tokenize(cleaned_text)
print("Tokenized text:", tokens)

Output:

Tokenized text: ['san', 'francisco', 'is', 'a', 'beautiful', 'city', '.', 'the', 'united', 'nations', 'meets', 'regularly', '.']

5. Applying Dictionary-Based Tokenization

Now we apply dictionary-based tokenization. The MWETokenizer (Multi-Word Expression Tokenizer) in NLTK groups multi-word expressions from the predefined dictionary into single tokens, joining the matched words with an underscore by default.

Python
tokenizer = MWETokenizer(custom_dict)

tokenized_text = tokenizer.tokenize(tokens)
print("Dictionary-based tokenized text:", tokenized_text)

Output:

Dictionary-based tokenized text: ['san_francisco', 'is', 'a', 'beautiful', 'city', '.', 'the', 'united_nations', 'meets', 'regularly', '.']
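The matched phrases are joined with an underscore by default. The MWETokenizer also accepts a separator argument, so a different joining character can be used if preferred:

Python
# Join matched phrases with a space instead of the default underscore
space_tokenizer = MWETokenizer(custom_dict, separator=' ')
print(space_tokenizer.tokenize(tokens))

Output:

['san francisco', 'is', 'a', 'beautiful', 'city', '.', 'the', 'united nations', 'meets', 'regularly', '.']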

6. Handling Unmatched Tokens

In the tokenization process, words that are not found in the dictionary remain as individual tokens. This keeps the process flexible while still ensuring that multi-word expressions are tokenized correctly.

Python
# Collect every individual word that appears in a dictionary phrase
dictionary_words = {word for phrase in custom_dict for word in phrase}
unmatched_tokens = [token for token in tokens if token not in dictionary_words]
print("Unmatched Tokens:", unmatched_tokens)

Output:

Unmatched Tokens: ['is', 'a', 'beautiful', 'city', '.', 'the', 'meets', 'regularly', '.']

7. Example of Dictionary-Based Tokenization in Action

To see dictionary-based tokenization in action, let’s consider the sentence "San Francisco is part of the United Nations."

Python
sentence = "San Francisco is part of the United Nations."
cleaned_sentence = preprocess_text(sentence)
tokens = word_tokenize(cleaned_sentence)

tokenized_sentence = tokenizer.tokenize(tokens)

print("Tokenized sentence:", tokenized_sentence)

Output:

Tokenized sentence: ['san_francisco', 'is', 'part', 'of', 'the', 'united_nations', '.']

8. Customizing the Dictionary

If we're working with domain-specific text, we can keep expanding the dictionary with more multi-word expressions (again in lowercase, to match the preprocessed text), ensuring accurate tokenization in specialized applications.

Python
custom_dict.extend([('machine', 'learning'), ('natural', 'language', 'processing')])

tokenizer = MWETokenizer(custom_dict)
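Alternatively, NLTK's MWETokenizer provides an add_mwe() method for registering expressions one at a time on an existing tokenizer:

Python
# Register an additional multi-word expression without rebuilding the tokenizer
tokenizer.add_mwe(('deep', 'learning'))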

9. Visualizing Tokenization Output

To check the results, we can print how dictionary-based tokenization behaves on several sentences. This helps confirm whether multi-word expressions are accurately grouped into single tokens.

Python
sentences = [
    "San Francisco is a beautiful place.",
    "The United Nations is headquartered in New York.",
    "Machine Learning is a subset of Artificial Intelligence."
]

for sentence in sentences:
    cleaned_sentence = preprocess_text(sentence)
    tokens = word_tokenize(cleaned_sentence)
    tokenized_sentence = tokenizer.tokenize(tokens)
    print(f"Original: {sentence}")
    print(f"Tokenized: {tokenized_sentence}\n")

Output:

Original: San Francisco is a beautiful place.
Tokenized: ['san_francisco', 'is', 'a', 'beautiful', 'place', '.']

Original: The United Nations is headquartered in New York.
Tokenized: ['the', 'united_nations', 'is', 'headquartered', 'in', 'new_york', '.']

Original: Machine Learning is a subset of Artificial Intelligence.
Tokenized: ['machine_learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence', '.']

Advantages of Dictionary-Based Tokenization

  1. Handling Multi-Word Entities: This method works well for complex entities such as locations, names or other domain-specific terms that should remain intact.
  2. Efficiency: It is faster than more complex techniques that rely on machine learning models.
  3. Simplicity: The approach is easy to implement and doesn't require a large amount of training data, making it a good choice for smaller projects or real-time applications.

Challenges and Limitations

  1. Out-of-Vocabulary (OOV) Words: If a word or phrase isn’t included in the dictionary, it could be split incorrectly or missed entirely.
  2. Limited Coverage: The dictionary may not be comprehensive enough to handle all possible variations or new terms in the text.
  3. Ambiguity: Some words might have different meanings based on context. For example, "lead" could be a noun or verb and dictionary-based tokenization might struggle to handle such ambiguities.

By using dictionary-based tokenization, NLP systems can efficiently recognize and process multi-word expressions, enhancing their accuracy and performance across a wide range of language tasks.

