Tokenizing text data
When we work with text, we need to break it down into smaller pieces for analysis. This is where tokenization comes in. Tokenization is the process of dividing text into a set of pieces, such as words or sentences, which are called tokens. Depending on the task at hand, we can also define our own rules for splitting text into tokens (we will sketch one way to do this at the end of this section). Let's look at how to tokenize input text using NLTK.
Create a new Python file and import the following packages:
from nltk.tokenize import sent_tokenize, \
word_tokenize, WordPunctTokenizer
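Note that NLTK's sentence and word tokenizers rely on the pretrained Punkt models. If that data has not been downloaded on your machine yet, the calls below will typically raise a LookupError; a one-time download fixes this:
import nltk

# Download the pretrained Punkt sentence tokenizer models (one-time setup)
nltk.download('punkt')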
Define the input text that will be used for tokenization:
# Define input text
input_text = "Do you know how tokenization works? It's actually \
quite interesting! Let's analyze a couple of sentences and \
figure it out."
Divide the input text into sentence tokens:
# Sentence tokenizer
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))
Divide the input text into word tokens:
# Word tokenizer
print("\nWord tokenizer:")
print(word_tokenize(input_text))
Divide the input text into word tokens using the WordPunct tokenizer:
# WordPunct tokenizer
print("\nWord punct tokenizer:")
print(WordPunctTokenizer().tokenize(input_text))
The full code is given in the file tokenizer.py. If you run the code, you will get the following output:

Figure 1: Tokenizers output
The sentence tokenizer divides the input text into sentences, while the other two tokenizers split it into words. The two word tokenizers behave differently when it comes to punctuation: for example, the word "It's" is split differently by the WordPunct tokenizer than by the regular word tokenizer.
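To make the difference concrete, here is a minimal comparison of how the two word tokenizers handle a contraction; the expected outputs shown in the comments reflect NLTK's default behavior and may vary slightly between versions:
from nltk.tokenize import word_tokenize, WordPunctTokenizer

# word_tokenize keeps the contraction suffix attached to the apostrophe
print(word_tokenize("It's"))
# Expected: ['It', "'s"]

# WordPunctTokenizer splits at every boundary between word characters and punctuation
print(WordPunctTokenizer().tokenize("It's"))
# Expected: ['It', "'", 's']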
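As mentioned earlier, we can also define our own splitting rules. One way to sketch this is with NLTK's RegexpTokenizer, which takes a regular expression describing what counts as a token; the pattern below is just an illustration that keeps alphanumeric chunks and drops punctuation:
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters as tokens, discarding punctuation
custom_tokenizer = RegexpTokenizer(r'\w+')
print(custom_tokenizer.tokenize("Do you know how tokenization works?"))
# Expected: ['Do', 'you', 'know', 'how', 'tokenization', 'works']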