Categorizing and POS Tagging with NLTK Python
Copyright @ 2019 Learntek. All Rights Reserved. 3
Natural language processing (NLP) is a subfield of computer science, information
engineering, and artificial intelligence concerned with the interactions between
computers and human (natural) languages, in particular how to program computers
to process and analyze large amounts of natural language data.
NLP = Computer Science + AI + Computational Linguistics
Put another way, natural language processing is the ability of computer software
to understand human language as it is spoken. NLP is one of the components of
artificial intelligence (AI).
About NLTK:
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and
programs for symbolic and statistical natural language processing (NLP) for English,
written in the Python programming language.
•It was developed by Steven Bird and Edward Loper in the Department of Computer
and Information Science at the University of Pennsylvania.
•It is a software package for manipulating linguistic data and performing NLP tasks.
•NLTK is intended to support research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information
retrieval, and machine learning.
•NLTK supports classification, tokenization, stemming, tagging, parsing, and
semantic reasoning functionalities.
•NLTK includes more than 50 corpora and lexical sources such as the Penn Treebank
Corpus, Open Multilingual Wordnet, Problem Report Corpus, and Lin’s Dependency
Thesaurus.
The process of classifying words into their parts of speech and labelling them
accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
Parts of speech are also known as word classes or lexical categories. The collection
of tags used for a particular task is known as a tag set.
Using a Tagger
A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a
part-of-speech tag to each word. The text must first be tokenized: tokenization is the
process of dividing a piece of text into smaller parts called tokens.
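To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's standard library. It is a rough illustration and not the algorithm behind NLTK's word_tokenize; the function name simple_tokenize and the sample sentence are made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; NLTK's word_tokenize
    handles many more cases (contractions, abbreviations, quotes).
    """
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, welcome to NLTK and Python!")
print(tokens)
```

Each word and each punctuation mark becomes its own token, which is the kind of list a tagger expects as input.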
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = word_tokenize("Hello welcome to the world of to learn Categorizing and POS Tagging with NLTK and Python")
>>> nltk.pos_tag(text)
OUTPUT:
[('Hello', 'NNP'), ('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'), ('to', 'TO'), ('learn', 'VB'), ('Categorizing',
'NNP'), ('and', 'CC'), ('POS', 'NNP'), ('Tagging', 'NNP'), ('with', 'IN'), ('NLTK', 'NNP'), ('and', 'CC'), ('Python', 'NNP')]
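A tagged list like the one above is easier to read when the words are grouped by tag. The sketch below copies the output shown above into a literal list, so it runs without NLTK; the variable names are chosen for this example.

```python
from collections import defaultdict

# Tagged output copied from the example above
tagged = [('Hello', 'NNP'), ('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'),
          ('world', 'NN'), ('of', 'IN'), ('to', 'TO'), ('learn', 'VB'),
          ('Categorizing', 'NNP'), ('and', 'CC'), ('POS', 'NNP'),
          ('Tagging', 'NNP'), ('with', 'IN'), ('NLTK', 'NNP'),
          ('and', 'CC'), ('Python', 'NNP')]

# Collect the words seen under each tag, in order of appearance
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

for tag in sorted(words_by_tag):
    print(tag, words_by_tag[tag])
```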
In the above output, and is CC, a coordinating conjunction;
learn is VB, a verb in its base form;
of and with are IN, prepositions.
NLTK provides documentation for each tag, which can be queried using the tag name:
>>> nltk.help.upenn_tagset('RB')
RB: adverb
occasionally unabatingly maddeningly adventurously professedly
stirringly prominently technologically magisterially predominately
swiftly fiscally pitilessly …
>>> nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist …
>>> nltk.help.upenn_tagset('NNP')
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool …
>>> nltk.help.upenn_tagset('CC')
CC: conjunction, coordinating
& 'n and both but either et for less minus neither nor or plus so
therefore times v. versus vs. whether yet
>>> nltk.help.upenn_tagset('DT')
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
>>> nltk.help.upenn_tagset('TO')
TO: "to" as preposition or infinitive marker
to
>>> nltk.help.upenn_tagset('VB')
VB: verb, base form
ask assemble assess assign assume atone attention avoid bake balkanize
bank begin behold believe bend benefit bevel beware bless boil bomb
boost brace break bring broil brush build …
By default, the POS tagger in the NLTK library (nltk.pos_tag) outputs tags from the Penn
Treebank tag set. The list of POS tags is as follows, with examples of what each tag stands for.
•CC coordinating conjunction
•CD cardinal number
•DT determiner
•EX existential there (like: “there is” … think of it like “there exists”)
•FW foreign word
•IN preposition/subordinating conjunction
•JJ adjective ‘big’
•JJR adjective, comparative ‘bigger’
•JJS adjective, superlative ‘biggest’
•LS list marker 1)
•MD modal could, will
•NN noun, singular ‘desk’
•NNS noun plural ‘desks’
•NNP proper noun, singular ‘Harrison’
•NNPS proper noun, plural ‘Americans’
•PDT predeterminer ‘all the kids’
•POS possessive ending parent’s
•PRP personal pronoun I, he, she
•PRP$ possessive pronoun my, his, her
•RB adverb very, silently
•RBR adverb, comparative better
•RBS adverb, superlative best
•RP particle give up
•TO to go ‘to’ the store
•UH interjection, errrrrrrrm
•VB verb, base form take
•VBD verb, past tense took
•VBG verb, gerund/present participle taking
•VBN verb, past participle taken
•VBP verb, non-3rd person sing. present take
•VBZ verb, 3rd person sing. present takes
•WDT wh-determiner which
•WP wh-pronoun who, what
•WP$ possessive wh-pronoun whose
•WRB wh-adverb where, when
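Many of the tags listed above share a prefix: every noun tag starts with NN, every verb tag with VB, every adjective tag with JJ, and every adverb tag with RB. The sketch below uses that pattern to collapse fine-grained tags into coarse word classes. The function coarse_class and its class names are hypothetical and chosen for this example; they are not part of NLTK.

```python
def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse word class by its prefix.

    Hypothetical helper for illustration; tags that do not start with
    one of the four prefixes (for example DT or WRB) fall into "other".
    """
    prefixes = [("NN", "noun"), ("VB", "verb"),
                ("JJ", "adjective"), ("RB", "adverb")]
    for prefix, name in prefixes:
        if tag.startswith(prefix):
            return name
    return "other"

for tag in ["NNPS", "VBZ", "JJR", "RBS", "DT"]:
    print(tag, "->", coarse_class(tag))
```

A prefix check like this is handy when an analysis only needs to know that a word is a verb, not which verb form it is.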
Tagged Corpora
Representing Tagged Tokens
A tagged token is represented using a tuple consisting of the token and the tag. We
can create one of these special tuples from the standard string representation of a
tagged token, using the function str2tuple():
>>> tagged_token = nltk.tag.str2tuple('Learn/VB')
>>> tagged_token
('Learn', 'VB')
>>> tagged_token[0]
'Learn'
>>> tagged_token[1]
'VB'
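The same word/TAG convention extends to whole sentences. The sketch below parses a space-separated string of tagged tokens with the standard library only. The function parse_tagged and the sample string are made up for this example, and the code is a simplified stand-in rather than NLTK's own implementation of str2tuple.

```python
def parse_tagged(sentence, sep="/"):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for token in sentence.split():
        if sep not in token:
            # No tag attached: keep the word and record the tag as None
            pairs.append((token, None))
            continue
        # rpartition splits on the LAST separator, so '1/2/CD' -> ('1/2', 'CD')
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged("We/PRP learn/VBP tagging/NN")
print(pairs)
```

Splitting on the last separator matters for tokens that contain a slash themselves, such as fractions.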
Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part of speech.
The corpus readers expose these tags directly. Here is how the tagged words of the
Brown Corpus look, first with the corpus's own tags and then mapped to the universal tag set:
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
Part-of-Speech Tag Set
Tagged corpora use many different conventions for tagging words. The universal tag set
offers a common, simplified set of tags:
Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy
>>> from nltk.corpus import brown
>>> # Note: despite the variable name, this loads the 'adventure' category
>>> brown_news_tagged = brown.tagged_words(categories='adventure', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
Output
[('NOUN', 13354), ('VERB', 12274), ('.', 10929), ('DET', 8155), ('ADP', 7069), ('PRON', 5205), ('ADV', 3879), ('ADJ', 3364), ('PRT', 2436), ('CONJ', 2173),
('NUM', 466), ('X', 38)]
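nltk.FreqDist is built on Python's collections.Counter, so the same counting can be tried on a small list without downloading a corpus. The sketch below uses a tiny hand-tagged sample that is made up for illustration; its counts say nothing about the Brown Corpus.

```python
from collections import Counter

# Tiny hand-tagged sample (made up for illustration, universal tag set)
sample = [('The', 'DET'), ('old', 'ADJ'), ('man', 'NOUN'), ('saw', 'VERB'),
          ('a', 'DET'), ('dog', 'NOUN'), ('in', 'ADP'), ('the', 'DET'),
          ('park', 'NOUN'), ('.', '.')]

# Count how often each tag occurs, ignoring the words themselves
tag_counts = Counter(tag for (word, tag) in sample)
total = sum(tag_counts.values())

# Print each tag with its count and its share of all tokens
for tag, count in tag_counts.most_common():
    print(f"{tag:5} {count:3} {count / total:.0%}")
```

Reporting shares alongside raw counts makes it easier to compare corpora of different sizes.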
Nouns
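Nouns generally refer to people, places, things, or concepts, for example: woman,
Scotland, book, intelligence. The simplified noun tags in older NLTK material are N for
common nouns like book and NP for proper nouns like Scotland; the universal tag set used
here tags both as NOUN. The following finds which tags most often appear immediately before a noun: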
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
>>> fdist = nltk.FreqDist(noun_preceders)
>>> [tag for (tag, _) in fdist.most_common()]
['DET', 'ADJ', 'NOUN', 'ADP', '.', 'VERB', 'CONJ', 'NUM', 'ADV', 'PRON', 'PRT', 'X']
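The bigram step can be traced by hand on a short sentence. In the sketch below, zip pairs each tagged token with its successor, which is the same pairing nltk.bigrams produces here. The sample sentence is hand-tagged and made up for illustration.

```python
from collections import Counter

# Tiny hand-tagged sample (made up for illustration, universal tag set)
sample = [('The', 'DET'), ('old', 'ADJ'), ('man', 'NOUN'), ('saw', 'VERB'),
          ('a', 'DET'), ('dog', 'NOUN'), ('in', 'ADP'), ('the', 'DET'),
          ('park', 'NOUN'), ('.', '.')]

# Pair every token with the token that follows it
pairs = list(zip(sample, sample[1:]))

# Keep the tag of the first token whenever the second token is a noun
noun_preceders = [a[1] for (a, b) in pairs if b[1] == 'NOUN']
counts = Counter(noun_preceders)
print(counts.most_common())
```

In this sample the nouns man, dog, and park are preceded by an adjective and two determiners, so DET comes out on top.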
Verbs
The following looks for verbs in news text (the Wall Street Journal sample of the Penn
Treebank) and sorts them by frequency:
>>> wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [wt[0] for (wt, _) in word_tag_fd.most_common(200) if wt[1] == 'VERB']
['is', 'said', 'was', 'are', 'be', 'has', 'have', 'will', 'says', 'would', 'were', 'had', 'been', 'could', "'s", 'can', 'do', 'say', 'make',
'may', 'did', 'rose', 'made', 'does', 'expected', 'buy', 'take', 'get']
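The key idea in the verb listing is that whole (word, tag) pairs are counted, and only afterwards filtered by tag. The sketch below repeats that pattern with collections.Counter on a short hand-tagged sample, which is made up for illustration and unrelated to the Treebank data.

```python
from collections import Counter

# Tiny hand-tagged sample (made up for illustration, universal tag set)
sample = [('He', 'PRON'), ('said', 'VERB'), ('she', 'PRON'), ('is', 'VERB'),
          ('here', 'ADV'), ('.', '.'), ('She', 'PRON'), ('said', 'VERB'),
          ('so', 'ADV'), ('.', '.')]

# Count each (word, tag) pair as a single item
word_tag_counts = Counter(sample)

# Walk the pairs from most to least frequent and keep only the verbs
verbs = [wt[0] for (wt, _) in word_tag_counts.most_common() if wt[1] == 'VERB']
print(verbs)
```

Counting pairs rather than bare words keeps a word's different uses apart when it can take more than one tag.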
For more online training courses, please contact:
Email : info@learntek.org
USA : +1734 418 2465
India : +91 40 4018 1306
+91 7799713624
