
Classification of Text Documents Using Naive Bayes in Python
The Naive Bayes algorithm is a powerful tool for classifying a document or text into different categories based on the words it contains. As an example, if a document has words like 'humid', 'rainy', or 'cloudy', we can use the Bayes formula to check whether this document falls into the category of a 'sunny day' or a 'rainy day'.
Note that the Naive Bayes algorithm works on the assumption that the words of a document are independent of each other. Given the nuances of language, this is rarely true, which is why the algorithm's name contains the term 'naive'. Nonetheless, it performs well enough in practice.
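Under this independence assumption, a class score is simply the class prior multiplied by the likelihoods of the individual words. A minimal sketch of the decision rule, with made-up likelihoods for the weather example above (the numbers are illustrative, not learned from data):

```python
from math import prod

def naive_bayes_score(words, prior, word_probs):
    # Prior times the product of per-word likelihoods:
    # the words are treated as independent (the 'naive' assumption).
    return prior * prod(word_probs[w] for w in words)

# Hypothetical likelihoods P(word | class) for the weather example
p_rainy = {"humid": 0.30, "rainy": 0.40, "cloudy": 0.20}
p_sunny = {"humid": 0.10, "rainy": 0.05, "cloudy": 0.10}

doc = ["humid", "cloudy"]
rainy = naive_bayes_score(doc, 0.5, p_rainy)   # 0.5 * 0.30 * 0.20
sunny = naive_bayes_score(doc, 0.5, p_sunny)   # 0.5 * 0.10 * 0.10
print("rainy day" if rainy > sunny else "sunny day")
```

The class with the larger score wins; since the scores are only compared with each other, there is no need to normalize them into true probabilities.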
Algorithm
Step 1 - Input the number of documents, the text strings, and their corresponding classes. Split the text into word lists and input the string/text to be classified.
Step 2 - Create a list storing the frequency of every keyword in each document. Print this in tabular form using the prettytable library, naming the headings as required.
Step 3 - Count the total number of words and documents belonging to each class, Positive and Negative.
Step 4 - Find the probability of each word of the input string for the Positive class and round it off to 4-digit precision.
Step 5 - Find the probability of the Positive class using the Bayes formula and round it off to 8-digit precision.
Step 6 - Repeat the above two steps for the Negative class.
Step 7 - Compare the resulting probabilities of both classes and print the result.
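Steps 4 and 5 rely on Laplace (add-one) smoothing, so that a word missing from a class still gets a small nonzero probability instead of zeroing out the whole product: P(word | class) = (count + 1) / (vocabulary size + total words in class). A small sketch of that formula, using the counts from the example that follows (a 5-word Positive document and a vocabulary of 9 unique words):

```python
def smoothed_word_prob(count, class_word_total, vocab_size):
    # Laplace (add-one) smoothed estimate of P(word | class),
    # rounded to 4-digit precision as in Step 4.
    return round((count + 1) / (vocab_size + class_word_total), 4)

# 'love' appears once in the 5-word Positive document; the vocabulary has 9 words
print(smoothed_word_prob(1, 5, 9))   # (1 + 1) / (9 + 5) -> 0.1429
# An unseen word still gets a small nonzero probability
print(smoothed_word_prob(0, 5, 9))   # 1 / 14 -> 0.0714
```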
Example
In this example, for the sake of simplicity and understanding, we take only two documents containing one sentence each and perform Naive Bayes classification on a string that shares words with both sentences. Each document has a class, and our aim is to determine which class the string under test belongs to.
#Step 1 - Input the required data and split the text and keywords
import prettytable

total_documents = 2
text_list = ["they love laugh and pray", "without faith you suffer"]
category_list = ["Positive", "Negative"]

doc_class = []                     # [word_list, category] for each document
keywords = []                      # vocabulary across all documents
for i in range(total_documents):
    words = text_list[i].split()
    doc_class.append([words, category_list[i]])
    keywords.extend(words)
keywords = sorted(set(keywords))

to_find = "suffer without love laugh and pray"

#Step 2 - Make a frequency table for the keywords and print it
probability_table = []
for i in range(total_documents):
    row = [doc_class[i][0].count(k) for k in keywords]
    row.append(doc_class[i][1])    # class label goes in the last column
    probability_table.append(row)

Prob_Table = prettytable.PrettyTable()
Prob_Table.title = 'Frequency table'
Prob_Table.field_names = ['Document Number'] + keywords + ['Class/Category']
for i, row in enumerate(probability_table):
    Prob_Table.add_row([i + 1] + row)
print(Prob_Table)
print()

#Step 3 - Count the words and documents belonging to each class
totalpluswords = totalnegwords = totalplus = totalneg = 0
vocabulary = len(keywords)
for row in probability_table:
    if row[-1] == "Positive":
        totalplus += 1
        totalpluswords += sum(row[:-1])
    else:
        totalneg += 1
        totalnegwords += sum(row[:-1])

#Step 4 - Find the smoothed probability of each word for the Positive class
temp = []
for word in to_find.split():
    x = keywords.index(word)
    count = sum(row[x] for row in probability_table if row[-1] == "Positive")
    temp.append(float(format((count + 1) / (vocabulary + totalpluswords), ".4f")))

print("Probabilities of each word in the 'Positive' category are: ")
for word, p in zip(to_find.split(), temp):
    print(f"P({word}/+) = {p}")
print()

#Step 5 - Find the probability of the Positive class using the Bayes formula
prob_pos = totalplus / (totalplus + totalneg)
for p in temp:
    prob_pos = prob_pos * p
print("Probability of text in 'Positive' class is :", format(prob_pos, ".8f"))
print()

#Step 6 - Repeat the above two steps for the Negative class
temp = []
for word in to_find.split():
    x = keywords.index(word)
    count = sum(row[x] for row in probability_table if row[-1] == "Negative")
    temp.append(float(format((count + 1) / (vocabulary + totalnegwords), ".4f")))

print("Probabilities of each word in the 'Negative' category are: ")
for word, p in zip(to_find.split(), temp):
    print(f"P({word}/-) = {p}")
print()

prob_neg = totalneg / (totalplus + totalneg)
for p in temp:
    prob_neg = prob_neg * p
print("Probability of text in 'Negative' class is :", format(prob_neg, ".8f"))
print()

#Step 7 - Compare the probabilities and print the result
if prob_pos > prob_neg:
    print(f"By Naive Bayes Classification, we can conclude that the given text belongs to the 'Positive' class with probability {format(prob_pos, '.8f')}")
else:
    print(f"By Naive Bayes Classification, we can conclude that the given text belongs to the 'Negative' class with probability {format(prob_neg, '.8f')}")
We iterate over each document to collect its keywords in a list, count the frequency of every keyword per document, and print the result as a frequency table. The code then counts the number of words and documents belonging to each class and determines the size of the vocabulary of unique keywords.
Next, we calculate the Laplace-smoothed probability of each keyword of the input text in the Positive category by counting its occurrences in Positive documents, and store the results in a list. We then calculate the probability that the input text belongs to the Positive category using the Bayes formula. Similarly, we calculate the probability of each keyword in the Negative category and the corresponding class probability. Finally, we compare the probabilities of both categories and report the category with the higher probability.
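For comparison, the same procedure can be condensed with collections.Counter. This is a compact sketch of the identical Laplace-smoothed computation, not a drop-in replacement for the step-by-step code above:

```python
from collections import Counter

def classify(docs, labels, text):
    # Naive Bayes with Laplace smoothing over (document, label) pairs.
    vocab = {w for d in docs for w in d.split()}
    scores = {}
    for label in set(labels):
        class_words = Counter()          # word frequencies for this class
        n_docs = 0
        for d, l in zip(docs, labels):
            if l == label:
                class_words.update(d.split())
                n_docs += 1
        total = sum(class_words.values())
        score = n_docs / len(docs)       # class prior
        for w in text.split():
            score *= (class_words[w] + 1) / (len(vocab) + total)
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["they love laugh and pray", "without faith you suffer"]
labels = ["Positive", "Negative"]
print(classify(docs, labels, "suffer without love laugh and pray"))
```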
Output
Probabilities of each word in the 'Positive' category are: 
P(suffer/+) = 0.0714
P(without/+) = 0.0714
P(love/+) = 0.1429
P(laugh/+) = 0.1429
P(and/+) = 0.1429
P(pray/+) = 0.1429

Probability of text in 'Positive' class is : 0.00000106

Probabilities of each word in the 'Negative' category are: 
P(suffer/-) = 0.1538
P(without/-) = 0.1538
P(love/-) = 0.0769
P(laugh/-) = 0.0769
P(and/-) = 0.0769
P(pray/-) = 0.0769

Probability of text in 'Negative' class is : 0.00000041

By Naive Bayes Classification, we can conclude that the given text belongs to the 'Positive' class with probability 0.00000106
Conclusion
The Naive Bayes algorithm is one that works very well without much training data. However, for new words that are not present in any training document, plain frequency counts would give zero probabilities, which is why Laplace smoothing is applied above. The algorithm finds great use in real-time predictions and filtering-based functionalities. Other such classification algorithms include Logistic Regression, Decision Trees, and Random Forests.