Using CountVectorizer to Extract Features from Text

CountVectorizer is a useful tool provided by the scikit-learn library in Python. It transforms a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts and we wish to convert each word in each text into a vector (for use in further text analysis).

Let us consider a few sample texts from a document (each as a list element):

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks."]

CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample. This can be visualized as follows:

                at  each  four  geek  geeks  geeksforgeeks  help  helps  many  one  other  two
document[0]      0     0     0     1      1              0     0      1     0    1      0    1
document[1]      0     0     1     0      2              0     1      0     0    0      0    1
document[2]      1     1     0     1      1              1     0      1     1    0      1    0

Key Observations:

- There are 12 unique words in the document, represented as the columns of the table.
- There are 3 text samples in the document, each represented as a row of the table.
- Every cell contains a number that represents the count of the word in that particular text.
- All words have been converted to lowercase.
- The words in the columns have been arranged alphabetically.

Inside CountVectorizer, these words are not stored as strings. Rather, they are given a particular index value. In this case, 'at' has index 0, 'each' has index 1, 'four' has index 2, and so on. Because most of the cells in the resulting matrix are zero, this way of representation is known as a sparse matrix: only the non-zero counts need to be stored.

Code: Python implementation of CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks"]

# Create a Vectorizer Object
vectorizer = CountVectorizer()

# Learn the vocabulary from the document
vectorizer.fit(document)

# Print the identified unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)

# Encode the document
vector = vectorizer.transform(document)

# Summarize the encoded texts
print("Encoded Document is:")
print(vector.toarray())

Output:

Vocabulary:  {'one': 9, 'geek': 3, 'helps': 7, 'two': 11, 'geeks': 4, 'help': 6, 'four': 2, 'each': 1, 'many': 8, 'other': 10, 'at': 0, 'geeksforgeeks': 5}

Encoded Document is:
[[0 0 0 1 1 0 0 1 0 1 0 1]
 [0 0 1 0 2 0 1 0 0 0 0 1]
 [1 1 0 1 1 1 0 1 1 0 1 0]]
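The table shown earlier can be reproduced directly from the fitted vectorizer. The short sketch below is an optional extension of the example above: it assumes scikit-learn 1.0 or newer (for get_feature_names_out) and pandas, neither of which the original snippet requires, and it uses fit_transform as a shorthand for calling fit followed by transform.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and encodes the texts in one step
vector = vectorizer.fit_transform(document)

# Label the columns with the vocabulary words (in index, i.e. alphabetical,
# order) to reproduce the document-term table shown earlier.
df = pd.DataFrame(vector.toarray(),
                  columns=vectorizer.get_feature_names_out(),
                  index=["document[0]", "document[1]", "document[2]"])
print(df)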
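To see why the encoded result is described as a sparse matrix, the object returned by transform (or fit_transform) can be inspected directly: it is a SciPy sparse matrix that stores only the non-zero counts. A minimal sketch, assuming the same document list as above:

from sklearn.feature_extraction.text import CountVectorizer

document = ["One Geek helps Two Geeks",
            "Two Geeks help Four Geeks",
            "Each Geek helps many other Geeks at GeeksforGeeks"]

vector = CountVectorizer().fit_transform(document)

# The encoded result is a SciPy sparse matrix (CSR format)
print(type(vector))

# Number of stored (non-zero) counts, compared with 3 x 12 = 36 cells in total
print(vector.nnz)

# Printing the matrix lists each stored entry as: (row, column)  count
print(vector)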