Transform Text Features to Numerical Features with CatBoost
Handling text and categorical data well is essential for building accurate machine learning models. CatBoost, Yandex's gradient boosting library, excels here: it supports categorical features natively and provides built-in methods to convert text features into numerical ones, both of which can significantly improve model performance. This article focuses on how to transform text features into numerical features using CatBoost, enhancing the model's predictive power.
Text Processing in CatBoost
Text features in CatBoost are used to build new numeric features. These features are essential for tasks involving natural language processing (NLP), where raw text data needs to be converted into a format that machine learning models can understand and process effectively.
CatBoost's text processing involves several stages (a minimal illustration follows the list):
- Tokenization: splitting the text into meaningful tokens.
- Embedding: converting tokens into numeric vectors.
- Aggregation: combining these vectors into fixed-length numerical features.
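For intuition, here is a minimal, library-free sketch of those three stages: it tokenizes two toy sentences, builds a small token dictionary, and aggregates token counts into fixed-length Bag-of-Words vectors. This is only an illustration of the idea, not CatBoost's internal implementation, and the toy sentences are made up.
Python
# Illustrative only: tokenize -> build dictionary -> aggregate into fixed-length vectors
texts = ["catboost handles text", "text becomes numbers"]

# Tokenization: split each document into tokens
tokenized = [t.split(" ") for t in texts]

# Dictionary: map every distinct token to an index
vocab = {tok: i for i, tok in enumerate(sorted({tok for doc in tokenized for tok in doc}))}

# Aggregation: one fixed-length Bag-of-Words count vector per document
bow = []
for doc in tokenized:
    vec = [0] * len(vocab)
    for tok in doc:
        vec[vocab[tok]] += 1
    bow.append(vec)

print(vocab)
print(bow)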
Handling Text Features in CatBoost
When dealing with text features, it is crucial that the training and test datasets have matching column layouts. This can be managed with CatBoost's Pool class, where feature columns can be specified by name.
Example of Using Text Features:
model.fit(X_train, y_train, text_features=['text'])
For prediction, call predict on data with the same column layout:
preds_class = model.predict(X_test)
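Putting these snippets together, a minimal end-to-end sketch might look like the following; the toy DataFrame, its 'text' column, and the hyperparameters are assumptions made purely for illustration:
Python
from catboost import CatBoostClassifier
import pandas as pd

# Toy data (made up for illustration): the 'text' column holds raw strings
df = pd.DataFrame({
    'text': ['good movie', 'good movie indeed', 'very good movie',
             'bad movie', 'bad movie indeed', 'very bad movie'],
    'label': [1, 1, 1, 0, 0, 0]
})
X_train, y_train = df[['text']], df['label']

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X_train, y_train, text_features=['text'])  # declare which columns are raw text

X_test = pd.DataFrame({'text': ['very good indeed', 'bad movie']})
preds_class = model.predict(X_test)
print(preds_class)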
Steps to Transform Text Features to Numerical Features
1. Loading and Storing Text Features
Text features are loaded into CatBoost in the same way as other feature types. They can be specified in the column description file or, in the Python package, via the text_features parameter.
2. Preprocessing Text Features
CatBoost preprocesses text features with tokenizers and dictionaries. Tokenizers split the raw text into tokens, while dictionaries define which tokens (or n-grams of tokens) are kept and map them to numeric indices.
Example of a Dictionary:
dictionaries = [{
    'dictionary_id': 'Unigram',
    'max_dictionary_size': '50000',
    'gram_order': '1',  # single tokens
}, {
    'dictionary_id': 'Bigram',
    'max_dictionary_size': '50000',
    'gram_order': '2',  # pairs of consecutive tokens
}]
Example of a Tokenizer:
tokenizers = [{
    'tokenizer_id': 'Space',
    'delimiter': ' ',
}]
3. Calculating New Features
Feature calculators (feature calcers) are used to generate new numeric features from the preprocessed text data. These calculators can include methods like Bag of Words (BoW), Naive Bayes, and others.
Example of Feature Calcers:
feature_calcers = [
    'BoW:top_tokens_count=1000',  # Bag-of-Words counts over the most frequent tokens
    'NaiveBayes',                 # Naive Bayes class-likelihood features
]
4. Training the Model
Once the text features are preprocessed and new numeric features are calculated, they are passed to the regular CatBoost training algorithm.
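The sketch below shows one way these pieces can be wired together in the Python package. The toy DataFrame is made up, and the text-processing option names follow CatBoost's documented parameters but may vary slightly between versions:
Python
from catboost import CatBoostClassifier
import pandas as pd

# Made-up toy data for illustration
df = pd.DataFrame({
    'review': ['good movie', 'good movie indeed', 'very good movie',
               'bad movie', 'bad movie indeed', 'very bad movie'],
    'label': [1, 1, 1, 0, 0, 0]
})

model = CatBoostClassifier(
    iterations=100,
    tokenizers=[{'tokenizer_id': 'Space', 'delimiter': ' '}],
    dictionaries=[
        # occurrence_lower_bound is lowered so tokens in this tiny dataset are kept
        {'dictionary_id': 'Unigram', 'max_dictionary_size': '50000',
         'gram_order': '1', 'occurrence_lower_bound': '1'},
        {'dictionary_id': 'Bigram', 'max_dictionary_size': '50000',
         'gram_order': '2', 'occurrence_lower_bound': '1'},
    ],
    feature_calcers=['BoW:top_tokens_count=1000', 'NaiveBayes'],
    verbose=False,
)
model.fit(df[['review']], df['label'], text_features=['review'])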
Text Features to Numerical Features using CatBoost: Implementation
Step 1: Install and Import CatBoost
Ensure you have CatBoost installed:
!pip install catboost
Importing CatBoost
Python
from catboost import CatBoostClassifier, Pool
import pandas as pd
Step 2: Prepare Dataset
We'll illustrate the procedure with a small example dataset containing the categorical features "City" and "Weather":
Python
data = {
'City': ['New York', 'London', 'Tokyo', 'New York', 'Tokyo'],
'Weather': ['Sunny', 'Rainy', 'Sunny', 'Snowy', 'Rainy'],
'Label': [1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
Step 3: Define Features and Target
Separate the feature columns from the target variable:
Python
X = df[['City', 'Weather']]
y = df['Label']
Step 4: Initialize and Train the Model
Initialize the CatBoostClassifier and specify which features are categorical. Then create a Pool object to hold the data and indicate the categorical columns:
Python
categorical_features = ['City', 'Weather']
model = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, loss_function='Logloss')
train_pool = Pool(data=X, label=y, cat_features=categorical_features)
model.fit(train_pool)
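Once training finishes, the fitted model can be used for prediction in the usual way. Purely for illustration, the sketch below reuses the training Pool to show the call signature:
Python
# Illustration only: predicting on the training Pool to show the API
preds_class = model.predict(train_pool)        # predicted labels
preds_proba = model.predict_proba(train_pool)  # class probabilities
print(preds_class)
print(preds_proba)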
Step 5: View Transformed Features
During training, CatBoost internally transforms the categorical features into numeric statistics. To see how much the transformed features contribute to the model, inspect the feature importances:
Python
importances = model.get_feature_importance(train_pool, prettified=True)
print(importances)
Output:
Feature Id Importances
0 City 82.857487
1 Weather 17.142513
Conclusion
Transforming text features into numerical features in CatBoost involves preprocessing text data using dictionaries and tokenizers, calculating new numeric features with feature calcers, and then training the model. This process enhances the model's ability to handle text data effectively, making CatBoost a robust tool for NLP tasks. By following the steps outlined in this article, you can leverage CatBoost's capabilities to transform and utilize text features in your machine learning models, improving their predictive performance.