Boost Embedding Model Accuracy for Custom Information Retrieval

Customizing embedding models is crucial for effective information retrieval, especially when working with domain-specific data like legal text, medical records, or multi-turn customer conversations. Generic, open-domain models often struggle to capture the nuances and structure of such specialized content. 

Coxwave Align, an analytics platform for conversational-AI products, is using NVIDIA NeMo Curator to build a high-quality, domain-specific dataset to fine-tune embedding models. This customization improved the semantic alignment between queries and documents, enabling the fine-tuned model to outperform both open- and closed-source alternatives in retrieval accuracy.

These fine-tuned embeddings were integrated into Coxwave’s retrieval-augmented generation (RAG) pipeline, serving as the foundation for the retriever component. The enhanced retriever produced more relevant candidate documents, which were then passed to a reranker for deeper evaluation before reaching the generation step. 

Contrary to the intuitive “more data equals better performance” approach, Coxwave found that rigorous data curation was far more impactful than simply increasing dataset size. The time invested in preprocessing and removing redundant patterns was minimal compared to the 6x reduction in training time, while the resulting models showed better generalization and less overfitting to similar conversational structures.

While fine-tuning introduced potential latency and scalability trade-offs, careful data curation enabled the Coxwave team to use smaller, optimized models. This led to faster inference times and fewer documents needing reranking, making the system both accurate and efficient in production. 

In this blog post, we cover the NeMo Curator features that the Coxwave team used to build their data processing pipelines and curate high-quality data. This post serves as an inspiration for other enterprises and developers to think through the various decisions in fine-tuning their embedding models to improve the overall accuracy of their multi-turn retrieval systems.

Retrieving multi-turn conversations

Coxwave Align is an advanced analytics engine for conversational AI applications. It helps teams analyze conversations to determine when users are satisfied or dissatisfied, identify patterns in dialogue that could lead to new revenue opportunities, and understand why some users engage in longer conversations, driving product improvement through data-driven insights.

Unlike traditional information retrieval (IR) systems designed to search across static documents, their model is specifically optimized for searching within dynamic conversation histories. This shift in domain introduces unique challenges: the structure, semantics, and flow of conversational data differ from those of conventional documents, and the format of user queries often reflects this difference. As a result, traditional IR techniques fall short when applied to conversation data.

To address these challenges, Coxwave fine-tuned its retrieval models to better understand the nuances of conversational context, intent, and turn-based dialogue. To support this, they used the NVIDIA NeMo Curator to curate a high-quality, domain-specific dataset tailored to these conversational use cases. 

Image showing the workflow for customizing the embedding model from Coxwave.
Figure 1. Coxwave’s workflow for building a custom embedding model

Their approach goes beyond simplistic retrieval methods that fetch the single most relevant response. Instead, their embedding models retrieve the top-K most relevant data points from various conversation turns and segments. The system can analyze and synthesize information across multiple interaction points, creating more comprehensive and contextually appropriate responses for complex queries.

For example, imagine a user asking a product support chatbot: 

“Why isn’t my billing statement showing the discount from last month?”

The system then identifies and retrieves information from multiple relevant conversation turns—the turn where the discount was initially discussed, the segment containing billing cycle policies, and the turn where eligibility was confirmed. By synthesizing knowledge across these conversation segments, the system can construct comprehensive responses to complex, context-dependent queries that span multiple interaction points.
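The turn-level retrieval described above can be illustrated with a minimal sketch: embed each conversation turn, then rank turns by cosine similarity to the query embedding. The toy three-dimensional vectors below are stand-ins for real model embeddings, and the helper names are illustrative, not Coxwave's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_turns(query_vec, turn_vecs, k=3):
    """Return the indices of the k turns most similar to the query."""
    ranked = sorted(range(len(turn_vecs)),
                    key=lambda i: cosine(query_vec, turn_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings standing in for real model outputs.
turns = [
    [0.9, 0.1, 0.0],   # turn where the discount was discussed
    [0.1, 0.9, 0.0],   # unrelated small talk
    [0.7, 0.2, 0.1],   # billing-cycle policy segment
]
query = [0.8, 0.15, 0.05]  # "Why isn't my discount showing?"
print(top_k_turns(query, turns, k=2))  # → [0, 2]: the two relevant turns
```

Retrieving top-K segments rather than a single best match is what lets the downstream generator synthesize an answer spanning multiple interaction points.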

Curating high-quality data for fine-tuning

As shown in Figure 2, Coxwave’s team began with about 2.4 million conversation data samples (approximately 9.1 GB) that contained open-source conversation data and synthetic dialogues. The team systematically refined the data through step-by-step filtering using NeMo Curator features, such as exact deduplication, fuzzy deduplication, quality filtering, and semantic deduplication. In the end, the team removed about 76% of the data, curating 605,000 high-quality conversation samples.

Image showing the data curation pipeline for processing conversational data.
Figure 2. Data curation pipeline for processing conversational data

“With NeMo Curator, we were able to process data efficiently and curate a high-quality dataset tailored for our embedding model customization, resulting in an accuracy improvement of about 12%. The reduction in training data size cut our training time by 6X (from 32 hours to just 6), substantially improved the model’s convergence speed, and saved 80% in compute costs,” said Sangyeop Kim, AI Research Team Lead at Coxwave.

Let’s go over each of these steps in more detail:

Exact and fuzzy deduplication

The NeMo Curator deduplication modules played a crucial role in preprocessing (cleaning) large conversational datasets. Among these, exact and fuzzy deduplication excelled at identifying conversations with slight variations, often resulting from prompt engineering or rephrasings. The exact deduplication module efficiently identifies and removes identical documents by hashing each document and retaining only one per hash, while the fuzzy deduplication module identifies and removes near-duplicate documents by computing MinHash signatures and employing locality sensitive hashing (LSH) to detect documents with high Jaccard similarity. 

Using exact and fuzzy deduplication, the Coxwave team filtered out about 5% of the data (from 2.47 to 2.35 million conversations).
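As a rough, plain-Python illustration of the two techniques (not NeMo Curator's GPU-accelerated API), exact deduplication hashes each document and keeps one per hash, while MinHash signatures let near-duplicates be detected cheaply: the fraction of matching signature positions approximates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib
import random

def exact_dedup(docs):
    """Keep one document per content hash (exact deduplication)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def minhash_signature(text, num_hashes=128, seed=0):
    """MinHash signature over 5-character shingles; similar texts
    share many per-hash minima."""
    shingles = {text[i:i + 5] for i in range(max(1, len(text) - 4))}
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(s) ^ m for s in shingles) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Matching signature positions estimate Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the billing discount was applied last month")
b = minhash_signature("the billing discount was applied last month!")
print(round(estimated_jaccard(a, b), 2))  # close to 1 for near-duplicates
```

In practice, locality sensitive hashing buckets signatures so that only candidate pairs are compared, avoiding the quadratic cost of all-pairs comparison.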

Semantic deduplication

The NeMo Curator semantic deduplication module enhances dataset quality by identifying and removing semantically similar documents, even when they are not exact matches, through the use of embeddings and clustering techniques. NeMo Curator uses RAPIDS libraries to accelerate exact, fuzzy, and semantic deduplication on GPUs, significantly reducing the data processing time. 

Using semantic deduplication, the Coxwave team removed about 57% of the filtered data.
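As a conceptual sketch only (NeMo Curator's module clusters embeddings, for example with GPU-accelerated k-means, and compares within clusters for scalability, rather than doing a greedy pairwise pass), semantic deduplication can be thought of as dropping any document whose embedding is too close to one already kept:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def semantic_dedup(embeddings, threshold=0.95):
    """Greedy sketch: keep an item only if its embedding is farther than
    `threshold` from every item kept so far. Returns kept indices.
    The threshold is illustrative, not Coxwave's setting."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Two near-parallel vectors (paraphrases) and one orthogonal (distinct topic).
print(semantic_dedup([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]))  # → [0, 2]
```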

Quality filtering

To curate high-quality data, the Coxwave team used the NeMo Curator Quality Classifier, a text classification model that classifies documents as “high”, “medium”, or “low” quality. With quality filtering, the data was further reduced from 1.08 million to 610,000 high-quality conversations.

Heuristic filtering

Finally, the Coxwave team applied heuristic filters to remove conversations with excessive punctuation, URLs, and repeated information. This step removed about 5,000 conversations, leaving a final set of 605,000 high-quality conversations.
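A sketch of such heuristic filters might look like the following; the thresholds here are made up for illustration and are not Coxwave's actual settings:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def passes_heuristics(text, max_punct_ratio=0.2, max_repeat_ratio=0.3):
    """Return True if the text survives all three illustrative filters:
    punctuation ratio, URL presence, and word repetition."""
    if not text.strip():
        return False
    # Excessive punctuation: fraction of non-alphanumeric, non-space chars.
    punct = sum(not c.isalnum() and not c.isspace() for c in text)
    if punct / len(text) > max_punct_ratio:
        return False
    # Embedded URLs.
    if URL_RE.search(text):
        return False
    # Repeated information: how much of the text is duplicate words.
    words = text.lower().split()
    if words and 1 - len(set(words)) / len(words) > max_repeat_ratio:
        return False
    return True

print(passes_heuristics("Hello, how can I help you today?"))  # → True
print(passes_heuristics("see https://example.com now"))       # → False
```

Filters like these are cheap relative to model-based classification, which is why they are often run as a final cleanup pass.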

Synthetic data construction

Using these approximately 605,000 conversations, the team generated five synthetic queries for each conversation (two positive and three hard-negative), creating 3 million query-conversation pairs. The team focused on validating the quality of these pairs by examining the relationship between each query and its corresponding conversation. Through this verification process, they filtered the original 3 million pairs down to 2.5 million high-quality query-conversation pairs that passed their thorough quality tests for the final dataset.
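The pairing scheme (two positive and three hard-negative queries per conversation) can be sketched as follows; the field names and IDs are illustrative, not Coxwave's schema:

```python
def build_training_pairs(conversation_id, positives, hard_negatives):
    """Expand one conversation's synthetic queries into labeled
    query-conversation pairs for contrastive fine-tuning."""
    pairs = [{"query": q, "conversation": conversation_id, "label": 1}
             for q in positives]
    pairs += [{"query": q, "conversation": conversation_id, "label": 0}
              for q in hard_negatives]
    return pairs

pairs = build_training_pairs(
    "conv-001",
    positives=[
        "Why is my discount missing from the bill?",
        "Billing statement doesn't show last month's discount",
    ],
    hard_negatives=[
        "How do I cancel my subscription?",
        "When is my next billing date?",
        "Can I change my payment method?",
    ],
)
print(len(pairs))  # → 5 pairs per conversation (2 positive + 3 hard-negative)
```

Hard negatives, queries that are topically close but not actually answered by the conversation, are what teach the embedding model to make fine-grained distinctions.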

The pipeline Coxwave used is just one example. Enterprises can customize their own by selecting the various features of NeMo Curator that best fit their goals and workflows. Note that in many enterprises, there may not be enough data to evaluate and customize RAG systems. To address this, the NeMo Curator team has provided synthetic data generation pipelines for both evaluating and fine-tuning RAG pipelines. Check out the blog post.

Results

Using a test set of 1,500 queries and 9,100 conversations, the Coxwave team evaluated their fine-tuned model using NDCG@10 to measure ranking quality and Recall@10 to measure how many relevant results were retrieved. They then compared the results with leading open-source and proprietary embedding models. The results were impressive—the fine-tuned model outperformed all the models it was compared with.
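For reference, the two metrics can be computed as follows (a binary-relevance sketch, not Coxwave's evaluation harness):

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG: each hit contributes 1/log2(rank+1),
    normalized by the best achievable ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal

retrieved = ["c3", "c7", "c1", "c9"]   # ranked conversation IDs (toy data)
relevant = {"c3", "c1"}                # ground-truth relevant conversations
print(recall_at_k(retrieved, relevant))           # → 1.0 (both found)
print(round(ndcg_at_k(retrieved, relevant), 2))   # → 0.92 (c1 ranked 3rd, not 2nd)
```

Recall@10 rewards finding the relevant conversations anywhere in the top 10, while NDCG@10 additionally rewards ranking them near the top.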

Figure 3 shows a bar chart comparing the accuracy of the fine-tuned embedding model to the other models across various thresholds. For both metrics, the fine-tuned model performs 15-16% better than the next best alternative. 

Bar chart showing the comparison of accuracy results of the embedding models for information retrieval. 
Figure 3. Comparison of accuracy results of embedding models for information retrieval 

Since the curated dataset is smaller, model training time was reduced by approximately 6x, from 32 hours with unprocessed data to 5 hours. Training loss was also much lower, with smaller and less frequent oscillations, resulting in more stable training. 

Bar chart showing comparison of training time for embedding model with raw data and data curated with NeMo Curator.
Figure 4. Comparison of training time for the embedding model with raw data and data curated with NeMo Curator

Get started

In summary, the Coxwave team used NeMo Curator to curate high-quality conversational data for customizing their embedding model, achieving a 15% accuracy improvement over the next best alternative. NeMo Curator also helped reduce the data size, which in turn decreased the training time by 6x.

To learn more about NeMo Curator’s data processing features and how to use them in your data pipelines, check out the following links.
