Retrieval-Augmented Generation (RAG) technology enhances the capability of large language models (LLMs) in private domain knowledge Q&A by retrieving relevant information from external knowledge bases and merging it with user inputs. EAS provides scenario-based deployment methods that support flexible selection of LLMs and vector databases, enabling the rapid construction and deployment of RAG chatbots. This topic describes how to deploy a RAG-based chatbot and perform model inference.
Step 1: Deploy a RAG service
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment area, click RAG-based Smart Dialogue Deployment.
On the RAG-based LLM Chatbot Deployment page, configure the parameters and then click Deploy. When the Service Status changes to Running, the service is successfully deployed. Service deployment usually takes about 5 minutes. The actual duration may differ based on the number of model parameters or other factors. The following table describes the key parameters.
Basic Information
Parameter
Description
Version
The following two versions are supported for deployment:
LLM-Integrated Deployment: Deploy the LLM service and RAG service within the same service.
LLM-Separated Deployment: Deploy only the RAG service. You can then connect to any LLM service and switch between LLM services as needed, which provides greater flexibility.
Model Type
When you select LLM-Integrated Deployment, you need to choose the LLM to deploy. You can select an open-source model based on your specific use case.
Resource Information
Parameter
Description
Deployment Resources
If you select LLM-Integrated Deployment, the system automatically matches an appropriate instance type. If you change to a different instance type, the model service may fail to start.
If you select LLM-Separated Deployment, we recommend that you select an instance type with at least 8 vCPUs and 16 GB of memory, such as ecs.g6.2xlarge or ecs.g6.4xlarge.
Vector Database Settings
RAG supports building vector databases using Faiss (Facebook AI Similarity Search), Elasticsearch, Hologres, OpenSearch, or ApsaraDB RDS for PostgreSQL. Select a version type based on your scenario to serve as the vector database.
Faiss
Faiss lets you quickly build a local vector database, without the need to purchase or activate an online vector database service.
Parameter
Description
Version Type
Select FAISS.
OSS Address
Select the OSS storage path created in the current region to store the uploaded knowledge base files. If no storage path is available, you can refer to Quick Start for Console to create one.
Note: If you use a custom fine-tuned model deployment service, ensure that the selected OSS storage path does not overlap with the path of the custom fine-tuned model to avoid conflicts.
Elasticsearch
Specify the connection information of an Alibaba Cloud Elasticsearch instance. For information on how to create an Elasticsearch instance and prepare configuration items, see Prepare the vector database Elasticsearch.
Parameter
Description
Version Type
Select Elasticsearch.
Private Network Address/port
Configure the private network address and port of the Elasticsearch instance in the format http://<private network address>:<private network port>. For information about how to obtain the private network address and port number of an Elasticsearch instance, see View the basic information of an instance.
Index Name
Enter a new index name or an existing index name. For an existing index name, the index schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the index that is automatically created when you deploy the RAG-based chatbot by using EAS.
Account
Specify the logon name that you configured when you created the Elasticsearch instance. The default logon name is elastic.
Password
Specify the password that you configured when you created the Elasticsearch instance. If you forget the logon password, you can reset the instance access password.
Hologres
Specify the connection information of a Hologres instance. If you have not activated a Hologres instance, you can refer to Purchase Hologres for more information.
Parameter
Description
Version Type
Select Hologres.
Invocation Information
Specify the host information of the designated VPC. Go to the instance details page of the Hologres Management Console. In the Network Information area, click Copy next to Designated VPC and use the host information that precedes :80 in the endpoint.
Database Name
Specify the database name of the Hologres instance. For more information about how to create a database, see Create a Database.
Account
Specify the custom account that you created. For more information, see Create a Custom User, where you select Select Member Role and choose Superuser.
Password
Specify the password of the custom account that you created.
Table Name
Enter a new table name or an existing table name. For an existing table name, the table schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the Hologres table that is automatically created when you deploy the RAG-based chatbot by using EAS.
OpenSearch
Specify the connection information of an OpenSearch instance of Vector Search Edition. For information on how to create an OpenSearch instance and prepare configuration items, see Prepare the vector database OpenSearch.
Parameter
Description
Version Type
Select OpenSearch.
Access Address
Specify the public endpoint of the OpenSearch instance of Vector Search Edition. You need to enable the public access feature for the OpenSearch instance of Vector Search Edition. For more information, see Prepare the vector database OpenSearch.
Instance ID
Obtain the instance ID from the OpenSearch Vector Search Edition instance list.
Username
Enter the username and password that you specified when you created the OpenSearch instance of Vector Search Edition.
Password
Table Name
Enter the name of the index table of the OpenSearch instance of Vector Search Edition that you created. For more information about how to prepare the index table, see Prepare the vector database OpenSearch.
ApsaraDB RDS for PostgreSQL
Specify the connection information of the ApsaraDB RDS for PostgreSQL instance. For information on how to create an ApsaraDB RDS for PostgreSQL instance and prepare configuration items, see Prepare a vector database by using ApsaraDB RDS for PostgreSQL.
Parameter
Description
Version Type
Select RDS PostgreSQL.
Host Address
Specify the internal network address of the ApsaraDB RDS for PostgreSQL instance. You can go to the ApsaraDB RDS for PostgreSQL console page and view the Database Connection page of the ApsaraDB RDS for PostgreSQL instance.
Port
The default value is 5432. Enter the value based on your actual situation.
Database
Specify the name of the database that you created. For more information about how to create a database and an account, see Create an Account and a Database, where:
When you create an account, select Account Type and choose Privileged Account.
When you create a database, select Authorized Account and choose the privileged account that you created.
Table Name
Specify the name of the database table.
Account
Specify the privileged account and password that you created. For more information about how to create a privileged account, see Create an Account and a Database, where you select Account Type and choose Privileged Account.
Password
VPC
Parameter
Description
VPC
If you selected LLM-Separated Deployment when deploying the RAG service, make sure that the RAG service can connect to the LLM service:
Public access: Select a virtual private cloud (VPC) that can access the Internet. For more information, see Configure Internet access.
Private access: Associate the same VPC with both the RAG service and the LLM service.
If you need to use a model from Model Studio or perform Q&A with web search, you must configure a VPC that can access the Internet. For more information, see Configure Internet access.
Network requirements for the vector database:
The Faiss vector database does not require network access.
Hologres, Elasticsearch, and ApsaraDB RDS for PostgreSQL can be accessed by EAS over the Internet or a private network. Private network access is recommended and requires that the VPC configured for EAS is the same as the VPC of the vector database. For more information about how to create a VPC, a vSwitch, and a security group, see Create and Manage a VPC and Create a Security Group.
EAS can only access OpenSearch through the public network. For more information about how to configure the access method, see Step 2: Prepare Configuration Items.
vSwitch
Security Group Name
Step 2: Debug on the web UI
After the RAG service is successfully deployed, click View Web Application in the Service Method column to launch the web UI.
Follow the steps below to upload your knowledge base file on the web UI and test the Q&A chatbot.
1. Vector database and LLM settings
On the Settings tab, you can modify the embedding-related parameters and the LLM in use. It is recommended to use the default configuration.
To use DashScope, you need to configure Internet access for EAS and configure the API key for Alibaba Model Studio. The Model Studio model call is billed separately. For more information, see Billable items.
Index parameter descriptions:
Parameter
Description
Index Name
The system supports updating existing indexes. You can select New from the drop-down list to add a new index and isolate different knowledge base data by specifying the index name. For more information, see How to Use RAG Service for Knowledge Base Data Isolation?.
EmbeddingType
Supports two model sources: Huggingface and Dashscope.
Huggingface: The system provides built-in embedding models for you to choose from.
Dashscope: Uses the Model Studio model, which defaults to the text-embedding-v2 model. For more information, see Embedding.
Embedding Dimension
The output vector dimension. The dimension setting directly affects the model's performance. After you select an embedding model, the system automatically configures the embedding dimension, and no manual operation is required.
Embedding Batch Size
The batch size used when computing embeddings.
LLM parameter descriptions
When you select LLM-Separated Deployment, refer to Deploy an LLM to deploy the LLM service. Then, click the LLM service name. In the Basic Information area, click View Invocation Information to obtain the service access address and token.
Note:
Public access: Associate the RAG service with a VPC that supports Internet access.
Private access: Place the RAG service and the LLM service in the same VPC.
The following table describes the parameters.
Parameter
Description
LLM Base URL
In LLM-Separated Deployment, specify the access address and token that you obtained for the LLM service.
In LLM-Integrated Deployment, the system configures these parameters by default, and no modification is required.
API Key
Model Name
When deploying an LLM, if you select the accelerated deployment (vLLM) mode, you must specify the model name, such as qwen2-72b-instruct. For other deployment modes, set the model name to default.
2. Upload knowledge base files
On the Upload tab, you can upload knowledge base files. The system automatically stores the files in the vector database in the PAI-RAG format. If you upload a file whose name already exists in the knowledge base, all vector databases except Faiss overwrite the original file. Supported file types include .html, .htm, .txt, .pdf, .pptx, .md, Excel (.xlsx or .xls), .jsonl, .jpeg, .jpg, .png, .csv, and Word (.docx), such as rag_chatbot_test_doc.txt. The following upload methods are supported:
Upload local files (multiple files are supported) or a directory of files (Files or Directory tab)
Upload from OSS (Aliyun OSS tab)
Important: Before uploading, you must select Use OSS Storage and configure the related parameters in the Large Language Model section of the Settings tab.
After the upload is complete, the page displays a status indicating that the knowledge base files are uploaded successfully.
Before uploading, you can modify the concurrency control and semantic chunking parameters. The parameter descriptions are as follows:
| Parameter | Description |
| --- | --- |
| Number of workers to parallelize data-loading over | The concurrency control parameter. The default value is 4, which means the system can start four processes at the same time to upload files. |
| Chunk Size | The size of each chunk. Unit: bytes. Default value: 500. |
| Chunk Overlap | The overlap between adjacent chunks. Default value: 10. The sketch after this table illustrates how chunk size and overlap interact. |
| Process with MultiModal | Uses a multimodal model to process images in PDF, Word, and MD files. If you use a multimodal LLM, turn on this switch. |
| Process PDF with OCR | Uses OCR mode to parse PDF files. |
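To get a feel for what Chunk Size and Chunk Overlap control, the following minimal Python sketch splits text into fixed-size chunks with a configurable overlap. It counts characters instead of bytes and ignores the semantic rules that PAI-RAG applies, so treat it only as an illustration of the two parameters.
# Illustrative only: split text into fixed-size chunks with overlap.
# PAI-RAG's actual chunking logic may differ (for example, semantic splitting).
def chunk_text(text, chunk_size=500, chunk_overlap=10):
    chunks = []
    step = chunk_size - chunk_overlap  # each new chunk starts chunk_overlap characters before the previous one ends
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

sample = "PAI " * 1000            # 4,000 characters of sample text
print(len(chunk_text(sample)))   # 9 chunks of at most 500 characters each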
3. Model inference verification
On the Chat tab, select the knowledge base index (Index Name) to use, configure the Q&A strategy, and perform Q&A tests. The following four Q&A strategies are supported:
Retrieval: Directly retrieve and return the top K similar results from the vector database.
LLM: Directly use the LLM to answer.
Chat(Web Search): Automatically determine whether online search is needed based on the user's question. If online search is needed, input the search results and the user's question into the LLM service. To use online search, you need to configure public network connection for EAS.
Chat(Knowledge Base): Merge the results returned from the vector database retrieval with the user's question and fill them into the selected prompt template. Then, input them into the LLM service for processing to obtain the Q&A results.
More inference parameter descriptions are as follows:
General Parameters
Parameter
Description
Streaming Output
After you select Streaming Output, the system outputs the results in a streaming manner.
Need Citation
Specifies whether to include citations in the answer.
Inference with multi-modal LLM
Whether to display images when using a multimodal LLM.
Vector Retrieval Parameters
Retrieval Mode: The following three retrieval methods are supported:
Embedding Only: Vector database-based retrieval. This retrieval method performs well in most complex scenarios, especially suitable for handling queries related to semantic similarity and contextual relevance.
Keyword Only: Keyword-based retrieval. This retrieval method has advantages in vertical domains with scarce corpora or scenarios requiring precise matching. PAI also offers keyword-based retrieval algorithms such as BM25 to perform sparse retrieval operations by calculating the overlap of keywords between the user query and knowledge documents. This method is simple and efficient.
Hybrid: Multi-path retrieval that combines vector database-based retrieval and keyword-based retrieval. To combine the advantages of both retrieval methods, PAI uses the reciprocal rank fusion (RRF) algorithm, which computes a weighted sum of the reciprocal ranks that a document obtains in each retrieval method to produce a total score, improving retrieval accuracy and efficiency. If you set Retrieval Mode to Hybrid, PAI uses the RRF algorithm by default.
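To illustrate how RRF combines the two ranked lists, the following minimal Python sketch computes a fused score for each document from its rank in the embedding results and in the keyword results. The constant k, the weights, and the function name are illustrative values only, not the exact parameters that PAI uses.
# Minimal illustration of reciprocal rank fusion (RRF); k and the weights are example values.
def rrf_scores(embedding_ranking, keyword_ranking, k=60, w_embed=0.5, w_keyword=0.5):
    scores = {}
    for weight, ranking in ((w_embed, embedding_ranking), (w_keyword, keyword_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retrieval path contributes weight / (k + rank) to the document's total score.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Sort documents by fused score in descending order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: doc_b ranks high in both lists, so it comes out first.
print(rrf_scores(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d", "doc_a"]))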
The following table describes the supported vector database-based retrieval parameters.
Parameter
Description
Text Top K
Retrieves top K relevant text segments. Valid values: 0 to 100.
Image Top K
Retrieves top K relevant images. Valid values: 0 to 10.
Similarity Score Threshold
The similarity score threshold. A larger score indicates greater similarity.
Reranker Type
The reranking type:
no-reranker: No reranking is performed.
model-based-reranker: Uses a reranking model to perform high-accuracy reranking on the top K results from the initial retrieval to obtain more relevant and accurate results.
Note: The first time you use this feature, model loading may take a long time. Select the reranking type based on your business requirements.
If you set Retrieval Mode to Hybrid, you can configure the following parameters to adjust the ratio between vector database-based retrieval and keyword-based retrieval and optimize the hybrid retrieval results.
Weight of embedding retrieval results
The vector retrieval weight.
Weight of keyword retrieval results
The keyword retrieval weight.
Online Search Parameters
Parameter
Description
bing: Configure Bing search.
Bing API Key
Used to access Bing search. For more information about how to obtain a Bing API key, see Bing Web Search API.
Search Count
The number of web pages to search. The default value is 10.
Language
The search language. You can select zh-CN (Chinese) or en-US (English).
LLM Parameters
Temperature: Controls the randomness of the generated content. The lower the temperature, the more fixed the output result. The higher the temperature, the more diverse and creative the output result.
Step 3: API calling
The following content describes the API calling methods of common RAG features. For more information about the API calling methods of other features, such as managing knowledge base indexes and updating the configurations of a RAG service, see Applicable to images with the versions earlier than v0.3.0.
The query and upload APIs can specify index_name to switch the knowledge base. If the index_name parameter is omitted, the default knowledge base is default_index. For more information, see How to Use RAG Service for Knowledge Base Data Isolation?
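For example, a minimal Python sketch of a knowledge base query that targets a specific index might look like the following. The index_name field in the query body is an assumption based on this behavior, and the placeholders are the service access address and token described in the next section.
import requests

SERVICE_URL = '<EAS_SERVICE_URL>'  # RAG service access address, ending with /
headers = {'Authorization': '<EAS_TOKEN>', 'Content-Type': 'application/json'}

# Assumption: the query API accepts an index_name field to select the knowledge base.
response = requests.post(SERVICE_URL + 'api/v1/query',
                         headers=headers,
                         json={'question': 'What is PAI?', 'index_name': 'default_index'})
print(response.json())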
Obtain invocation information
Click the RAG service name to go to the Service Details page.
In the Basic Information area, click View Invocation Information.
In the Invocation Information dialog box, obtain the service access address and token.
NoteYou can use public network access or private network access.
To use public network access, the client must support Internet access.
To use private network access, the client must be in the same VPC as the RAG service.
Upload knowledge base files
You can upload local knowledge base files through the API. You can query the status of the file upload task based on the task_id returned by the upload interface.
In the following example, replace <EAS_SERVICE_URL> with the access address of the RAG service and <EAS_TOKEN> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.
Upload a single file
# Replace <EAS_TOKEN> and <EAS_SERVICE_URL> with the service token and the service access address, respectively.
# Replace the path after "-F 'files=@" with the path to your file.
# Configure index_name as the name of your knowledge base index.
curl -X 'POST' '<EAS_SERVICE_URL>/api/v1/upload_data' \
  -H 'Authorization: <EAS_TOKEN>' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@example_data/paul_graham/paul_graham_essay.txt' \
  -F 'index_name=default_index'
Upload multiple files. Use multiple -F 'files=@path' parameters, each corresponding to a file to be uploaded, as shown in the following example:
# Replace <EAS_TOKEN> and <EAS_SERVICE_URL> with the service token and the service access address, respectively.
# Replace the path after "-F 'files=@" with the path to your file.
# Configure index_name as the name of your knowledge base index.
curl -X 'POST' '<EAS_SERVICE_URL>/api/v1/upload_data' \
  -H 'Authorization: <EAS_TOKEN>' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@example_data/paul_graham/paul_graham_essay.txt' \
  -F 'files=@example_data/another_file1.md' \
  -F 'files=@example_data/another_file2.pdf' \
  -F 'index_name=default_index'
Query the task upload status
# Replace <EAS_TOKEN> and <EAS_SERVICE_URL> with the service token and access address, respectively.
# task_id indicates the task ID returned when you upload knowledge base files.
curl -X 'GET' '<EAS_SERVICE_URL>/api/v1/get_upload_state?task_id=2c1e557733764fdb9fefa0635389****' \
  -H 'Authorization: <EAS_TOKEN>'
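If you run the upload from a script, you can combine the two calls above: submit the files, read the returned task ID, and poll get_upload_state until the task finishes. The following Python sketch shows one way to do this; the task_id and status fields in the responses are assumptions, so adjust them to the actual responses of your service version.
import time
import requests

EAS_SERVICE_URL = '<EAS_SERVICE_URL>'  # RAG service access address, without a trailing slash
EAS_TOKEN = '<EAS_TOKEN>'              # RAG service token
headers = {'Authorization': EAS_TOKEN}

# Upload a local file to the default index.
with open('example_data/paul_graham/paul_graham_essay.txt', 'rb') as f:
    upload = requests.post(EAS_SERVICE_URL + '/api/v1/upload_data',
                           headers=headers,
                           files=[('files', f)],
                           data={'index_name': 'default_index'})
upload.raise_for_status()
task_id = upload.json().get('task_id')  # field name assumed

# Poll the upload task state every 5 seconds, for up to 5 minutes.
for _ in range(60):
    state = requests.get(EAS_SERVICE_URL + '/api/v1/get_upload_state',
                         headers=headers,
                         params={'task_id': task_id})
    state.raise_for_status()
    result = state.json()
    print(result)
    if result.get('status') in ('completed', 'failed'):  # status values assumed
        break
    time.sleep(5)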
Single-round conversation request
CURL command
Note: In the following example, replace <service_url> with the access address of the RAG service and <service_token> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.
Retrieval:
api/v1/query/retrieval
curl -X 'POST' '<service_url>api/v1/query/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'
LLM:
/api/v1/query/llm
curl -X 'POST' '<service_url>api/v1/query/llm' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'
Supports additional adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Chat(Knowledge Base):
api/v1/query
curl -X 'POST' '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'
Supports additional adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.
Chat(Web Search):
api/v1/query/search
curl --location '<service_url>api/v1/query/search' \
  --header 'Authorization: <service_token>' \
  --header 'Content-Type: application/json' \
  --data '{"question":"China movie box office ranking", "stream": true}'
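When "stream": true is set, the service returns the answer incrementally instead of as a single JSON object. The following minimal Python sketch shows one way to consume such a response; the exact format of the streamed chunks depends on the service version, so the line-by-line printing is only an assumption.
import requests

SERVICE_URL = '<service_url>'   # RAG service access address, ending with /
headers = {'Authorization': '<service_token>', 'Content-Type': 'application/json'}

# stream=True tells requests not to buffer the whole response body.
with requests.post(SERVICE_URL + 'api/v1/query/search',
                   headers=headers,
                   json={'question': 'China movie box office ranking', 'stream': True},
                   stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            print(chunk)  # print each chunk as it arrives; parsing depends on the service version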
Python script
Note: In the following example, SERVICE_URL is configured as the access address of the RAG service and Authorization is configured as the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.
import requests
SERVICE_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com/'
headers = {
'accept': 'application/json',
'Content-Type': 'application/json',
'Authorization': 'MDA5NmJkNzkyMGM1Zj****YzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}
def test_post_api_query(url):
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"======= Question =======\n {data['question']}")
    if 'answer' in ans.keys():
        print(f"======= Answer =======\n {ans['answer']}")
    if 'docs' in ans.keys():
        print(f"======= Retrieved Docs =======\n {ans['docs']}\n\n")
# LLM
test_post_api_query(SERVICE_URL + 'api/v1/query/llm')
# Retrieval
test_post_api_query(SERVICE_URL + 'api/v1/query/retrieval')
# Chat (Knowledge Base)
test_post_api_query(SERVICE_URL + 'api/v1/query')
Multi-round conversation request
LLM and Chat (Knowledge Base) support sending multi-round conversation requests. The following code example shows how to do this:
cURL command
Note: In the following example, replace <service_url> with the access address of the RAG service and <service_token> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.
The following example shows how to perform RAG conversation:
# Send a request.
curl -X 'POST' '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}'
# Provide the session ID returned for the request. This ID uniquely identifies a conversation in the conversation history. After the session ID is provided, the corresponding conversation is stored and is automatically included in subsequent requests to call an LLM.
curl -X 'POST' '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What are the benefits of PAI?","session_id": "ed7a80e2e20442eab****"}'
# Provide the chat_history parameter, which contains the conversation history between you and the chatbot. The parameter value is a list in which each element indicates a single round of conversation in the {"user":"Inputs","bot":"Outputs"} format. Multiple conversations are sorted in chronological order.
curl -X 'POST' '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are its features?", "chat_history": [{"user":"What is PAI?", "bot":"PAI is Alibaba Cloud's platform for AI..."}]}'
# If you provide both the session_id and chat_history parameters, the conversation history is appended to the conversation that corresponds to the specified session ID.
curl -X 'POST' '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question":"What are its features?", "chat_history": [{"user":"What is PAI?", "bot":"PAI is Alibaba Cloud's platform for AI..."}], "session_id": "1702ffxxad3xxx6fxxx97daf7c"}'
Python
Note: In the following example, SERVICE_URL is configured as the access address of the RAG service and Authorization is configured as the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.
import requests
SERVICE_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com/'  # must end with / so that the endpoint paths below resolve correctly
headers = {
'accept': 'application/json',
'Content-Type': 'application/json',
'Authorization': 'MDA5NmJkN****jNlMDgzYzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}
def test_post_api_query_with_chat_history(url):
    # Round 1 query
    data = {
        "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    if 'answer' in ans.keys():
        print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    if 'docs' in ans.keys():
        print(f"=======Round 1: Retrieved Docs =======\n {ans['docs']}")
    # Round 2 query: pass the session_id returned in round 1 to continue the conversation.
    data_2 = {
        "question": "What are the benefits of PAI?",
        "session_id": ans['session_id']
    }
    response_2 = requests.post(url, headers=headers, json=data_2)
    if response_2.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response_2.status_code}')
    ans_2 = dict(response_2.json())
    print(f"=======Round 2: Question =======\n {data_2['question']}")
    if 'answer' in ans_2.keys():
        print(f"=======Round 2: Answer =======\n {ans_2['answer']} session_id: {ans_2['session_id']}")
    if 'docs' in ans_2.keys():
        print(f"=======Round 2: Retrieved Docs =======\n {ans_2['docs']}")
    print("\n")
# LLM
test_post_api_query_with_chat_history(SERVICE_URL + "api/v1/query/llm")
# Chat (Knowledge Base)
test_post_api_query_with_chat_history(SERVICE_URL + "api/v1/query")
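If you prefer to manage the conversation history on the client side instead of relying on session_id, you can pass the chat_history parameter described in the cURL example. The following sketch reuses SERVICE_URL and headers from the script above:
def test_post_api_query_with_explicit_history(url):
    # Each element of chat_history is one round of conversation in the {"user": ..., "bot": ...} format.
    data = {
        "question": "What are its features?",
        "chat_history": [
            {"user": "What is PAI?", "bot": "PAI is Alibaba Cloud's platform for AI..."}
        ]
    }
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    print(response.json().get('answer'))

# Chat (Knowledge Base)
test_post_api_query_with_explicit_history(SERVICE_URL + "api/v1/query")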
Precautions
This practice is subject to the maximum number of tokens of an LLM service and is designed to help you understand the basic retrieval feature of a RAG-based LLM chatbot:
The conversation length supported by the chatbot is limited by the server resources of the LLM service and the default maximum number of tokens.
If you do not need multi-round conversations, we recommend that you disable the chat history feature to reduce the possibility of reaching this limit.
Web UI operation method: On the Chat tab of the RAG service web UI, deselect the Chat history check box.
FAQ
How do I use a RAG service for knowledge base data isolation?
When different departments or individuals use their own independent knowledge bases, you can achieve effective data isolation by using the following methods:
On the Settings tab of the web UI, configure the following parameters and then click Add Index.
Index Name: Select NEW.
New Index Name: Customize a new index name. For example, INDEX_1.
Path: When you select Faiss as the VectorStore, you need to update the Path synchronously to ensure that the index name at the end of the path is consistent with the new index name.
When you upload knowledge base files on the Upload tab, you can select Index Name. After the upload is complete, the files are saved under the selected index.
When you perform a conversation on the Chat tab, select the corresponding index name. The system uses the knowledge base files under the index for knowledge Q&A, thereby achieving data isolation for different knowledge bases.
How am I charged for deploying and using RAG services?
Billing description
When you deploy a RAG-based LLM chatbot, you are charged only for EAS resources. When you use the RAG-based LLM chatbot, you are also charged for any additional services that you use, such as Alibaba Cloud Model Studio, vector databases (Elasticsearch, Hologres, OpenSearch, or ApsaraDB RDS for PostgreSQL), OSS, Internet NAT gateways, and web search services such as Bing, based on the billing rules of each service.
Suspension of billing
After you stop an EAS service, billing stops only for EAS resources. To stop billing for other services, stop or delete the related instances by following the instructions in the documentation of the corresponding service.
Can I use knowledge base files uploaded by using APIs permanently?
Knowledge base files uploaded by using APIs are not permanently stored by the RAG service itself. The storage duration depends on the configuration of the selected storage or vector database service, such as OSS, Elasticsearch, or Hologres. We recommend that you read the documentation of the corresponding service to understand its storage policy and ensure that your data is stored for as long as you need it.
Why does a parameter configured by using the API not take effect?
The RAG service of PAI allows you to configure only the parameters listed in the RAG API reference by using APIs. Other parameters must be configured on the web UI. For more information, see Step 2: Debug on the web UI.
References
You can also use EAS for other scenario-based deployments.
EAS provides stress testing for LLM services and common services. You can create stress testing tasks with a few clicks to comprehensively evaluate the performance of EAS services. For more information, see Automatic service stress testing.