Description
First:
I download the latest tensorrtllm_backend from the main branch.
git clone -b main https://github.com/triton-inference-server/tensorrtllm_backend.git
Second:
I execute the following commands to build a Docker image from the latest tensorrtllm_backend on the main branch.
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
Third:
I get a docker image like this:
# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
triton_trt_llm latest cc73de886a6d 5 hours ago 36GB
Fourth:
I launch a container from the triton_trt_llm image:
docker run -idt -p 8250:8000 -p 8251:8001 -p 8252:8002 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /tensorrt:/tensorrtllm_backend triton_trt_llm /bin/sh
Fifth:
In the container, I follow the build-tensorrt-llm steps from:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md#build-tensorrt-llm
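For reference, the build commands in that document are roughly the following (a sketch; the exact flags and the --trt_root path may differ across versions):
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl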
Sixth:
I build engines for CodeLlama-7b:
python build.py --model_dir /tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--paged_kv_cache \
--use_inflight_batching \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/ \
--vocab_size 32016 \
--rotary_base 1000000 \
--max_batch_size 32 \
--world_size 4 \
--tp_size 4
Finally:
I query the deployed model like this; as we can see, stop_words does not work.
curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 100, "bad_words": "", "stop_words": "quickSort"}'
{
"model_name":"ensemble",
"model_version":"1",
"sequence_end":false,
"sequence_id":0,
"sequence_start":false,
"text_output":"<s> def quickSort(arr):\n if len(arr) <= 1:\n return arr\n else:\n pivot = arr[0]\n lesser = [x for x in arr[1:] if x <= pivot]\n greater = [x for x in arr[1:] if x > pivot]\n return quickSort(lesser) + [pivot] + quickSort(greater)\n\n\ndef quickSort2(arr):\n "
}
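For reference, the same request can be issued from Python (a minimal sketch; it assumes the requests package and mirrors the payload of the curl call above):

import requests

# Same payload as the curl call above. Per the preprocessing docstring,
# "stop_words" is a single string that may contain several comma-separated entries.
payload = {
    "text_input": "def quickSort",
    "max_tokens": 100,
    "bad_words": "",
    "stop_words": "quickSort",
}
resp = requests.post("http://localhost:8250/v2/models/ensemble/generate", json=payload)
print(resp.json()["text_output"])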
I add log prints in preprocessing's model.py:
def _to_word_list_format(self, word_dict: List[List[str]]):
    '''
    format of word_dict
        len(word_dict) should be same to batch_size
        word_dict[i] means the words for batch i
        len(word_dict[i]) must be 1, which means it only contains 1 string
        This string can contain several sentences, split by ",".
        For example, if word_dict[2] = " I am happy, I am sad", then this function will return
        the ids for the two short sentences " I am happy" and " I am sad".
    '''
    assert self.tokenizer is not None, "need to set tokenizer"

    if word_dict is None:
        # Return an empty array of shape (1, 2, 0)
        return np.empty([1, 2, 0], dtype="int32")

    flat_ids = []
    offsets = []
    for word_dict_item in word_dict:
        item_flat_ids = []
        item_offsets = []

        if isinstance(word_dict_item[0], bytes):
            word_dict_item = [word_dict_item[0].decode()]

        words = list(csv.reader(word_dict_item))[0]
        for word in words:
            self.logger.log_info(f"================== preprocessing _to_word_list_format word: {word}")
            ids = self.tokenizer.encode(word)
            self.logger.log_info(f"================== preprocessing _to_word_list_format ids: {ids}")
            if len(ids) == 0:
                continue

            item_flat_ids += ids
            item_offsets.append(len(ids))
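For context, here is a minimal sketch (my annotation, not repo code) of the tensor this function ultimately builds for word_dict = [["quickSort"]], using the ids from the log below:

import numpy as np

# Row 0 of each batch entry holds the flattened stop-word token ids; row 1
# holds the cumulative end offset of each stop word, padded with -1.
# Values are taken from the _to_word_list_format log below.
stop_words_tensor = np.array(
    [[[1, 4996, 13685],   # ids for "quickSort" (note the leading 1)
      [3, -1, -1]]],      # a single stop word ending at offset 3
    dtype="int32",
)
print(stop_words_tensor.shape)  # (1, 2, 3)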
And here is the preprocessing log, showing the _to_word_list_format ids:
I1114 08:45:41.055726 24910 python_be.cc:1307] model preprocessing, instance preprocessing_0_0, executing 1 requests
I1114 08:45:41.084479 24910 model.py:255] ================== preprocessing _to_word_list_format word: quickSort
I1114 08:45:41.084553 24910 model.py:257] ================== preprocessing _to_word_list_format ids: [1, 4996, 13685]
I1114 08:45:41.084808 24910 infer_response.cc:167] add response output: output: INPUT_ID, type: INT32, shape: [1,4]
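Note that the logged ids are [1, 4996, 13685]: the leading id 1 looks like Llama's BOS token, the same <s> that is visible at the start of text_output above. A quick check outside Triton (a sketch; the model path is the one used in the build step, and the expected outputs are my assumption):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "/tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/")
# HF Llama tokenizers prepend BOS (id 1) by default:
print(tok.encode("quickSort"))                            # [1, 4996, 13685], as logged above
print(tok.encode("quickSort", add_special_tokens=False))  # expected: [4996, 13685]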
I add log prints in postprocessing's model.py:
def _postprocessing(self, tokens_batch, sequence_lengths):
    outputs = []
    for batch_idx, beam_tokens in enumerate(tokens_batch):
        for beam_idx, tokens in enumerate(beam_tokens):
            self.logger.log_info(f"================== postprocessing _postprocessing tokens: {tokens}")
            seq_len = sequence_lengths[batch_idx][beam_idx]
            output = self.tokenizer.decode(tokens[:seq_len])
            self.logger.log_info(f"================== postprocessing _postprocessing tokens[:seq_len]: {tokens[:seq_len]}")
            self.logger.log_info(f"================== postprocessing _postprocessing output: {output}")
            outputs.append(output.encode('utf8'))
    return outputs
And here is the postprocessing log output:
I1114 08:45:42.255417 24910 model.py:156] ================== postprocessing _postprocessing tokens: [ 1 822 4996 13685 29898 2749 1125 13 1678 565 7431 29898
2749 29897 5277 29871 29896 29901 13 4706 736 3948 13 1678
1683 29901 13 4706 24438 353 3948 29961 29900 29962 13 4706
3109 261 353 518 29916 363 921 297 3948 29961 29896 17531
565 921 5277 24438 29962 13 4706 7621 353 518 29916 363
921 297 3948 29961 29896 17531 565 921 1405 24438 29962 13
4706 736 4996 13685 29898 2222 261 29897 718 518 29886 11002
29962 718 4996 13685 29898 7979 1008 29897 13 13 13 1753
4996 13685 29906 29898 2749 1125 13 1678]
I1114 08:45:42.255774 24910 model.py:159] ================== postprocessing _postprocessing tokens[:seq_len]: [ 1 822 4996 13685 29898 2749 1125 13 1678 565 7431 29898
2749 29897 5277 29871 29896 29901 13 4706 736 3948 13 1678
1683 29901 13 4706 24438 353 3948 29961 29900 29962 13 4706
3109 261 353 518 29916 363 921 297 3948 29961 29896 17531
565 921 5277 24438 29962 13 4706 7621 353 518 29916 363
921 297 3948 29961 29896 17531 565 921 1405 24438 29962 13
4706 736 4996 13685 29898 2222 261 29897 718 518 29886 11002
29962 718 4996 13685 29898 7979 1008 29897 13 13 13 1753
4996 13685 29906 29898 2749 1125 13 1678]
I1114 08:45:42.255801 24910 model.py:160] ================== postprocessing _postprocessing output: <s> def quickSort(arr):
if len(arr) <= 1:
return arr
else:
pivot = arr[0]
lesser = [x for x in arr[1:] if x <= pivot]
greater = [x for x in arr[1:] if x > pivot]
return quickSort(lesser) + [pivot] + quickSort(greater)
def quickSort2(arr):
As we can see, the stop_words tokens [4996, 13685] appear in the postprocessing output tokens, but inference does not stop early.
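My hypothesis (not a confirmed root cause): because tokenizer.encode() prepends BOS, the stop word is registered as [1, 4996, 13685], and that exact sequence never occurs in the generated ids, so the stop check never fires. A possible one-line change in the preprocessing model's _to_word_list_format, sketched under that assumption:

# Hypothetical fix: encode stop/bad words without special tokens so the
# registered sequence becomes [4996, 13685] instead of [1, 4996, 13685].
ids = self.tokenizer.encode(word, add_special_tokens=False)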