
The stop_words still does not work with the latest tensorrtllm_backend and TensorRT-LLM #128

Open

Description

@activezhao

First:

I download the latest tensorrtllm_backend from the main branch:

git clone -b main  https://github.com/triton-inference-server/tensorrtllm_backend.git

Second:

I execute the following commands to build a Docker image from the main branch of tensorrtllm_backend:

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Third:

I get a docker image like this:

# docker images

REPOSITORY                    TAG                         IMAGE ID       CREATED         SIZE
triton_trt_llm               latest                      cc73de886a6d   5 hours ago     36GB

Fourth:

I launch a container from the triton_trt_llm image:

docker run -idt -p 8250:8000 -p 8251:8001 -p 8252:8002 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /tensorrt:/tensorrtllm_backend triton_trt_llm /bin/sh

Fifth:

Inside the container, I execute the build-tensorrt-llm steps from:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md#build-tensorrt-llm

Sixth:

I build engines with CodeLlama-7b:

python build.py --model_dir /tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/  \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/  \
                --vocab_size 32016  \
                --rotary_base 1000000  \
                --max_batch_size 32  \
                --world_size 4 \
                --tp_size 4

Finally:

I call the deployed model like this; as we can see, stop_words does not work.

curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 100, "bad_words": "", "stop_words": "quickSort"}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"<s> def quickSort(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        lesser = [x for x in arr[1:] if x <= pivot]\n        greater = [x for x in arr[1:] if x > pivot]\n        return quickSort(lesser) + [pivot] + quickSort(greater)\n\n\ndef quickSort2(arr):\n   "
}
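
For scripted reproduction, here is an equivalent request via Python requests (a sketch assuming the same host port mapping, 8250 -> 8000, from the docker run command above):

import requests

# Equivalent of the curl call above; port 8250 on the host maps to Triton's
# HTTP port 8000 inside the container (see the docker run command earlier).
resp = requests.post(
    "http://localhost:8250/v2/models/ensemble/generate",
    json={
        "text_input": "def quickSort",
        "max_tokens": 100,
        "bad_words": "",
        "stop_words": "quickSort",
    },
)
print(resp.json()["text_output"])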

I add log prints in preprocessing's model.py:

    def _to_word_list_format(self, word_dict: List[List[str]]):
        '''
        format of word_dict
            len(word_dict) should be the same as batch_size
            word_dict[i] means the words for batch i
            len(word_dict[i]) must be 1, which means it only contains 1 string
            This string can contain several sentences separated by ",".
            For example, if word_dict[2] = " I am happy, I am sad", then this function will return
            the ids for the two short sentences " I am happy" and " I am sad".
        '''
        assert self.tokenizer is not None, "need to set tokenizer"

        if word_dict is None:
            # Return an empty array of shape (1,2,0)
            return np.empty([1, 2, 0], dtype="int32")

        flat_ids = []
        offsets = []
        for word_dict_item in word_dict:
            item_flat_ids = []
            item_offsets = []

            if isinstance(word_dict_item[0], bytes):
                word_dict_item = [word_dict_item[0].decode()]

            words = list(csv.reader(word_dict_item))[0]
            for word in words:
                self.logger.log_info(f"================== preprocessing _to_word_list_format word: {word}")
                ids = self.tokenizer.encode(word)
                self.logger.log_info(f"================== preprocessing _to_word_list_format ids: {ids}")
                if len(ids) == 0:
                    continue

                item_flat_ids += ids
                item_offsets.append(len(ids))
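
For context, the snippet below is my own minimal reconstruction (not the backend's actual code) of the flat_ids/offsets layout that _to_word_list_format produces, assuming the usual FasterTransformer-style stop-words tensor: row 0 holds the concatenated stop-word token ids, row 1 holds the cumulative end offsets, padded with 0 and -1 respectively:

import numpy as np

def to_word_list(stop_word_ids):
    # Hypothetical illustration of the (batch, 2, max_len) stop-words tensor:
    # row 0 = concatenated token ids, row 1 = cumulative end offsets.
    flat_ids, offsets = [], []
    for ids in stop_word_ids:
        flat_ids += ids
        offsets.append(len(ids))
    offsets = np.cumsum(offsets).tolist()
    pad = max(len(flat_ids), len(offsets))
    flat_ids += [0] * (pad - len(flat_ids))
    offsets += [-1] * (pad - len(offsets))
    return np.array([[flat_ids, offsets]], dtype="int32")

# The ids logged below for "quickSort":
print(to_word_list([[1, 4996, 13685]]))
# [[[    1  4996 13685]
#   [    3    -1    -1]]]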

And here are the logged _to_word_list_format ids:

I1114 08:45:41.055726 24910 python_be.cc:1307] model preprocessing, instance preprocessing_0_0, executing 1 requests
I1114 08:45:41.084479 24910 model.py:255] ================== preprocessing _to_word_list_format word: quickSort
I1114 08:45:41.084553 24910 model.py:257] ================== preprocessing _to_word_list_format ids: [1, 4996, 13685]
I1114 08:45:41.084808 24910 infer_response.cc:167] add response output: output: INPUT_ID, type: INT32, shape: [1,4]
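
Note that the encoded ids [1, 4996, 13685] include the Llama BOS token (id 1), because tokenizer.encode adds special tokens by default; if the runtime matches generated ids against this list verbatim, the leading 1 can never match mid-generation. A possible fix (my assumption, not confirmed) is to encode stop words without special tokens:

# Hypothetical change inside _to_word_list_format: add_special_tokens is a
# standard Hugging Face tokenizer kwarg; without it, Llama tokenizers would
# not prepend the BOS token (id 1) to every encoded word.
ids = self.tokenizer.encode(word, add_special_tokens=False)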

I also add log prints in postprocessing's model.py:

    def _postprocessing(self, tokens_batch, sequence_lengths):
        outputs = []
        for batch_idx, beam_tokens in enumerate(tokens_batch):
            for beam_idx, tokens in enumerate(beam_tokens):
                self.logger.log_info(f"================== postprocessing _postprocessing tokens: {tokens}")
                seq_len = sequence_lengths[batch_idx][beam_idx]
                output = self.tokenizer.decode(tokens[:seq_len])
                self.logger.log_info(f"================== postprocessing _postprocessing tokens[:seq_len]: {tokens[:seq_len]}")
                self.logger.log_info(f"================== postprocessing _postprocessing output: {output}")
                outputs.append(output.encode('utf8'))
        return outputs

And here is the postprocessing _postprocessing output:

I1114 08:45:42.255417 24910 model.py:156] ================== postprocessing _postprocessing tokens: [    1   822  4996 13685 29898  2749  1125    13  1678   565  7431 29898
  2749 29897  5277 29871 29896 29901    13  4706   736  3948    13  1678
  1683 29901    13  4706 24438   353  3948 29961 29900 29962    13  4706
  3109   261   353   518 29916   363   921   297  3948 29961 29896 17531
   565   921  5277 24438 29962    13  4706  7621   353   518 29916   363
   921   297  3948 29961 29896 17531   565   921  1405 24438 29962    13
  4706   736  4996 13685 29898  2222   261 29897   718   518 29886 11002
 29962   718  4996 13685 29898  7979  1008 29897    13    13    13  1753
  4996 13685 29906 29898  2749  1125    13  1678]
I1114 08:45:42.255774 24910 model.py:159] ================== postprocessing _postprocessing tokens[:seq_len]: [    1   822  4996 13685 29898  2749  1125    13  1678   565  7431 29898
  2749 29897  5277 29871 29896 29901    13  4706   736  3948    13  1678
  1683 29901    13  4706 24438   353  3948 29961 29900 29962    13  4706
  3109   261   353   518 29916   363   921   297  3948 29961 29896 17531
   565   921  5277 24438 29962    13  4706  7621   353   518 29916   363
   921   297  3948 29961 29896 17531   565   921  1405 24438 29962    13
  4706   736  4996 13685 29898  2222   261 29897   718   518 29886 11002
 29962   718  4996 13685 29898  7979  1008 29897    13    13    13  1753
  4996 13685 29906 29898  2749  1125    13  1678]
I1114 08:45:42.255801 24910 model.py:160] ================== postprocessing _postprocessing output: <s> def quickSort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        lesser = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quickSort(lesser) + [pivot] + quickSort(greater)

def quickSort2(arr):

As we can see, the stop_words tokens [4996, 13685] appear in the postprocessing output tokens, but inference does not stop early.
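
To make that check explicit, here is a small helper (hypothetical, not part of the backend) that scans the generated ids for the stop-word id sequence:

def find_subsequence(tokens, stop_ids):
    # Return the index where stop_ids first occurs in tokens, or -1.
    n = len(stop_ids)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(stop_ids):
            return i
    return -1

# [4996, 13685] ("quickSort" without the BOS token) occurs at index 2 of the
# output tokens above, so generation should have stopped there but did not.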
