
Errors w/ BPE tokenizers (GGML_ASSERT: llama.cpp:2029: codepoints_from_utf8(word).size() > 0 and more) #4360

Closed
@lhl

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I have a Mistral 7B based model, shisa-7b-v1, that has an extended (120128-token) BPE tokenizer. This works fine, and I have pulled the vocab.json from tokenizer.json (there is also a special_tokens_map.json, and some added_tokens in the tokenizer.json).
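
For reference, pulling the vocab out of tokenizer.json amounts to something like the sketch below. This assumes the usual HF fast-tokenizer layout, with the vocab map under "model" -> "vocab", and uses nlohmann/json; it's a minimal illustration rather than the exact script I used.

    // sketch: dump the BPE vocab from a HF tokenizer.json into a standalone vocab.json
    // assumes the fast-tokenizer layout ("model" -> "vocab") and the nlohmann/json header
    #include <cstdio>
    #include <fstream>
    #include <nlohmann/json.hpp>

    int main(int argc, char ** argv) {
        if (argc < 3) {
            fprintf(stderr, "usage: %s tokenizer.json vocab.json\n", argv[0]);
            return 1;
        }
        std::ifstream in(argv[1]);
        const nlohmann::json tok = nlohmann::json::parse(in);
        std::ofstream(argv[2]) << tok.at("model").at("vocab").dump(2);  // token string -> id map
        return 0;
    }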

I am able to convert the model with --vocabtype bpe with no errors.

And I am actually able to run llama-bench on the model; however, when inferencing, I get this error:

GGML_ASSERT: llama.cpp:2695: codepoints_from_utf8(word).size() > 0
Aborted (core dumped)

Current Behavior

As mentioned, there is an assert here that gets triggered: https://github.com/ggerganov/llama.cpp/blob/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56/llama.cpp#L2695
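
For context, codepoints_from_utf8 decodes the stored token text into Unicode codepoints, so the assert fires when a vocab entry decodes to zero codepoints; for well-formed input that basically means the token text is a zero-length string. A rough illustration of the idea (not the actual llama.cpp decoder):

    // rough illustration, not the actual llama.cpp decoder:
    // decoding an empty token string yields zero codepoints, which is what the assert rejects
    #include <cassert>
    #include <cstdint>
    #include <string>
    #include <vector>

    static std::vector<uint32_t> decode_utf8(const std::string & s) {
        std::vector<uint32_t> cps;
        for (size_t i = 0; i < s.size(); ) {
            const unsigned char b = s[i];
            // sequence length from the lead byte (0 = malformed)
            const size_t n = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 0;
            if (n == 0 || i + n > s.size()) break;  // stop on malformed input
            uint32_t cp = n == 1 ? b : b & (0xFF >> (n + 1));
            for (size_t j = 1; j < n; ++j) cp = (cp << 6) | (s[i + j] & 0x3F);
            cps.push_back(cp);
            i += n;
        }
        return cps;
    }

    int main() {
        assert(!decode_utf8("▁7.5J").empty());  // a normal token decodes fine
        assert( decode_utf8("").empty());       // an empty vocab entry trips GGML_ASSERT(... > 0)
    }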

I did a bit of poking and ended up hacking in a replacement token just to see if I could make it go:

        // GGML_ASSERT(codepoints_from_utf8(word).size() > 0);
        if (codepoints_from_utf8(word).empty()) {
            std::stringstream ss;
            for (unsigned char c : word) {  // Ensure char is treated as unsigned
                ss << std::hex << static_cast<int>(c) << " ";  // Convert each byte to hex
            }
            LLAMA_LOG_WARN("%s: Word '%s' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌\n", __func__, ss.str().c_str());
            word = "❌❌";
        }

I tried to get at the codepoints, and it turns out the assert only triggers once, but the offending word sadly seems to be a literal null/empty string:

llm_load_vocab: Word '' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌         
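
If it helps anyone reproduce this, here is a variant of the hack that also logs which vocab entry is affected; it assumes the enclosing loop counter in llm_load_vocab is named i, so adjust as needed:

        if (codepoints_from_utf8(word).empty()) {
            // 'i' is assumed to be the enclosing vocab-loop counter in llm_load_vocab
            LLAMA_LOG_WARN("%s: vocab entry %u is empty or not valid UTF-8 (%zu bytes), replacing\n",
                    __func__, i, word.size());
            word = "❌❌";
        }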

Sadly, this was not the only error. Once things got running, this got output as well:

llm_load_vocab: mismatch in special tokens definition ( 1087/120128 vs 55/120128 ).                                                                                                                                                        

That's an awful lot of special tokens? (There are only 4 in our special_tokens_map.json...)

I modified the code to print out what tokens it thought were issues:

@@ -2811,9 +2827,11 @@ static void llm_load_vocab(
                         // Count manually found special tokens
                         special_tokens_count_from_verification++;
  
+
                         // If this manually found special token is not marked as such, flag a mismatch
                         if (vocab.id_to_token[id].type == LLAMA_TOKEN_TYPE_NORMAL) {
                             special_tokens_definition_mismatch = true;
+                            LLAMA_LOG_WARN("%s: Special token mismatch for token '%s'. Expected special, found normal.\n", __func__, vocab.id_to_token[id].text.c_str());
                         }
                     }
                 }

It prints out lots of regular tokens; I'm not sure why it's expecting special tokens:

llm_load_vocab: Special token mismatch for token '▁7.5J▁7.50-18'. Expected special, found normal.                                                                                                                                          
llm_load_vocab: Special token mismatch for token 'MG150'. Expected special, found normal.                            
llm_load_vocab: Special token mismatch for token 'デイトナ▁1165'. Expected special, found normal.                                                                                                                                          
llm_load_vocab: Special token mismatch for token '(火)▁22'. Expected special, found normal.                                                                                                                                                
llm_load_vocab: Special token mismatch for token '平成24'. Expected special, found normal.                           
llm_load_vocab: Special token mismatch for token '18V'. Expected special, found normal.                              
llm_load_vocab: Special token mismatch for token '001▁概要▁仕様書▁'. Expected special, found normal.                                                                                                                                       
llm_load_vocab: Special token mismatch for token '分(▁02/'. Expected special, found normal.                         
llm_load_vocab: Special token mismatch for token '(火)▁23'. Expected special, found normal.                          
llm_load_vocab: Special token mismatch for token '7750搭載'. Expected special, found normal.                         
llm_load_vocab: Special token mismatch for token 'USB▁3.0'. Expected special, found normal.                          
...
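
Digging into the surrounding code a bit, my (possibly wrong) read is that the "manually found" side of that count comes from a heuristic along these lines: any token longer than one character that cannot be split into two halves which are both themselves vocab entries is counted as special, on the theory that a plain BPE merge should always be buildable from two existing tokens. My paraphrase of that check, not the exact llama.cpp code:

    // my paraphrase of the verification heuristic, not the exact llama.cpp code:
    // a multi-character token that cannot be split into two substrings that are
    // both vocab entries gets counted as a "manually found" special token
    #include <string>
    #include <unordered_map>

    static bool looks_special(const std::string & token,
                              const std::unordered_map<std::string, int> & token_to_id) {
        if (token.length() <= 1) {
            return false;
        }
        for (size_t i = 1; i < token.length(); ++i) {
            const std::string left  = token.substr(0, i);
            const std::string right = token.substr(i);
            if (token_to_id.count(left) && token_to_id.count(right)) {
                return false;  // reachable by merging two existing tokens -> treated as normal
            }
        }
        return true;  // no two-way split works -> counted as "special"
    }

If that reading is right, it would explain the huge count: with a 120k vocab full of long Japanese tokens, plenty of ordinary entries apparently fail that two-way split test and get flagged even though they were never meant to be special.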

Once everything is loaded we get:

<s>Hello world: terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at
Aborted (core dumped)
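
I haven't traced this one, but unordered_map::at inside the tokenizer most plausibly points at a token_to_id lookup (e.g. a byte-fallback path) being handed a string that simply isn't in the map. Just to illustrate what I mean, a defensive lookup like the sketch below (stand-in names, not a proposed fix) would at least report the missing piece instead of aborting:

    // sketch with stand-in names (token_to_id, unk_id), not the real call site:
    // report the missing key instead of letting unordered_map::at() abort the process
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    static int lookup_token(const std::unordered_map<std::string, int> & token_to_id,
                            const std::string & piece, int unk_id) {
        const auto it = token_to_id.find(piece);
        if (it == token_to_id.end()) {
            fprintf(stderr, "tokenizer: piece '%s' (%zu bytes) not in vocab, falling back to UNK\n",
                    piece.c_str(), piece.size());
            return unk_id;
        }
        return it->second;
    }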

But I didn't follow up more, since the problem seems to lie somewhere in the conversion or in the token-handling code, and that's beyond my ken.

Note: I found two related open discussions/issues (but I don't think they got past the initial assert); I believe they are both models that use extended BPE tokenizers:

These issues seem to have been unresolved for a few months, but I'm reporting this new issue since hopefully it sheds some more light on what might be going on. Maybe the BPE conversion is actually broken?

In our base model card, we actually have a list of models using other tokenizers so that might also help in tracking down issues. StableLM Beta JAVocab and CALM2-7B are two more Llama2 models using non-standard tokenizers.

Environment and Context

  • Arch Linux, RTX 4090/3090
  • python 3.11.6 in a custom mamba env for the conversion (just the requirements* from the repo installed)

I can relay more info if this isn't reproducible, but I don't think that's the issue.
