
Errors w/ BPE tokenizers (GGML_ASSERT: llama.cpp:2029: codepoints_from_utf8(word).size() > 0 and more) #4360

Closed
@lhl

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I have a Mistral 7B based model, shisa-7b-v1, that has an extended (120128-token) BPE tokenizer. This works fine, and I have pulled the vocab.json from tokenizer.json (there is also a special_tokens_map.json, and some added_tokens in the tokenizer.json).
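
For reference, pulling the vocab out of tokenizer.json amounts to something like the sketch below. This assumes the usual HF fast-tokenizer layout, with the vocab map under "model" -> "vocab", and uses nlohmann/json; it's a minimal illustration rather than the exact script I used.

    // sketch: dump the BPE vocab from a HF tokenizer.json into a standalone vocab.json
    // assumes the fast-tokenizer layout ("model" -> "vocab") and the nlohmann/json header
    #include <cstdio>
    #include <fstream>
    #include <nlohmann/json.hpp>

    int main(int argc, char ** argv) {
        if (argc < 3) {
            fprintf(stderr, "usage: %s tokenizer.json vocab.json\n", argv[0]);
            return 1;
        }
        std::ifstream in(argv[1]);
        const nlohmann::json tok = nlohmann::json::parse(in);
        std::ofstream(argv[2]) << tok.at("model").at("vocab").dump(2);  // token string -> id map
        return 0;
    }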

I am able to convert the model with --vocabtype bpe with no errors.

And I am actually able to run llama-bench on the model; however, when inferencing, I get this error:

GGML_ASSERT: llama.cpp:2695: codepoints_from_utf8(word).size() > 0
Aborted (core dumped)

Current Behavior

As mentioned, there is an assert here that gets triggered: https://github.com/ggerganov/llama.cpp/blob/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56/llama.cpp#L2695
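
For context, codepoints_from_utf8 decodes the stored token text into Unicode codepoints, so the assert fires when a vocab entry decodes to zero codepoints; for well-formed input that basically means the token text is a zero-length string. A rough illustration of the idea (not the actual llama.cpp decoder):

    // rough illustration, not the actual llama.cpp decoder:
    // decoding an empty token string yields zero codepoints, which is what the assert rejects
    #include <cassert>
    #include <cstdint>
    #include <string>
    #include <vector>

    static std::vector<uint32_t> decode_utf8(const std::string & s) {
        std::vector<uint32_t> cps;
        for (size_t i = 0; i < s.size(); ) {
            const unsigned char b = s[i];
            // sequence length from the lead byte (0 = malformed)
            const size_t n = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2 : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 0;
            if (n == 0 || i + n > s.size()) break;  // stop on malformed input
            uint32_t cp = n == 1 ? b : b & (0xFF >> (n + 1));
            for (size_t j = 1; j < n; ++j) cp = (cp << 6) | (s[i + j] & 0x3F);
            cps.push_back(cp);
            i += n;
        }
        return cps;
    }

    int main() {
        assert(!decode_utf8("▁7.5J").empty());  // a normal token decodes fine
        assert( decode_utf8("").empty());       // an empty vocab entry trips GGML_ASSERT(... > 0)
    }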

I did a bit of poking and ended up hacking in a replacement token just to see if I could make it go:

        // GGML_ASSERT(codepoints_from_utf8(word).size() > 0);
        if (codepoints_from_utf8(word).empty()) {
            std::stringstream ss;
            for (unsigned char c : word) {  // Ensure char is treated as unsigned
                ss << std::hex << static_cast<int>(c) << " ";  // Convert each byte to hex
            }
            LLAMA_LOG_WARN("%s: Word '%s' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌\n", __func__, ss.str().c_str());
            word = "❌❌";
        }

I tried to get at the codepoints, and it turns out the assert only triggers once, but the offending word sadly seems to be a literal null/empty string:

llm_load_vocab: Word '' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌         
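
If it helps anyone reproduce this, here is a variant of the hack that also logs which vocab entry is affected; it assumes the enclosing loop counter in llm_load_vocab is named i, so adjust as needed:

        if (codepoints_from_utf8(word).empty()) {
            // 'i' is assumed to be the enclosing vocab-loop counter in llm_load_vocab
            LLAMA_LOG_WARN("%s: vocab entry %u is empty or not valid UTF-8 (%zu bytes), replacing\n",
                    __func__, i, word.size());
            word = "❌❌";
        }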

Sadly, this was not the only error. Once things got running, this got output as well:

llm_load_vocab: mismatch in special tokens definition ( 1087/120128 vs 55/120128 ).                                                                                                                                                        

That's an awful lot of special tokens? (There are only 4 in our special_tokens_map.json...)

I modified the code to print out what tokens it thought were issues:

@@ -2811,9 +2827,11 @@ static void llm_load_vocab(
                         // Count manually found special tokens
                         special_tokens_count_from_verification++;
  
+
                         // If this manually found special token is not marked as such, flag a mismatch
                         if (vocab.id_to_token[id].type == LLAMA_TOKEN_TYPE_NORMAL) {
                             special_tokens_definition_mismatch = true;
+                            LLAMA_LOG_WARN("%s: Special token mismatch for token '%s'. Expected special, found normal.\n", __func__, vocab.id_to_token[id].text.c_str());
                         }
                     }
                 }

It prints out lots of regular tokens; I'm not sure why it's expecting special tokens:

llm_load_vocab: Special token mismatch for token '▁7.5J▁7.50-18'. Expected special, found normal.                                                                                                                                          
llm_load_vocab: Special token mismatch for token 'MG150'. Expected special, found normal.                            
llm_load_vocab: Special token mismatch for token 'デイトナ▁1165'. Expected special, found normal.                                                                                                                                          
llm_load_vocab: Special token mismatch for token '(火)▁22'. Expected special, found normal.                                                                                                                                                
llm_load_vocab: Special token mismatch for token '平成24'. Expected special, found normal.                           
llm_load_vocab: Special token mismatch for token '18V'. Expected special, found normal.                              
llm_load_vocab: Special token mismatch for token '001▁概要▁仕様書▁'. Expected special, found normal.                                                                                                                                       
llm_load_vocab: Special token mismatch for token '分(▁02/'. Expected special, found normal.                         
llm_load_vocab: Special token mismatch for token '(火)▁23'. Expected special, found normal.                          
llm_load_vocab: Special token mismatch for token '7750搭載'. Expected special, found normal.                         
llm_load_vocab: Special token mismatch for token 'USB▁3.0'. Expected special, found normal.                          
...
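
Digging into the surrounding code a bit, my (possibly wrong) read is that the "manually found" side of that count comes from a heuristic along these lines: any token longer than one character that cannot be split into two halves which are both themselves vocab entries is counted as special, on the theory that a plain BPE merge should always be buildable from two existing tokens. My paraphrase of that check, not the exact llama.cpp code:

    // my paraphrase of the verification heuristic, not the exact llama.cpp code:
    // a multi-character token that cannot be split into two substrings that are
    // both vocab entries gets counted as a "manually found" special token
    #include <string>
    #include <unordered_map>

    static bool looks_special(const std::string & token,
                              const std::unordered_map<std::string, int> & token_to_id) {
        if (token.length() <= 1) {
            return false;
        }
        for (size_t i = 1; i < token.length(); ++i) {
            const std::string left  = token.substr(0, i);
            const std::string right = token.substr(i);
            if (token_to_id.count(left) && token_to_id.count(right)) {
                return false;  // reachable by merging two existing tokens -> treated as normal
            }
        }
        return true;  // no two-way split works -> counted as "special"
    }

If that reading is right, it would explain the huge count: with a 120k vocab full of long Japanese tokens, plenty of ordinary entries apparently fail that two-way split test and get flagged even though they were never meant to be special.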

Once everything is loaded we get:

<s>Hello world: terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at
Aborted (core dumped)
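
I haven't traced this one, but unordered_map::at inside the tokenizer most plausibly points at a token_to_id lookup (e.g. a byte-fallback path) being handed a string that simply isn't in the map. Just to illustrate what I mean, a defensive lookup like the sketch below (stand-in names, not a proposed fix) would at least report the missing piece instead of aborting:

    // sketch with stand-in names (token_to_id, unk_id), not the real call site:
    // report the missing key instead of letting unordered_map::at() abort the process
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    static int lookup_token(const std::unordered_map<std::string, int> & token_to_id,
                            const std::string & piece, int unk_id) {
        const auto it = token_to_id.find(piece);
        if (it == token_to_id.end()) {
            fprintf(stderr, "tokenizer: piece '%s' (%zu bytes) not in vocab, falling back to UNK\n",
                    piece.c_str(), piece.size());
            return unk_id;
        }
        return it->second;
    }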

But I didn't follow up more, since the problem seems to lie somewhere in the conversion or in the token-handling code, and that's beyond my ken.

Note: I found two related open discussions/issues (but I don't think they got past the initial assert); I believe they are both models that use extended BPE tokenizers:

These issues seem to have been unresolved for a few months, but I'm reporting this new issue since hopefully it sheds some more light on what might be going on. Maybe the BPE conversion is actually broken?

In our base model card, we actually have a list of models using other tokenizers so that might also help in tracking down issues. StableLM Beta JAVocab and CALM2-7B are two more Llama2 models using non-standard tokenizers.

Environment and Context

  • Arch Linux, RTX 4090/3090
  • python 3.11.6 in a custom mamba env for the conversion (just the requirements* from the repo installed)

I can relay more info if this isn't reproducible, but I don't think that's the issue.
