Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I have a Mistral 7B based model, shisa-7b-v1, with an extended (128128-entry) BPE tokenizer. This works fine, and I have pulled the vocab.json out of tokenizer.json (there is also a special_tokens_map.json, and some added_tokens inside tokenizer.json). I am able to convert the model with --vocabtype bpe with no errors.
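For reference, pulling the vocab out of tokenizer.json just means extracting the model.vocab map; a rough standalone sketch of that prep step (assuming nlohmann/json is available; the model.vocab layout is the HF tokenizers format, and this is not part of llama.cpp):

```cpp
// Sketch: pull the BPE vocab out of tokenizer.json (HF tokenizers layout).
// Assumes nlohmann/json; just illustrating the prep step, not llama.cpp code.
#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    std::ifstream in("tokenizer.json");
    nlohmann::json tok = nlohmann::json::parse(in);

    // For BPE tokenizers, "model.vocab" maps token text -> id.
    nlohmann::json vocab = tok["model"]["vocab"];

    std::ofstream out("vocab.json");
    out << vocab.dump(2) << std::endl;

    std::cout << "wrote " << vocab.size() << " entries to vocab.json" << std::endl;
    return 0;
}
```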
And I am actually able to run llama_bench on the model; however, when inferencing I get this error:
```
GGML_ASSERT: llama.cpp:2695: codepoints_from_utf8(word).size() > 0
Aborted (core dumped)
```
Current Behavior
As mentioned, there is an assert here that gets triggered: https://github.com/ggerganov/llama.cpp/blob/bcc0eb4591bec5ec02fad3f2bdcb1b265052ea56/llama.cpp#L2695
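For context, that check runs over every word in the vocab at load time and requires that each one decodes to at least one Unicode codepoint. A minimal stand-in (my own simplification, not the actual llama.cpp helper) shows when the result comes back empty:

```cpp
// Simplified stand-in for the check the assert performs -- NOT the real
// llama.cpp helper, just enough to show when the result comes back empty.
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

static std::vector<uint32_t> codepoints_from_utf8_sketch(const std::string & word) {
    std::vector<uint32_t> cps;
    for (size_t i = 0; i < word.size(); ) {
        const unsigned char c = word[i];
        size_t len = 0;
        if      ( c < 0x80)          len = 1; // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4; // 11110xxx
        if (len == 0 || i + len > word.size()) {
            return {}; // malformed sequence: report "no codepoints" in this sketch
        }
        cps.push_back(c); // only the lead byte; we only care about the count here
        i += len;
    }
    return cps;
}

int main() {
    assert(!codepoints_from_utf8_sketch("hello").empty());
    assert( codepoints_from_utf8_sketch("").empty());     // an empty vocab entry trips the assert
    assert( codepoints_from_utf8_sketch("\xC0").empty()); // truncated/invalid UTF-8 does too
}
```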
I did a bit of poking and ended up hacking in a replacement token just to see if I could make it go:
```cpp
// GGML_ASSERT(codepoints_from_utf8(word).size() > 0);
if (codepoints_from_utf8(word).empty()) {
    std::stringstream ss;
    for (unsigned char c : word) { // Ensure char is treated as unsigned
        ss << std::hex << static_cast<int>(c) << " "; // Convert each byte to hex
    }
    LLAMA_LOG_WARN("%s: Word '%s' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌\n", __func__, ss.str().c_str());
    word = "❌❌";
}
```
I poked at the offending word, and it turns out this only triggers once, but sadly the word seems to be a literal null character:
```
llm_load_vocab: Word '' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌
```
Sadly this was not the only error. Once things got running, this gets output as well:
```
llm_load_vocab: mismatch in special tokens definition ( 1087/120128 vs 55/120128 ).
```
That's an awful lot of special tokens (there are only 4 in our special_tokens_map.json).
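For what it's worth, the HF-side definition of "special" should just be the entries in special_tokens_map.json plus any added_tokens flagged "special": true in tokenizer.json. A rough cross-check sketch of that count (again assuming nlohmann/json; this reads the HF files, not the GGUF):

```cpp
// Sketch: count how many tokens the HF-side files actually declare as special,
// for comparison with the 1087 vs 55 counts above. Assumes nlohmann/json.
// (Lists such as additional_special_tokens are skipped for brevity.)
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    std::set<std::string> special;

    // special_tokens_map.json: values are strings or {"content": ...} objects.
    nlohmann::json stm = nlohmann::json::parse(std::ifstream("special_tokens_map.json"));
    for (const auto & item : stm.items()) {
        const auto & val = item.value();
        if (val.is_string()) {
            special.insert(val.get<std::string>());
        } else if (val.is_object() && val.contains("content")) {
            special.insert(val["content"].get<std::string>());
        }
    }

    // tokenizer.json: added_tokens entries carry an explicit "special" flag.
    nlohmann::json tok = nlohmann::json::parse(std::ifstream("tokenizer.json"));
    for (const auto & t : tok["added_tokens"]) {
        if (t.value("special", false)) {
            special.insert(t["content"].get<std::string>());
        }
    }

    std::cout << "special tokens declared on the HF side: " << special.size() << std::endl;
    return 0;
}
```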
I modified the code to print out what tokens it thought were issues:
```diff
@@ -2811,9 +2827,11 @@ static void llm_load_vocab(
                 // Count manually found special tokens
                 special_tokens_count_from_verification++;
+
                 // If this manually found special token is not marked as such, flag a mismatch
                 if (vocab.id_to_token[id].type == LLAMA_TOKEN_TYPE_NORMAL) {
                     special_tokens_definition_mismatch = true;
+                    LLAMA_LOG_WARN("%s: Special token mismatch for token '%s'. Expected special, found normal.\n", __func__, vocab.id_to_token[id].text.c_str());
                 }
             }
         }
```
It prints out lots of regular tokens; I'm not sure why it expects them to be special:
```
llm_load_vocab: Special token mismatch for token '▁7.5J▁7.50-18'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token 'MG150'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token 'デイトナ▁1165'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '(火)▁22'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '平成24'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '18V'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '001▁概要▁仕様書▁'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '分(▁02/'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '(火)▁23'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token '7750搭載'. Expected special, found normal.
llm_load_vocab: Special token mismatch for token 'USB▁3.0'. Expected special, found normal.
...
```
Once everything is loaded we get:
```
<s>Hello world: terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at
Aborted (core dumped)
```
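I haven't tracked down which lookup this is, but std::out_of_range from unordered_map::at just means some token (or piece) key isn't present in one of the maps; a trivial illustration of the failure mode (which map/key is involved in llama.cpp I have not confirmed):

```cpp
// Trivial illustration: .at() on a missing key throws std::out_of_range,
// which is what terminates the run above. Only the mechanism is shown here.
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> token_to_id = {{"<s>", 1}, {"</s>", 2}};
    try {
        int id = token_to_id.at("▁missing▁piece"); // key never inserted
        std::cout << id << std::endl;
    } catch (const std::out_of_range & e) {
        std::cout << "what(): " << e.what() << std::endl; // "unordered_map::at" in libstdc++
    }
}
```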
But I didn't follow up further, since whatever is messed up, somewhere in either the conversion or the token handling code, is beyond my ken.
Note: I found two related open discussions/issues (though I don't think they got past the initial assert); I believe both involve models that use extended BPE tokenizers:
- [Feature Request] Support InternLM Deploy #3133 - w/ InternLM they can convert, then get the ASSERT error
- How to convert to gguf format with tokenizer.json file? #3498 (reply in thread) - In this discussion on ELYZA-japanese-Llama-2-7b-fast-instruct, a Japanese Llama 2 model with an extended tokenizer (ours is extended similarly to theirs), they report the same ASSERT error
These issues seem to have been unresolved for a few months, but I'm reporting this new one in the hope that it sheds some more light on what might be going on. Maybe the BPE conversion is actually broken?
In our base model card, we actually have a list of models using other tokenizers so that might also help in tracking down issues. StableLM Beta JAVocab and CALM2-7B are two more Llama2 models using non-standard tokenizers.
Environment and Context
- Arch Linux, RTX 4090/3090
- python 3.11.6 in a custom mamba env for the conversion (just the requirements* from the repo installed)
I can relay more info if this is not reproducible, but I don't think the environment is the issue.