Some models are extremely sensitive to the prompt format; without the correct format they generate gibberish.
beam-search calls `llama_tokenize` with `parse_special = false`. Once I switched that to `true`, the special tokens in my prompts were parsed correctly and it generated reasonable output.
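For context, here is a minimal sketch of the difference, assuming the `llama_tokenize` helper overloads from `common.h`; the model path and ChatML prompt are placeholders:

```cpp
// Minimal sketch (model path and prompt are placeholders). With
// parse_special = false, a marker like "<|im_start|>" is split into
// ordinary text tokens; with parse_special = true it maps to its
// single special token id.
#include "common.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    const std::string prompt = "<|im_start|>user\nHello<|im_end|>\n";

    // add_special = true adds BOS/EOS as configured; only parse_special is toggled.
    std::vector<llama_token> plain   = llama_tokenize(model, prompt, true, /*parse_special=*/false);
    std::vector<llama_token> special = llama_tokenize(model, prompt, true, /*parse_special=*/true);

    printf("parse_special=false -> %zu tokens\n", plain.size());
    printf("parse_special=true  -> %zu tokens\n", special.size());

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```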
`parse_special` is also set to `false` in the imatrix generation. As a result, calibration samples built from common chat and instruction datasets in the model's prompt format are not tokenized the way the model sees them during regular inference. Shouldn't it matter that zero real prompt formats were evaluated for the imatrix generation?
To get a better idea of the impact, I tested this with the perplexity measurement, which also does not parse special tokens. In a quick ChatML test with CodeQwen-1.5, perplexity went up by 40% once special tokens were parsed. Maybe that is due to the raw chunking, which evaluates multiple prompts at once and breaks them in the middle (see the sketch below)?
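To illustrate the chunking concern, here is a simplified sketch; the function name and window handling are illustrative, not the actual perplexity code. The whole corpus is tokenized once and split into fixed-size windows with no regard for prompt boundaries, so a chat-formatted sample can be cut mid-conversation:

```cpp
// Simplified illustration of fixed-size chunking over a tokenized corpus
// (illustrative only, not the actual perplexity.cpp code). Prompt
// boundaries are ignored, so a chat-formatted sample can be split
// mid-conversation, which may distort PPL once special tokens are parsed.
#include <cstdint>
#include <vector>

using llama_token = int32_t;

std::vector<std::vector<llama_token>> chunk_tokens(
        const std::vector<llama_token> & tokens, size_t n_ctx) {
    std::vector<std::vector<llama_token>> chunks;
    for (size_t i = 0; i + n_ctx <= tokens.size(); i += n_ctx) {
        // Each window is exactly n_ctx tokens; any trailing remainder
        // and any prompt structure inside the stream are ignored.
        chunks.emplace_back(tokens.begin() + i, tokens.begin() + i + n_ctx);
    }
    return chunks;
}
```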
Side note: tokenization took 500x longer with `parse_special = true`.
It might be worth investigating why the PPL went up when special tokens were enabled, and whether special token parsing could improve imatrix results. A reason why it might be disabled is stated here.