Bug: Or Feature? BPE Tokenization mutates whitespaces into double-whitespace tokens when add_prefix_space is true (default) #8023

Closed
@cmp-nct

Description

What happened?

This has already been discussed to some extent in #7938.

Tokenizing <|assistant|> followed by a single space gives:

32001 -> '<|assistant|>'
   259 -> '  '

Tokenizing <|assistant|>\n gives:

32001 -> '<|assistant|>'
29871 -> ' '
    13 -> '\n'

What happens is that the single space following a special token is turned into a double-space token (259), because add_prefix_space is triggered in llama.cpp when a special token is encountered.

In the second example, the template actually wants a \n directly after <|assistant|>, but this special-token behavior sneaks a space (29871) in between.

Is this intended behavior, and is it correct?

When running Phi-3 and asking for a generation after <|assistant|>, the model is adamant about responding with a whitespace token or a combined token that starts with whitespace.
When add_prefix_space is disabled and a \n is added after <|assistant|>, the issue is resolved and Phi-3 responds right away with normal text.
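
The effect described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not llama.cpp's actual tokenizer: `toy_tokenize`, `VOCAB`, and `SPECIAL` are hypothetical names, the vocabulary contains only the four token IDs quoted in this issue, and the greedy longest-match loop stands in for the real merge rules. The one behavior it is meant to illustrate is the assumed rule that a space is prepended to the text fragment following a special token when add_prefix_space is enabled.

```python
import re

# Toy vocabulary built from the token IDs quoted in this issue only.
VOCAB = {"<|assistant|>": 32001, "  ": 259, " ": 29871, "\n": 13}
SPECIAL = ["<|assistant|>"]


def toy_tokenize(text, add_prefix_space=True):
    """Greedy longest-match over VOCAB (hypothetical helper, not llama.cpp code)."""
    pattern = "(" + "|".join(re.escape(s) for s in SPECIAL) + ")"
    ids = []
    for fragment in re.split(pattern, text):
        if not fragment:
            continue
        if fragment in SPECIAL:
            ids.append(VOCAB[fragment])
            continue
        # Assumed behavior under discussion: every plain-text fragment,
        # including one that follows a special token, gets a leading space.
        if add_prefix_space:
            fragment = " " + fragment
        i = 0
        while i < len(fragment):
            for length in (2, 1):  # try the two-character piece first
                piece = fragment[i:i + length]
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    i += len(piece)
                    break
            else:
                raise ValueError(f"no toy token for {fragment[i]!r}")
    return ids


print(toy_tokenize("<|assistant|> "))    # [32001, 259]
print(toy_tokenize("<|assistant|>\n"))   # [32001, 29871, 13]
print(toy_tokenize("<|assistant|>\n", add_prefix_space=False))  # [32001, 13]
```

With the prefix space turned off, the newline directly follows the special token, which matches the template's intent and the workaround reported above.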

Name and Version

ba58993

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

Metadata


Labels: bug-unconfirmed, low severity, stale
