Closed
CTranslate2 is a "competitor" to llama.cpp that advertises itself with:
Fast and efficient execution on CPU and GPU
The execution is significantly faster and requires less resources than general-purpose deep learning frameworks on supported models and tasks thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanism, etc.
I am no expert in LLMs and I don't know what these optimizations involve in detail, so I am asking: would it be feasible and desirable to implement them in llama.cpp or GGML?
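To make the question more concrete, here is a rough sketch of what "layer fusion" is generally understood to mean: combining consecutive operations (here a matrix-vector product, a bias add, and a ReLU) into a single pass so that no intermediate buffers are written between them. The function names `linear_unfused` and `linear_fused` are hypothetical and written only for illustration; this is not code from CTranslate2, llama.cpp, or GGML, and real implementations fuse at the kernel level rather than in Python.

```python
def linear_unfused(x, w, b):
    """Three separate passes, each producing an intermediate list."""
    # Pass 1: matrix-vector product
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    # Pass 2: bias add
    y = [yi + bi for yi, bi in zip(y, b)]
    # Pass 3: ReLU activation
    return [max(0.0, yi) for yi in y]


def linear_fused(x, w, b):
    """Same computation in one pass: no intermediate lists between steps."""
    return [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
        for row, bi in zip(w, b)
    ]


# Toy input (made-up values, chosen only to exercise both ReLU branches)
x = [1.0, 2.0]
w = [[1.0, -1.0], [0.5, 0.5]]
b = [0.5, 1.0]

unfused = linear_unfused(x, w, b)
fused = linear_fused(x, w, b)
print(unfused == fused)
```

Both functions compute the same values; the fused version only avoids materializing the intermediate results, which is where the memory-traffic saving would come from in a real kernel.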