Deduplicated dequantization code #1453

JohannesGaessler · 2023-05-14T15:08:28Z

For my GPU acceleration PR #1412 I used a template to decouple the code for matrix-vector multiplication from the code for dequantization. This PR applies the same principle to the dequantization during prompt processing: the same dequantization kernels can be reused with a different template. This removes duplicated CUDA code, so both code paths stay consistent. As a side effect the new kernels are also slightly faster on my hardware.
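
To illustrate the idea, here is a minimal host-side C++ sketch of the pattern, not the actual CUDA code from this PR. The block layout `BlockQ4`, the function-pointer type `dequantize_fn`, and the names `dequantize_q4`, `dequantize_row`, `dequantize_dot` and `example_difference` are all hypothetical and chosen only for this example. The point is that one dequantization routine is passed as a template parameter to two different consumers: one writes plain floats (as needed for prompt processing), the other fuses dequantization into a dot product (as in matrix-vector multiplication).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK = 32; // values per block (assumed for this sketch)

// Hypothetical simplified 4-bit block: one float scale plus 32 packed quants.
// Assumed layout: low nibble = even element, high nibble = odd element.
struct BlockQ4 {
    float   d;           // scale
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
};

// Common signature: dequantize the two values stored in byte iqs of block ib.
using dequantize_fn = void (*)(const void * vx, int ib, int iqs, float & v0, float & v1);

static void dequantize_q4(const void * vx, int ib, int iqs, float & v0, float & v1) {
    const BlockQ4 * x = static_cast<const BlockQ4 *>(vx);
    const uint8_t q = x[ib].qs[iqs];
    v0 = (static_cast<int>(q & 0x0F) - 8) * x[ib].d;
    v1 = (static_cast<int>(q >> 4)   - 8) * x[ib].d;
}

// Consumer 1: write the dequantized values to a float buffer.
template <int qk, dequantize_fn dequantize>
void dequantize_row(const void * vx, float * y, int k) {
    assert(k % qk == 0);
    for (int i = 0; i < k; i += 2) {
        float v0, v1;
        dequantize(vx, i / qk, (i % qk) / 2, v0, v1);
        y[i]     = v0;
        y[i + 1] = v1;
    }
}

// Consumer 2: dot product with a float vector, dequantizing on the fly.
template <int qk, dequantize_fn dequantize>
float dequantize_dot(const void * vx, const float * y, int k) {
    assert(k % qk == 0);
    float sum = 0.0f;
    for (int i = 0; i < k; i += 2) {
        float v0, v1;
        dequantize(vx, i / qk, (i % qk) / 2, v0, v1);
        sum += v0 * y[i] + v1 * y[i + 1];
    }
    return sum;
}

// Usage: both consumers share dequantize_q4, so their results must agree.
float example_difference() {
    const int nblocks = 2;
    const int k = nblocks * QK;
    std::vector<BlockQ4> x(nblocks);
    for (int b = 0; b < nblocks; ++b) {
        x[b].d = 0.5f * (b + 1);
        for (int j = 0; j < QK / 2; ++j) {
            x[b].qs[j] = static_cast<uint8_t>((j * 17 + b) & 0xFF);
        }
    }
    std::vector<float> y(k, 1.0f);
    std::vector<float> w(k);
    dequantize_row<QK, dequantize_q4>(x.data(), w.data(), k);
    float ref = 0.0f;
    for (int i = 0; i < k; ++i) {
        ref += w[i] * y[i];
    }
    return std::fabs(ref - dequantize_dot<QK, dequantize_q4>(x.data(), y.data(), k));
}
```

Supporting another quantization format in this sketch would only require one more function with the `dequantize_fn` signature; neither consumer would need to change.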

Performance numbers for perplexity calculation on the first 100 lines of wikitext:

| GPU | Model | ms/t (master) | ms/t (PR) |
|---|---|---|---|
| RTX 3090 | 7b q4_0 | 3.61 | 3.60 |
| RTX 3090 | 7b q4_1 | 3.82 | 3.74 |
| RTX 3090 | 7b q5_0 | 3.75 | 3.64 |
| RTX 3090 | 7b q5_1 | 3.75 | 3.67 |
| RTX 3090 | 7b q8_0 | 4.29 | 4.05 |
| RTX 3090 | 7b f16 | 4.91 | 4.86 |
| GTX 1070 | 7b q4_0 | 9.78 | 7.39 |
| GTX 1070 | 7b q4_1 | 9.86 | 7.67 |
| GTX 1070 | 7b q5_0 | 10.01 | 7.63 |
| GTX 1070 | 7b q5_1 | 10.12 | 7.79 |
| GTX 1070 | 7b q8_0 | 11.88 | 8.28 |
| GTX 1070 | 7b f16 | 10.62 | 10.69 |

~~I will also add GTX 1070 numbers once I have them.~~ Done.

The goal of this PR is not to optimize performance. The goal is to simplify the code base to allow for easier development. If the new kernels don't cause a performance regression I consider that good enough.

ggerganov · 2023-05-14T18:17:28Z

I don't see any performance degradation on an RTX 4080.

Deduplicated dequantization code

0e8ca77

JohannesGaessler added the refactoring Refactoring label May 14, 2023

ggerganov requested a review from slaren May 14, 2023 18:17

slaren approved these changes May 14, 2023

ggerganov merged commit eb36362 into ggml-org:master May 14, 2023

JohannesGaessler mentioned this pull request May 14, 2023

OpenCL dequant_mul_mat #1459

Merged

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Closed
