
Deduplicated dequantization code #1453

Merged

Conversation

@JohannesGaessler (Collaborator) commented on May 14, 2023

For my GPU acceleration PR #1412 I used a template to decouple the code for matrix-vector multiplication from the dequantization code. This PR applies the same principle to dequantization during prompt processing: the same dequantization kernels can be reused through a different template. This deduplicates the CUDA code and keeps the dequantization logic consistent across kernels. As a side effect, the new kernels are also slightly faster on my hardware.
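
For illustration, here is a minimal sketch of that template pattern: a per-format `__device__` dequantization function is passed as a template parameter, so one generic kernel body can serve every quantization format. The names (`dequantize_kernel_t`, `dequantize_block`, `dequantize_row_q4_0_cuda`) and the simplified q4_0 block layout are assumptions for this sketch, not necessarily the exact code in this PR.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

#define QK4_0 32

// Simplified q4_0 block: one scale plus 16 bytes of packed 4-bit quants (assumed layout).
typedef struct {
    float   d;               // scale
    uint8_t qs[QK4_0 / 2];   // packed nibbles
} block_q4_0;

// Shared signature: dequantize two values from block ib at quant index iqs.
typedef void (*dequantize_kernel_t)(const void * vx, int ib, int iqs, float & v0, float & v1);

static __device__ void dequantize_q4_0(const void * vx, int ib, int iqs, float & v0, float & v1) {
    const block_q4_0 * x = (const block_q4_0 *) vx;
    const float d   = x[ib].d;
    const uint8_t q = x[ib].qs[iqs];
    v0 = ((q & 0xF) - 8) * d;  // low nibble
    v1 = ((q >>  4) - 8) * d;  // high nibble
}

// Generic dequantization kernel for prompt processing: the format-specific logic
// is injected via the template parameters, so only this one body is maintained.
template <int qk, int qr, dequantize_kernel_t dequantize>
static __global__ void dequantize_block(const void * vx, float * y, const int k) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;  // one quant pair per thread
    if (2 * i >= k) {
        return;
    }

    const int ib   = (2 * i) / qk;     // quantized block index
    const int iqs  = i % (qk / qr);    // quant index inside the block
    const int iybs = ib * qk;          // output offset of the block

    float v0, v1;
    dequantize(vx, ib, iqs, v0, v1);

    // For q4_0-style formats the two values of one byte land in the first and
    // second half of the block, respectively.
    y[iybs + iqs + 0]       = v0;
    y[iybs + iqs + qk / qr] = v1;
}

// Per-format launcher: a new format only needs its own dequantize function
// plus a one-line template instantiation like this.
static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k / 2 + block_size - 1) / block_size;
    dequantize_block<QK4_0, 2, dequantize_q4_0><<<num_blocks, block_size, 0, stream>>>(vx, y, k);
}
```

With this structure, adding a quantization format means writing its small dequantize function and instantiating the template, instead of maintaining a separate full kernel per format.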

Performance numbers for perplexity calculation on the first 100 lines of wikitext:

| GPU | Model | ms/t (master) | ms/t (PR) |
|---|---|---|---|
| RTX 3090 | 7b q4_0 | 3.61 | 3.60 |
| RTX 3090 | 7b q4_1 | 3.82 | 3.74 |
| RTX 3090 | 7b q5_0 | 3.75 | 3.64 |
| RTX 3090 | 7b q5_1 | 3.75 | 3.67 |
| RTX 3090 | 7b q8_0 | 4.29 | 4.05 |
| RTX 3090 | 7b f16 | 4.91 | 4.86 |
| GTX 1070 | 7b q4_0 | 9.78 | 7.39 |
| GTX 1070 | 7b q4_1 | 9.86 | 7.67 |
| GTX 1070 | 7b q5_0 | 10.01 | 7.63 |
| GTX 1070 | 7b q5_1 | 10.12 | 7.79 |
| GTX 1070 | 7b q8_0 | 11.88 | 8.28 |
| GTX 1070 | 7b f16 | 10.62 | 10.69 |

I will also add GTX 1070 numbers once I have them. (Done: they are now in the table above.)

The goal of this PR is not to optimize performance but to simplify the code base and make further development easier. As long as the new kernels don't cause a performance regression, I consider that good enough.

@JohannesGaessler added the refactoring label on May 14, 2023
@ggerganov (Member) commented:

I don't see any performance degradation on the RTX 4080.

@ggerganov requested a review from slaren on May 14, 2023 at 18:17