vulkan: Add fusion support for RMS_NORM+MUL #14366
base: master
Conversation
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
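A minimal sketch of the kind of per-pair check this implies, assuming the use_count field added by this PR; the PR's actual detection code may differ:

```cpp
#include "ggml.h"

// Hypothetical sketch, not the PR's actual code: decide whether an
// RMS_NORM node and the MUL node that consumes it can be fused.
static bool can_fuse_rms_norm_mul(const struct ggml_tensor * norm,
                                  const struct ggml_tensor * mul) {
    if (norm->op != GGML_OP_RMS_NORM || mul->op != GGML_OP_MUL) {
        return false;
    }
    // The MUL must take the RMS_NORM result as its first operand.
    if (mul->src[0] != norm) {
        return false;
    }
    // The intermediate result must not be needed anywhere else: exactly one
    // consumer (the use_count added by this PR) and not flagged as an output.
    if (norm->use_count != 1 || (norm->flags & GGML_TENSOR_FLAG_OUTPUT)) {
        return false;
    }
    // Broadcasting in MUL would change the result shape; require equal shapes.
    return ggml_are_same_shape(mul, norm);
}
```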
ggml/src/ggml-vulkan/ggml-vulkan.cpp (Outdated)
// Since norm is the first operand of mul, it must be the same shape
GGML_ASSERT(ggml_are_same_shape(mul, norm));

// XXX TODO: Do we need a way to indicate that the user doesn't need the intermediate result?
@slaren I'd like to get your thoughts on this. How do we know that the user doesn't want to have all the intermediate results available to them? Seems like they'd need to opt in or out of fusion somehow... GGML_TENSOR_FLAG_OUTPUT seems like almost what we would want, but it's slightly different and doesn't seem to be used much?
Checking for GGML_TENSOR_FLAG_OUTPUT is enough. This flag guarantees that the tensor will not be overwritten by another computation, so it is necessary to set this flag if the user wants the output of a tensor. In cases like imatrix, the graph is evaluated one operation at a time, so fusion should never be active.
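A usage sketch of that contract (not from the PR; the graph-building code is illustrative): a caller that needs the intermediate RMS_NORM result flags it as an output, which should also keep it from being fused away.

```cpp
#include "ggml.h"

// Sketch only: marking the intermediate tensor as an output means the backend
// must keep its contents, so the RMS_NORM+MUL pair should not be fused.
static struct ggml_tensor * build_norm_keep_intermediate(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct ggml_tensor  * weight,
        float                 eps) {
    struct ggml_tensor * norm = ggml_rms_norm(ctx, x, eps);
    ggml_set_output(norm);                 // caller wants the un-multiplied result
    return ggml_mul(ctx, norm, weight);    // this MUL is no longer a fusion candidate
}
```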
Precisely because getting details like this one right is not completely obvious, it would be good to have a function in ggml, similar to this one, that checks whether multiple operations can be fused and that all the backends can use to implement operator fusion.
Yeah, I can move some of the logic there, but I think eventually some of the implementation details may differ between backends. Hopefully there's a way to have some common logic and some backend-specific logic.
OK, I've extracted out some common logic into a helper function.
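For reference, a rough idea of what such a shared, graph-level helper could look like; the name and signature are assumptions, not the PR's actual helper, and it reuses the per-pair check sketched earlier:

```cpp
// Hypothetical graph-level wrapper: true if nodes i and i+1 form a fusable
// RMS_NORM -> MUL pair. Uses the public graph accessors from ggml.h.
static bool graph_can_fuse_rms_norm_mul(struct ggml_cgraph * cgraph, int i) {
    if (i + 1 >= ggml_graph_n_nodes(cgraph)) {
        return false;
    }
    const struct ggml_tensor * norm = ggml_graph_node(cgraph, i);
    const struct ggml_tensor * mul  = ggml_graph_node(cgraph, i + 1);
    return can_fuse_rms_norm_mul(norm, mul);   // per-pair check from the earlier sketch
}
```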
Forgot to include perf results. Helps prompt processing (pp), but no meaningful change to token generation (tg).
I picked rms_norm+mul to fuse because it's commonly generated in llm_graph_context::build_norm.