Help me understand the memory usage situation when using GPU #2118

Closed
@JianbangZ

Description

So I built with cuBLAS, quantized my 7B model to q4_0, and offloaded all of its layers to the GPU with ./main. I noticed that even though compute happens on the GPU and about 4 GB of VRAM is in use, the CPU memory is never released, so roughly 4 GB of CPU RAM also stays occupied.
Is this the expected behavior? Are the weights offloaded directly to the GPU, or are they loaded into CPU RAM first and then copied to VRAM? If the latter, why isn't the CPU memory released, or at least released immediately?

I also tried the server/chat.sh program built with cuBLAS, and there the CPU memory is released shortly after the server is up and running.
Help me understand, please.
