So I built with cuBLAS, quantized my 7B model to q4_0, and offloaded all of its layers to the GPU with ./main. Even though compute happens on the GPU and about 4 GB of VRAM is in use, the CPU memory is never released, so about 4 GB of CPU RAM stays in use as well.
Is this the expected behavior? Are the weights offloaded directly to the GPU, or loaded into CPU RAM first and then copied to VRAM? If the latter, why is the CPU memory not released, or at least not released immediately?
I also tried the server (via chat.sh) built with cuBLAS, and there the CPU memory is released shortly after the server is up and running.
Help me understand, please.
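
For what it's worth, here is a sketch (my own check, assuming Linux) of how to tell whether that resident memory is private allocations or file-backed pages, e.g. from an mmap'd model file. File-backed pages count toward a process's resident size in tools like top, but the kernel can reclaim them under memory pressure, so they are not really "held" RAM:

```shell
# Sketch, assumes Linux: split resident memory into anonymous pages
# (private allocations) and file-backed pages (e.g. an mmap'd model file).
# Inspecting this shell's own status as a self-contained example; replace
# /proc/self with /proc/<pid> of the ./main process to check it instead.
grep -E 'RssAnon|RssFile' /proc/self/status
```

If most of the 4 GB shows up under RssFile rather than RssAnon, it is page cache for the mapped model file, not memory the process has to free.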