Is it possible to run multiple LLM instances in parallel using multithreading to handle multiple queries simultaneously on Jetson Orin AGX?

I’m exploring the possibility of running multiple LLM (Large Language Model) instances in parallel on the Jetson Orin AGX using multithreading or multiprocessing. The goal is to handle multiple queries simultaneously to improve performance and responsiveness in real-time applications. I’d like to know if this is feasible given the Orin AGX’s GPU and CPU architecture, and what would be the recommended approach — whether using threading, multiprocessing, or containerization. Any insights on resource allocation, performance optimization, or example implementations would be highly appreciated.
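
For concreteness, here is a rough multiprocessing sketch of the kind of setup I have in mind. I'm assuming llama-cpp-python purely as an example backend, and the model path and prompts are placeholders:

```python
# Sketch: one model copy per worker process, each serving queries from a queue.
# Assumes llama-cpp-python is installed; MODEL_PATH is a placeholder.
import multiprocessing as mp

MODEL_PATH = "/models/llama-2-7b.Q4_K_M.gguf"  # placeholder path

def worker(query_queue, result_queue):
    from llama_cpp import Llama  # import inside the child process
    # Each process loads its own copy of the weights (memory cost doubles per worker)
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1)
    while True:
        prompt = query_queue.get()
        if prompt is None:  # poison pill -> shut down
            break
        out = llm(prompt, max_tokens=128)
        result_queue.put(out["choices"][0]["text"])

if __name__ == "__main__":
    queries, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(queries, results)) for _ in range(2)]
    for p in procs:
        p.start()
    for q in ["What is Jetson Orin?", "Explain CUDA streams."]:  # placeholder queries
        queries.put(q)
    print(results.get())
    print(results.get())
    for _ in procs:
        queries.put(None)
    for p in procs:
        p.join()
```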

Hi,

Please check our tutorial below for running LLMs:

Thanks.

Dear Moderator @AastaLLL,
I'm aware of how to run LLMs on the device; that wasn't my question.
My question was: is it possible to run multiple LLaMA instances at the same time?
If there is a way to do this, sharing it would be really helpful.

Hi,

Is your goal to handle multiple queries simultaneously?
If so, you can try the sample above, as it can handle multiple queries at the same time.

Loading the same model multiple times will use much more memory, which might not be the optimal approach.
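
As a rough illustration only (simplified, and not taken from the tutorial), a single loaded model can serve requests from several client threads through a queue, so the weights are resident in memory only once. The backend and model path below are placeholders:

```python
# Sketch: one loaded model behind a dedicated inference thread; multiple
# callers submit prompts via a queue. Requests are answered one at a time,
# but only one copy of the weights is in memory.
# Assumes llama-cpp-python; MODEL_PATH is a placeholder.
import threading
import queue
from llama_cpp import Llama

MODEL_PATH = "/models/llama-2-7b.Q4_K_M.gguf"  # placeholder path

requests = queue.Queue()

def inference_worker():
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1)  # loaded once
    while True:
        prompt, reply_box, done = requests.get()
        out = llm(prompt, max_tokens=128)
        reply_box.append(out["choices"][0]["text"])
        done.set()  # wake the waiting caller

threading.Thread(target=inference_worker, daemon=True).start()

def ask(prompt):
    # Any thread can call this; it blocks until its answer is ready.
    reply_box, done = [], threading.Event()
    requests.put((prompt, reply_box, done))
    done.wait()
    return reply_box[0]

print(ask("What is TensorRT?"))  # placeholder query
```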

Thanks.

