Bayesian Modeling of Energy Consumption for CPU vs. CPU-GPU-based Tools

Hi all,

I’m currently working on a Bayesian analysis project to model the energy consumption (in joules) of two different software fault detection tools. One tool is “traditional”, relying solely on CPU computation, while the other leverages a large language model (LLM) and makes heavy use of GPU resources.

I’ve already set up my benchmarking environment and can gather detailed execution data for both tools, including energy usage. However, I’m unsure how to approach the modeling, particularly with regard to choosing appropriate distributions and accounting for the different hardware components.

My main questions are:

1- What kind of likelihood distributions are typically used to model energy consumption data (e.g., for CPU and GPU usage separately or jointly)?

2- How can I structure the models to reflect the fact that one tool uses only CPU, while the other uses both CPU and GPU, yet still make their energy consumption comparable in a principled way?

If more context would be helpful, I’d be happy to elaborate.

Best regards

I find it helps to start with the question you’re asking with the model, usually framed in terms of posterior predictive quantities of interest in a Bayesian setting. Do you only care about power consumption, or is the utility of the two tools going to come into play?

What is the source of the variation in power consumption? Are you running different code, different inputs to the same code, or something else? The likelihood is a model of how the output varies, so you need to know how it’s produced.

Joules is good—that is putting them on the relevant comparable scale. Even if your units are joules, it helps to keep everything roughly unit scaled for computation.
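
For example, a minimal sketch of what I mean (placeholder numbers and names; rescale however is natural for your data):

import numpy as np

# Placeholder energy measurements in joules for a few benchmark programs
energy_joules = np.array([3200.0, 45000.0, 870.0, 12800.0])

# Divide by a typical magnitude so the sampler works with values near 1
scale = energy_joules.mean()
energy_scaled = energy_joules / scale

# Multiply by `scale` to report results back in joules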

At this stage, my primary focus is on energy consumption.
I plan to assess the tools' fault-finding effectiveness later using simple descriptive statistics; for now, energy is the only thing I want to model.

I have a benchmark consisting of 150 programs, ranging from simple toy examples to more complex cases.
I seed faults into these programs and then run both tools to detect them.
During this process, I profile their energy consumption, specifically measuring CPU and GPU usage.

My assumption is that the traditional tool is more energy-efficient, as it does not utilize the GPU and likely places a lighter load on the CPU as well.

Given this context, my main question is: How should I approach modeling this energy consumption?
Would using Normal and HalfNormal distributions, as shown in the code below, be an appropriate starting point? (In the code, I generate synthetic data just to inspect how the models behave.)

import numpy as np
import pymc as pm

# Synthetic data just to inspect how the models behave
# (placeholder values; the real observations come from the benchmark runs)
rng = np.random.default_rng(42)
cpu_energy_A = rng.normal(50, 5, size=150)   # tool A: CPU energy (J)
cpu_energy_B = rng.normal(40, 5, size=150)   # tool B: CPU energy (J)
gpu_energy_B = rng.normal(60, 5, size=150)   # tool B: GPU energy (J)

with pm.Model() as energy_model:

    # Priors for tool A (CPU only)
    mu_A = pm.Normal("mu_A", mu=50, sigma=10)
    sigma_A = pm.HalfNormal("sigma_A", sigma=5)
    obs_A = pm.Normal("obs_A", mu=mu_A, sigma=sigma_A, observed=cpu_energy_A)

    # Priors for tool B (CPU + GPU)
    mu_cpu_B = pm.Normal("mu_cpu_B", mu=40, sigma=10)
    sigma_cpu_B = pm.HalfNormal("sigma_cpu_B", sigma=5)
    mu_gpu_B = pm.Normal("mu_gpu_B", mu=60, sigma=10)
    sigma_gpu_B = pm.HalfNormal("sigma_gpu_B", sigma=5)

    # Likelihoods for tool B
    obs_cpu_B = pm.Normal("obs_cpu_B", mu=mu_cpu_B, sigma=sigma_cpu_B, observed=cpu_energy_B)
    obs_gpu_B = pm.Normal("obs_gpu_B", mu=mu_gpu_B, sigma=sigma_gpu_B, observed=gpu_energy_B)

    # Derived quantity: total energy of tool B
    mu_total_B = pm.Deterministic("mu_total_B", mu_cpu_B + mu_gpu_B)

    # Posterior sampling
    trace = pm.sample(2000, tune=1000, target_accept=0.95, return_inferencedata=True)
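
To see how the models behave, I then summarize the fit with ArviZ (a rough sketch; I'm assuming arviz is available):

import arviz as az

# Posterior summaries for the mean-energy parameters of both tools
print(az.summary(trace, var_names=["mu_A", "mu_cpu_B", "mu_gpu_B", "mu_total_B"]))

# Visual check of the posteriors for the headline quantities
az.plot_posterior(trace, var_names=["mu_A", "mu_total_B"])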

It depends on how efficiently the GPU code is written. GPUs are much more energy efficient per flop than CPUs, but the trick is using all those flops efficiently. If you don’t optimize your use of the GPU, it won’t consume as much energy.

I’m guessing the main difference is due to using an LLM, not due to using a GPU. If you tried to implement the LLM on a CPU, it’d be way less energy efficient than implementing it on a GPU.

Given paired data, it’s usually better to do a paired comparison rather than comparing averages, because it gives you a better idea of how a random problem will fare under both approaches. Suppose x_n is the energy used by tool A and y_n is the energy used by tool B on problem instance n. Then you can pair the observations and look at differences x_n - y_n. We often want to compare ratios, which we can do by converting to the log scale, because \log x_n - \log y_n = \log(x_n / y_n). If you give the difference of the logs a normal distribution, it’s equivalent to giving x_n / y_n a lognormal distribution.
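
In symbols:

\log x_n - \log y_n \sim \textrm{normal}(\mu, \sigma)
\quad\Longleftrightarrow\quad
x_n / y_n \sim \textrm{lognormal}(\mu, \sigma)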


Hi Bob,

I’ve been trying to wrap my head around your suggestion for comparing the two approaches, and I believe you meant something along the lines of the code below – is that correct?
(I used ChatGPT to help me whip this up)

Would you recommend any changes to it, considering that cpu_energy_A and cpu_energy_B will contain data on the energy consumption for each (fault-detection) approach?

import numpy as np
import pymc as pm

# Log ratios of the paired energy measurements (tool A vs. tool B)
log_ratios = np.log(cpu_energy_A / cpu_energy_B)

with pm.Model() as paired_log_model:
    # Prior on the mean log-ratio
    mu_diff = pm.Normal("mu_diff", mu=0, sigma=1)

    # Prior on the spread of the log-ratios
    sigma_diff = pm.HalfNormal("sigma_diff", sigma=1)

    # Likelihood on observed log-ratios
    obs = pm.Normal("obs", mu=mu_diff, sigma=sigma_diff, observed=log_ratios)

    # Sampling
    trace = pm.sample(2000, tune=1000, target_accept=0.95, return_inferencedata=True)
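
To read the result off on the ratio scale, I'd then do something like this (again a sketch, assuming arviz is installed):

import arviz as az

# Posterior for the mean log-ratio
print(az.summary(trace, var_names=["mu_diff"]))

# Back-transform: exp(mu_diff) is the geometric-mean energy ratio of tool A to tool B
ratio_samples = np.exp(trace.posterior["mu_diff"].values.flatten())
print("posterior mean ratio A/B:", ratio_samples.mean())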

Before doing any of the modeling, I’d plot a histogram of log_ratios to see what it looks like. That will inform the observation model. If it’s even remotely close to normal, then the model you wrote looks good. All the model’s doing is fitting a normal distribution to some data with unit-scale priors on the mean and scale.
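
For example, a quick sketch with matplotlib:

import matplotlib.pyplot as plt

# Eyeball the shape of the log-ratios before committing to a normal likelihood
plt.hist(log_ratios, bins=30)
plt.xlabel("log(energy A / energy B)")
plt.ylabel("count")
plt.show()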

target_accept of 0.95 is on the high side—this simple model will be optimized for compute closer to a target accept of 0.60. You can play around with this to get a feel for how it works. The higher it is, the smaller the step size has to be in order to preserve the Hamiltonian well enough to hit that acceptance rate. So it’s a tradeoff: smaller steps require more work to move the same distance, but they’re also more accurate. The optimal balance intuitively feels like it should be closer to high accuracy. The terminology is also confusing around NUTS—in the multinomial version everyone’s using now, it’s tuning to the average acceptance in the last doubling of NUTS, which means the “will the sampler move in one step” probability is almost always much higher than the target accept probability.
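
If you want to see the tradeoff concretely, a sketch like this will do it (the sample-stat names, e.g. "step_size" and "diverging", can vary across PyMC versions):

# Re-run the same model with two target_accept values and compare
with paired_log_model:
    trace_low = pm.sample(2000, tune=1000, target_accept=0.60, return_inferencedata=True)
    trace_high = pm.sample(2000, tune=1000, target_accept=0.95, return_inferencedata=True)

for name, t in [("0.60", trace_low), ("0.95", trace_high)]:
    # Smaller adapted step sizes mean more leapfrog steps per transition
    print(name,
          "mean step size:", float(t.sample_stats["step_size"].mean()),
          "divergences:", int(t.sample_stats["diverging"].sum()))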
