I got into a discussion with some actuaries about different implementations of a simple actuarial modeling problem. We all implemented a simple 10-period net present value calculation. Here is my implementation:
pub fn npv(
    mortality_rates: &Vec<f64>,
    lapse_rates: &Vec<f64>,
    interest_rate: f64,
    sum_assured: f64,
    premium: f64,
    init_pols: f64,
    term: Option<usize>,
) -> f64 {
    let term = term.unwrap_or_else(|| mortality_rates.len());
    let mut result = 0.0;
    let mut inforce = init_pols;
    let v: f64 = 1.0 / (1.0 + interest_rate);
    for (t, (q, w)) in mortality_rates.iter().zip(lapse_rates).enumerate() {
        let no_deaths = if t < term { inforce * q } else { 0.0 };
        let no_lapses = if t < term { inforce * w } else { 0.0 };
        let premiums = inforce * premium;
        let claims = no_deaths * sum_assured;
        let net_cashflow = premiums - claims;
        result += net_cashflow * v.powi(t as i32);
        inforce = inforce - no_deaths - no_lapses;
    }
    result
}
I benchmarked it, and it ran in 364.35 ns.
Another person re-implemented my version in Julia:
function npv3(q, w, P, S, r, term=nothing)
    term = isnothing(term) ? length(q) + 1 : term + 1
    inforce = 1.0
    result = 0.0
    for (t, (q, w)) in enumerate(zip(q, w))
        deaths = t < term ? inforce * q : 0.0
        lapses = t < term ? inforce * w : 0.0
        premiums = inforce * P
        claims = deaths * S
        ncf = premiums - claims
        result += ncf * (1 / (1 + r))^(t - 1)
        inforce = inforce - deaths - lapses
    end
    return result
end
This user benchmarked his version and found that it ran in 135 ns. This isn't really about deciding between Rust and Julia, but it is interesting. He is using an M1 Mac where Julia runs under Rosetta (I'm on Windows 10), so it's hard to compare these two implementations in a meaningful way.
My question is: is there anything obvious that I am missing in the Rust implementation that could improve it? If anyone has an M1 Mac and could benchmark on their machine, that would be really interesting.
One suggestion: take &[f64] instead of &Vec<f64>. Even if it doesn't improve speed, it's better style: it's more general, since it allows passing subslices instead of requiring full vectors, at essentially no cost (not even extra syntax at the call site, as &Vec<T> will deref-coerce to &[T] automatically).
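For illustration, a minimal sketch of that coercion (the function and values here are made up, not from the thread):

fn total(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

fn main() {
    let v: Vec<f64> = vec![1.0, 2.0, 3.0];
    let all = total(&v);            // &Vec<f64> deref-coerces to &[f64]
    let first_two = total(&v[..2]); // subslices work with no extra ceremony
    println!("{all} {first_two}");
}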
Can you share how you got that measurement? Anything below about 16 ms is extremely sensitive to external factors, and thus rather hard to benchmark reliably. Obligatory mention of Criterion.rs (see the Criterion.rs documentation).
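For reference, a minimal Criterion harness could look something like this sketch (the crate name npv_bench and the input values are made up); it would live in benches/npv.rs, with criterion as a dev-dependency and harness = false for the bench target:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use npv_bench::npv; // hypothetical crate name

fn bench_npv(c: &mut Criterion) {
    // Arbitrary 10-period assumptions; black_box keeps the inputs
    // out of the optimizer's reach.
    let q = vec![0.001; 10];
    let w = vec![0.05; 10];
    c.bench_function("npv", |b| {
        b.iter(|| {
            npv(
                black_box(&q),
                black_box(&w),
                black_box(0.03),
                black_box(100_000.0),
                black_box(500.0),
                black_box(1.0),
                black_box(None),
            )
        })
    });
}

criterion_group!(benches, bench_npv);
criterion_main!(benches);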
You should probably run the two programs on the same machine first to get comparable benchmarks, preferably a machine without much else running at the same time. (Edit: and of course, don't forget the --release flag!)
Yes, I'm doing that. I'll have to install Julia and re-run his benchmarks to be able to compare, I guess. I was just checking that I wasn't doing anything obviously wrong.
If you know you'll always have sane (i.e. not NaN) inputs and are okay with slightly less accurate results, you can use the Rust equivalent of the -ffast-math flag provided by most C compilers. These "fast" floating-point operations are exposed through nightly intrinsics like std::intrinsics::fadd_fast() and give the compiler more flexibility in reordering operations.
Crates like fast-floats have wrappers around f32/f64 which overload the normal arithmetic operators to use the fast float intrinsics.
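As a rough illustration, a hypothetical helper (not code from the thread) using those intrinsics; they are nightly-only and unsafe, with undefined behaviour on NaN/infinite inputs:

#![feature(core_intrinsics)] // at crate root; nightly only
use std::intrinsics::{fadd_fast, fmul_fast};

// Discounts flows[t] by v^t and sums, using relaxed float semantics so
// the compiler may reorder or fuse operations. Results can differ
// slightly from the strict-IEEE version.
pub fn discounted_sum_fast(flows: &[f64], v: f64) -> f64 {
    let mut result = 0.0;
    let mut v_t = 1.0;
    for &cf in flows {
        // SAFETY: caller guarantees all inputs are finite (no NaN/Inf).
        unsafe {
            result = fadd_fast(result, fmul_fast(cf, v_t));
            v_t = fmul_fast(v_t, v);
        }
    }
    result
}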
In addition to fast math (to enable fused multiply-add and associativity, plus other transformations that may be unsafe/undesirable), I'd also recommend compiling for your native CPU, as in RUSTFLAGS="-C target-cpu=native" cargo bench.
I profiled the implementation and found that powi was taking most of the time. I changed it to the following implementation:
pub fn npv(
    mortality_rates: &[f64],
    lapse_rates: &[f64],
    interest_rate: f64,
    sum_assured: f64,
    premium: f64,
    init_pols: f64,
    term: Option<usize>,
) -> f64 {
    let term = term.unwrap_or_else(|| mortality_rates.len());
    let mut result = 0.0;
    let mut inforce = init_pols;
    let v = 1.0 / (1.0 + interest_rate);
    // Running discount factor; starts at v^0 = 1.0 so the first period
    // matches the v.powi(t) version above.
    let mut v_t = 1.0;
    for (t, (q, w)) in mortality_rates.iter().zip(lapse_rates).enumerate() {
        let no_deaths = if t < term { inforce * q } else { 0.0 };
        let no_lapses = if t < term { inforce * w } else { 0.0 };
        let premiums = inforce * premium;
        let claims = no_deaths * sum_assured;
        let net_cashflow = premiums - claims;
        result += net_cashflow * v_t;
        v_t *= v;
        inforce = inforce - no_deaths - no_lapses;
    }
    result
}
Now it's down to 31 ns.
I know it no longer has to recompute the power on each iteration, since it carries the discount factor forward from the previous iteration, but would there also be a difference in the implementation of powi on Arch vs. Windows?
Hmm, I wonder if this is because of float non-associativity.
powi is an LLVM intrinsic (https://rust.godbolt.org/z/jsdMex) that calls a function from compiler-rt, which (if we're using the C implementation and not the Rust one, which I don't remember) is implemented like this:
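Roughly, it's a square-and-multiply loop; here is a rough Rust transliteration of compiler-rt's __powidf2 for reference (the real version lives in compiler-rt / the compiler-builtins crate):

// Square-and-multiply: O(log b) multiplications instead of b of them.
fn powidf2(mut a: f64, mut b: i32) -> f64 {
    let recip = b < 0;
    let mut r = 1.0;
    loop {
        if b & 1 != 0 {
            r *= a;
        }
        b /= 2;
        if b == 0 {
            break;
        }
        a *= a;
    }
    if recip { 1.0 / r } else { r }
}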
That's really good for computing it once, but because of floating-point semantics the compiler technically isn't allowed to replace that with the running-product version in your alternative implementation, which can accumulate more error.
That said, the intrinsic's definition does say:
The order of evaluation of multiplications is not defined.
So maybe it's supposed to be able to do this anyway?
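Either way, the underlying issue is easy to demonstrate: floating-point arithmetic is not associative (addition shown here; multiplication behaves the same way):

fn main() {
    let left = (0.1_f64 + 0.2) + 0.3;  // 0.6000000000000001
    let right = 0.1_f64 + (0.2 + 0.3); // 0.6
    assert_ne!(left, right); // re-association changes the result
    println!("{left} vs {right}");
}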
An easy way to check whether your refactoring changed the output would be to fuzz the old and new implementations against each other. (Though I think refactoring the code was a bit out of scope here, since the goal was to compare the exact same algorithm ^^)
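A sketch of what that could look like with the proptest crate (npv_old and npv_new are hypothetical names for the powi version and the running-product version):

use proptest::prelude::*;

proptest! {
    #[test]
    fn implementations_agree(
        // Random per-period (mortality, lapse) pairs and an interest rate.
        rates in prop::collection::vec((0.0..0.1f64, 0.0..0.2f64), 1..50),
        r in 0.0..0.2f64,
    ) {
        let (q, w): (Vec<f64>, Vec<f64>) = rates.into_iter().unzip();
        let old = npv_old(&q, &w, r, 100_000.0, 500.0, 1.0, None);
        let new = npv_new(&q, &w, r, 100_000.0, 500.0, 1.0, None);
        // Bitwise equality is too strict once operations are re-associated;
        // compare within a small relative tolerance instead.
        prop_assert!((old - new).abs() <= 1e-9 * old.abs().max(1.0));
    }
}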
Sorry, but you did: after the fold you select the second value of the result tuple (equivalent to v_t), but you actually needed the first, by using .0 instead of .1. Since v_t is only ever multiplied by a constant, the compiler can remove all the other work, which of course is much faster.
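A hypothetical sketch of that kind of fold (not the actual code under discussion), where the accumulator carries the NPV in .0 and the running discount factor in .1:

pub fn npv_fold(q: &[f64], w: &[f64], r: f64, s: f64, p: f64, init_pols: f64) -> f64 {
    let v = 1.0 / (1.0 + r);
    q.iter()
        .zip(w)
        // Accumulator is (result, v_t, inforce).
        .fold((0.0, 1.0, init_pols), |(result, v_t, inforce), (q, w)| {
            let deaths = inforce * q;
            let lapses = inforce * w;
            let ncf = inforce * p - deaths * s;
            (result + ncf * v_t, v_t * v, inforce - deaths - lapses)
        })
        .0 // .1 here would return the discount factor, a power of a
           // constant, and the optimizer can then discard everything else
}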
Can I just point out that it's absolutely awesome that someone goes "hey, why is this code slow?", and the community spends over a week optimizing the living hell out of it just to show that it's possible?