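Using the `_mm512_aesdec_epi128` intrinsic (https://doc.rust-lang.org/nightly/core/arch/x86_64/fn._mm512_aesdec_epi128.html), it should be possible to get up to 4x the throughput on large strings, because it applies an AES decryption round to four 128-bit lanes per call instead of one. Note: very few processors currently support this, since it requires both the VAES and AVX-512F CPU features.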
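
To make the 4x figure concrete, here is a rough sketch of the work a single 512-bit call would cover. It deliberately uses only the long-stable 128-bit `_mm_aesdec_si128` intrinsic and applies it to each of the four 16-byte lanes of a 64-byte block in turn. The function names `aesdec_round_4_lanes` and `aesdec_round_4_lanes_aesni` are hypothetical, and using one shared round key for all four lanes is an assumption made for illustration, not something taken from this project's code.

```rust
/// Apply one AES decryption round to each 16-byte lane of a 64-byte block,
/// one lane at a time. Returns `None` when AES-NI is not available.
/// (Hypothetical helper, for illustration only.)
#[cfg(target_arch = "x86_64")]
pub fn aesdec_round_4_lanes(block: &[u8; 64], key: &[u8; 16]) -> Option<[u8; 64]> {
    if !is_x86_feature_detected!("aes") {
        return None;
    }
    // SAFETY: AES-NI support was checked at runtime just above.
    Some(unsafe { aesdec_round_4_lanes_aesni(block, key) })
}

/// Fallback for other architectures: no hardware AES path in this sketch.
#[cfg(not(target_arch = "x86_64"))]
pub fn aesdec_round_4_lanes(_block: &[u8; 64], _key: &[u8; 16]) -> Option<[u8; 64]> {
    None
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "aes")]
unsafe fn aesdec_round_4_lanes_aesni(block: &[u8; 64], key: &[u8; 16]) -> [u8; 64] {
    use std::arch::x86_64::{__m128i, _mm_aesdec_si128, _mm_loadu_si128, _mm_storeu_si128};

    // Assumption: the same round key is used for every lane.
    let round_key = _mm_loadu_si128(key.as_ptr() as *const __m128i);
    let mut out = [0u8; 64];

    // Four separate 128-bit AESDEC rounds. A 512-bit VAES implementation
    // would cover all four lanes with a single `_mm512_aesdec_epi128` call.
    for lane in 0..4 {
        let offset = lane * 16;
        let state = _mm_loadu_si128(block.as_ptr().add(offset) as *const __m128i);
        let mixed = _mm_aesdec_si128(state, round_key);
        _mm_storeu_si128(out.as_mut_ptr().add(offset) as *mut __m128i, mixed);
    }
    out
}
```

A real wide path would presumably sit behind its own runtime check for VAES and AVX-512F, with this 128-bit loop kept as the fallback for the many CPUs that lack them.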