
Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277

@adrian17

Description

This basically started with investigating byteorder's read_u16-style extensions. They are more or less meant to be used like this:

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().unwrap());
assert_eq!(768, rdr.read_u16::<BigEndian>().unwrap());

You can see that they are implemented in terms of Read's read_exact method:
https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7
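For reference, these extension methods boil down to a read_exact into a small stack buffer followed by a byte-order conversion. A simplified sketch (the real byteorder trait is generic over a ByteOrder parameter; this hard-codes big-endian and uses my own trait name):

```rust
use std::io::{self, Read};

// Simplified sketch of a byteorder-style extension trait: read_exact into a
// 2-byte stack buffer, then decode. `ReadU16Ext` and `read_u16_be` are
// made-up names for illustration, not byteorder's actual API.
trait ReadU16Ext: Read {
    fn read_u16_be(&mut self) -> io::Result<u16> {
        let mut buf = [0u8; 2];
        self.read_exact(&mut buf)?; // the call that ends up as a memcpy
        Ok(u16::from_be_bytes(buf))
    }
}

// Blanket impl, mirroring how byteorder extends every Read type.
impl<R: Read + ?Sized> ReadU16Ext for R {}
```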

However, it seems like when used with Cursor<&[u8]> or just &[u8], they are way slower than I'd expect them to be.

When I build a test program, I run into #47321, which is pretty bad, especially as read_exact calls memcpy for all sizes > 1 byte. This causes noticeable overhead, in particular on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact, and the only calls left in the inlined code are to io::Error::new. However, even when inlined, the code is still suboptimal.
Normally, I'd expect cursor.read_u16() to do one range check, a 2-byte read, and a cursor increment, and that's it. But it appears that more work is being done here, probably redundant slice range checks.
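For context, the Read impl for &[u8] in std implements read_exact as roughly the following (a paraphrase for illustration, not the exact std source): one bounds check, a copy that lowers to memcpy for lengths > 1, then an advance of the source slice.

```rust
use std::io;

// Paraphrase of what <&[u8] as Read>::read_exact does (not the exact std
// source): one bounds check, a copy_from_slice that lowers to memcpy for
// anything larger than 1 byte, and an advance of the source slice.
fn read_exact_sketch(src: &mut &[u8], buf: &mut [u8]) -> io::Result<()> {
    if buf.len() > src.len() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let (head, tail) = src.split_at(buf.len());
    buf.copy_from_slice(head); // this is the memcpy
    *src = tail; // advance past the consumed bytes
    Ok(())
}
```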

Here is my base test case:

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}

In my benchmarks, it spends 7.5ns/call.

But when I manually hoist the slice range check out of the call, so that the compiler knows the slice contains the required number of bytes, for example:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = &cursor.get_ref()[cursor.position() as usize..cursor.position() as usize + 2];
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improves to 5.3ns/call, and AFAIK the behavior is identical (aside from a different error kind).

This still generated range checks that seemed redundant (I may be wrong here), so I introduced unsafe:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = unsafe { cursor.get_ref().get_unchecked(cursor.position() as usize..cursor.position() as usize + 2) };
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

Time improved even more, to 4.1ns/call. (This is also close to the time I got from an equivalent C++ program with a simple range check on every read_u* call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:

// data: &mut &[u8]
ret += data.read_u16::<LittleEndian>()? as u64;

->

if data.len() < 2 {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
ret += data.read_u16::<LittleEndian>()? as u64;

The iteration speed improves by nearly 2x.
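For comparison, the code shape I'd hope the optimizer could reach is a single up-front length check covering all four reads. A hand-written sketch of that shape (using core's from_le_bytes instead of byteorder; `sum_manual` is my own name for it):

```rust
use std::convert::TryInto;
use std::io;

// Hand-written "ideal" version of the base test case: one length check up
// front covering all 8 bytes, then fixed-offset reads that need no further
// bounds checks. Uses from_le_bytes instead of byteorder; the shape is what
// matters, not the API.
pub fn sum_manual(data: &[u8]) -> io::Result<u64> {
    if data.len() < 8 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let mut ret = data[0] as u64 + data[1] as u64; // two u8 reads
    ret += u16::from_le_bytes(data[2..4].try_into().unwrap()) as u64;
    ret += u32::from_le_bytes(data[4..8].try_into().unwrap()) as u64;
    Ok(ret)
}
```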

Now I don't really know how much of this is due to #47321 and how much to other missed optimizations. It's also a bit awkward that #47321 triggers in the production code, so until that bug is fixed, I won't really see the benefits of fixing this one anyway.

I posted my benchmarks at https://github.com/adrian17/rust-byteorder-benchmarks .
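For anyone who wants to reproduce the numbers without pulling in a bench framework, a minimal std-only timing loop in the same spirit (my own helper, not what the linked repo uses; real measurements should use a proper benchmark harness):

```rust
use std::time::Instant;

// Minimal std-only microbenchmark helper (illustrative only): returns the
// average nanoseconds per call of `f` over `iters` iterations. The result is
// accumulated and passed through black_box so the calls aren't optimized away.
fn time_ns_per_call<F: FnMut() -> u64>(mut f: F, iters: u32) -> f64 {
    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        sink = sink.wrapping_add(f());
    }
    let elapsed = start.elapsed().as_nanos() as f64;
    std::hint::black_box(sink); // keep the loop body alive
    elapsed / iters as f64
}
```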

Metadata

Labels

A-io (Area: `std::io`, `std::fs`, `std::net` and `std::path`)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
