
Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277

@adrian17

Description

This basically started with investigating byteorder's read_u16-style extensions. They are more or less meant to be used like this:

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().unwrap());
assert_eq!(768, rdr.read_u16::<BigEndian>().unwrap());

You can see that they are implemented in terms of Read's read_exact method:
https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7
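For reference, these extension methods boil down to a read_exact into a small stack buffer followed by a byte-order conversion. A simplified sketch (the real byteorder trait is generic over a ByteOrder parameter; this hard-codes big-endian and uses my own trait name):

```rust
use std::io::{self, Read};

// Simplified sketch of a byteorder-style extension trait: read_exact into a
// 2-byte stack buffer, then decode. `ReadU16Ext` and `read_u16_be` are
// made-up names for illustration, not byteorder's actual API.
trait ReadU16Ext: Read {
    fn read_u16_be(&mut self) -> io::Result<u16> {
        let mut buf = [0u8; 2];
        self.read_exact(&mut buf)?; // the call that ends up as a memcpy
        Ok(u16::from_be_bytes(buf))
    }
}

// Blanket impl, mirroring how byteorder extends every Read type.
impl<R: Read + ?Sized> ReadU16Ext for R {}
```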

However, it seems like when used with Cursor<&[u8]> or just &[u8], they are way slower than I'd expect them to be.

When I build a test program, I run into #47321, which is pretty bad, especially as read_exact calls memcpy for all sizes > 1 byte. This causes noticeable overhead, in particular on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact, and the only calls left in the inlined code are to io::Error::new. However, even when inlined, the code is still suboptimal.
Normally, I'd expect cursor.read_u16() to do one range check, a 2-byte read, and a cursor increment, and that's it. But it appears that more work is being done here, probably redundant slice range checks.
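For context, the Read impl for &[u8] in std implements read_exact as roughly the following (a paraphrase for illustration, not the exact std source): one bounds check, a copy that lowers to memcpy for lengths > 1, then an advance of the source slice.

```rust
use std::io;

// Paraphrase of what <&[u8] as Read>::read_exact does (not the exact std
// source): one bounds check, a copy_from_slice that lowers to memcpy for
// anything larger than 1 byte, and an advance of the source slice.
fn read_exact_sketch(src: &mut &[u8], buf: &mut [u8]) -> io::Result<()> {
    if buf.len() > src.len() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let (head, tail) = src.split_at(buf.len());
    buf.copy_from_slice(head); // this is the memcpy
    *src = tail; // advance past the consumed bytes
    Ok(())
}
```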

Here is my base test case:

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}

In my benchmarks, it spends 7.5ns/call.

But when I manually hoist the slice range check out of the call, so that the compiler knows the slice contains the required number of bytes, for example:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = &cursor.get_ref()[cursor.position() as usize..cursor.position() as usize + 2];
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improves to 5.3ns/call, and AFAIK the behavior is identical (aside from a different error kind).

This still generated range checks that seemed redundant (I may be wrong here), so I introduced unsafe:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = unsafe { cursor.get_ref().get_unchecked(cursor.position() as usize..cursor.position() as usize + 2) };
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

Time improved even more, to 4.1ns/call. (This is also close to the time I got from an equivalent C++ program with a simple range check on every read_u* call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:

// data: &mut &[u8]
ret += data.read_u16::<LittleEndian>()? as u64;

->

if data.len() < 2 {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
ret += data.read_u16::<LittleEndian>()? as u64;

The iteration speed improves by nearly 2x.
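For comparison, the code shape I'd hope the optimizer could reach is a single up-front length check covering all four reads. A hand-written sketch of that shape (using core's from_le_bytes instead of byteorder; `sum_manual` is my own name for it):

```rust
use std::convert::TryInto;
use std::io;

// Hand-written "ideal" version of the base test case: one length check up
// front covering all 8 bytes, then fixed-offset reads that need no further
// bounds checks. Uses from_le_bytes instead of byteorder; the shape is what
// matters, not the API.
pub fn sum_manual(data: &[u8]) -> io::Result<u64> {
    if data.len() < 8 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let mut ret = data[0] as u64 + data[1] as u64; // two u8 reads
    ret += u16::from_le_bytes(data[2..4].try_into().unwrap()) as u64;
    ret += u32::from_le_bytes(data[4..8].try_into().unwrap()) as u64;
    Ok(ret)
}
```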

Now I don't really know how much of this is due to #47321 and how much to other missed optimizations. It's also a bit awkward that #47321 triggers in the production code, so until that bug is fixed, I won't really see the benefits of fixing this one anyway.

I posted my benchmarks at https://github.com/adrian17/rust-byteorder-benchmarks .
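For anyone who wants to reproduce the numbers without pulling in a bench framework, a minimal std-only timing loop in the same spirit (my own helper, not what the linked repo uses; real measurements should use a proper benchmark harness):

```rust
use std::time::Instant;

// Minimal std-only microbenchmark helper (illustrative only): returns the
// average nanoseconds per call of `f` over `iters` iterations. The result is
// accumulated and passed through black_box so the calls aren't optimized away.
fn time_ns_per_call<F: FnMut() -> u64>(mut f: F, iters: u32) -> f64 {
    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        sink = sink.wrapping_add(f());
    }
    let elapsed = start.elapsed().as_nanos() as f64;
    std::hint::black_box(sink); // keep the loop body alive
    elapsed / iters as f64
}
```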

Metadata

Labels

A-io (Area: `std::io`, `std::fs`, `std::net` and `std::path`)
C-enhancement (Category: An issue proposing an enhancement or a PR with one)
I-slow (Issue: Problems and improvements with respect to performance of generated code)
