Skip to content

Add optimized String.Join for IList<string> #44032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Nov 1, 2020

Conversation

johnthcall
Copy link
Contributor

When joining IEnumerable check whether it is a IList and if so go through the path used for string[].
Fixes #44027

@dnfadmin
Copy link

dnfadmin commented Oct 29, 2020

CLA assistant check
All CLA requirements met.

Copy link
Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, but logic overall LGTM.

@@ -629,6 +629,10 @@ public static string Join(string? separator, IEnumerable<string?> values)
{
throw new ArgumentNullException(nameof(values));
}
if (values is IList<string?> valuesIList)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider instead checking IReadOnlyList<T>. (List<T> and T[] both implement IReadOnlyList<T>.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the other thread I am now checking for both List and T[] so we can convert to ROS.

@GrabYourPitchforks GrabYourPitchforks added area-System.Runtime tenet-performance Performance related issue enhancement Product code improvement that does NOT require public API changes/additions and removed area-Infrastructure-coreclr labels Oct 29, 2020
@GrabYourPitchforks GrabYourPitchforks added this to the 6.0.0 milestone Oct 29, 2020
@@ -761,7 +777,7 @@ private static unsafe string JoinCore<T>(char* separator, int separatorLength, I
}
}

private static unsafe string JoinCore(char* separator, int separatorLength, string?[] value, int startIndex, int count)
private static unsafe string JoinCore(char* separator, int separatorLength, IList<string?> value, int startIndex, int count)
Copy link
Member

@stephentoub stephentoub Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you benchmarked this? In theory this could regress existing usage with arrays, as now every access is via an interface dispatch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pinged John offline with tips on using our performance repo's benchmarking suite. John - if you have any other questions re: benchmarking feel free to drop them in the conversation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does appear to be slower, do you see a way to maintain the performance without duplicating this method?

Method Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
ArrayTest new 85.41 ns 5.792 ns 6.438 ns 85.05 ns 74.32 ns 96.83 ns 1.00 0.00 0.0185 - - 80 B
ArrayTest old 49.66 ns 1.659 ns 1.775 ns 49.87 ns 47.21 ns 52.95 ns 0.58 0.04 0.0185 - - 80 B

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to special-case string[] and List<string> we could convert both to ReadOnlySpan<string>, which should have array-equivalent performance characteristics. That doesn't help other concrete types that implement IList<string>, though. (Is this common?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Levi's suggestion (and probably just for List<string>, since I think it's unlikely you're in a situation where you're getting a string[] strongly typed as an IEnumerable<string> and care about this level of optimization). Is there another input that's inspiring you here?

Copy link
Member

@GrabYourPitchforks GrabYourPitchforks Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the ROS<string?> solution would look something like this:

private static void JoinCore(..., ReadOnlySpan<string?> values, ...) { /* ... */ }

IEnumerable<string?> enumerable = ...;
if (enumerable is string?[] arr) { return JoinCore(..., arr, 0, arr.Length); }
else if (enumerable is List<string?> list) { return JoinCore(..., CollectionsMarshal.AsSpan(list), 0, list.Count); }
else { /* fall back to existing enum-based logic */ }

This takes advantage of the fact that T[] is implicitly convertible to ROS<T>, and List<T> is convertible to ROS<T> via the CollectionsMarshal.AsSpan API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (enumerable is string?[] arr) { return JoinCore(..., arr, 0, arr.Length); }

Is this check even necessary? As noted, I think it's unlikely you're in a situation where you're getting a string[] strongly typed as an IEnumerable<string> and care about this level of optimization. If you have a strongly-typed string[], you'll bind to the existing string[]-based overload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in my teams case we often have both arrays and lists passed as IList. I'm now realizing that this isn't as performant as having it kept as T[] all the way back. I agree that having a string[] typed as a IEnumerable is cases it's types as anything other than string[] there will be a significant perf improvement. I'm happy to remove the enumerable is string?[] arr case, let me know what you'd like to do.

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD
ArrayTest Job-EOKGEP old 49.50 ns 2.105 ns 2.252 ns 49.43 ns 45.51 ns 53.95 ns 1.00 0.00
ArrayTest Job-TTJIGM new 49.90 ns 1.592 ns 1.833 ns 50.21 ns 46.79 ns 52.93 ns 1.01 0.06
List Job-EOKGEP old 125.01 ns 4.675 ns 5.196 ns 123.35 ns 118.42 ns 137.19 ns 1.00 0.00
List Job-TTJIGM new 65.14 ns 2.381 ns 2.742 ns 65.01 ns 60.65 ns 70.58 ns 0.52 0.02
ArrayAsIList Job-EOKGEP old 88.90 ns 2.787 ns 2.983 ns 88.37 ns 84.83 ns 95.60 ns 1.00 0.00
ArrayAsIList Job-TTJIGM new 54.17 ns 1.419 ns 1.577 ns 54.04 ns 51.30 ns 57.32 ns 0.61 0.03
IEnumerable Job-EOKGEP old 206.12 ns 17.513 ns 20.168 ns 202.55 ns 177.56 ns 240.59 ns 1.00 0.00
IEnumerable Job-TTJIGM new 190.81 ns 6.470 ns 7.451 ns 192.75 ns 177.33 ns 201.28 ns 0.93 0.09

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These numbers look good to me. I don't really have any other feedback on the code. @stephentoub - did the latest iteration address your concerns?

(There's a little more cleanup we can do, such as nixing the startIndex and count parameters and removing some uses of unsafe, but that's best left for a followup PR.)

@@ -761,7 +784,7 @@ private static unsafe string JoinCore<T>(char* separator, int separatorLength, I
}
}

private static unsafe string JoinCore(char* separator, int separatorLength, string?[] value, int startIndex, int count)
private static unsafe string JoinCore(char* separator, int separatorLength, ReadOnlySpan<string?> value, int startIndex, int count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this takes a span, are startIndex and count needed? Callers can slice if necessary. And this implementation will then be faster as it'll be able to iterate over the full size of the span and bounds checks will be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In public static unsafe string Join(char separator, string?[] value, int startIndex, int count) would i do value.AsSpan(startIndex, count)?

Copy link
Member

@stephentoub stephentoub Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

And now that I look at the implementation more carefully to answer your question, I'm realizing the change still needs further work, anyway. Spans can never be null, so the change to the type of value is removing a null check that should have been there for the public static unsafe string Join(string? separator, string?[] value, int startIndex, int count) overload. By doing the AsSpan, we'd similarly need to move the argument validation for startIndex and count to the relevant entry point(s) that need the arguments validated.

(Seems then as well like we're missing a test that would have caught that null check?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move this and other public overloads to pass down the value.AsSpan, cleanup the JoinCore implementation, and benchmark again to verify there are no regressions. Do we want to move the argument validation of startIndex and count to the public overloads even though value.ToSpan(startIndex, count) is doing the same validations (with less helpful error messages)?

Regarding the missing test there is already AssertExtensions.Throws<ArgumentNullException>("value", () => string.Join('|', (string[])null, 0, 0)); I'll RC why this wasn't failing.

@GrabYourPitchforks
Copy link
Member

I suppose I'll dissent a little bit. Per my earlier comment, I don't think this PR is the right place for the JoinCore refactoring that was done to drop the startIndex and count parameters, because the real perf gains (bounds check elision, etc.) won't come in until we also change things like the for loop inside that method. And now we're introducing significant enough churn and duplicating enough code that we're losing sight of what the PR was originally supposed to be for: reducing allocations in the case where somebody called string.Join(..., new List<string>(...)). I would have preferred to keep this type of refactoring in a separate PR, because then we could also make other improvements like removing the use of unsafe.

That said, the PR as it stands now looks fine, but if we want to address concerns like bounds check elision and removal of unsafe code we'll still need to submit a follow-up.

@stephentoub
Copy link
Member

stephentoub commented Oct 31, 2020

I wasn't suggesting it for perf gains; bounds checking elisions is a bonus. Rather, than change is adding complexity, and it's the antithesis of span to pass one and also pass bounds. In switching to span, we shouldn't incur debt to be cleaned up later. It also highlighted that the change to use span was erroneous without moving additional argument validation, which is part of the refactoring to which you refer.

@stephentoub
Copy link
Member

but if we want to address concerns like bounds check elision and removal of unsafe code we'll still need to submit a follow-up

Yes. Though I'd be fine if we did it as part of this change, too. It's all touching the same code; a split is a bit artificial. But either way.

@stephentoub
Copy link
Member

Thanks, @johnthcall.

I'll follow-up on my comments.

@stephentoub stephentoub merged commit f1de95e into dotnet:master Nov 1, 2020
@johnthcall johnthcall deleted the issue-44027 branch November 1, 2020 18:30
@ghost ghost locked as resolved and limited conversation to collaborators Dec 6, 2020
@@ -625,6 +642,16 @@ public static unsafe string Join<T>(string? separator, IEnumerable<T> values)

public static string Join(string? separator, IEnumerable<string?> values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider also doing these checks in the:

public static string Join(char separator, IEnumerable<T> values)

overload? I'm a little concerned now without that we've created a cliff. Imagine someone writes:

string result = string.Join(",", strings);

and then realizes they can reduce that to:

string result = string.Join(',', strings);

They just lost this optimization.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add optimized String.Join for IList<string>
5 participants