
Add :greedy scheduler to @threads #52096


Merged: 11 commits merged into JuliaLang:master on Feb 6, 2024

Conversation

@Seelengrab (Contributor) commented Nov 9, 2023

This implements a very greedy scheduler for @threads: it spawns up to threadpoolsize() tasks, each of which greedily takes elements from the iterator as they are produced. This scheduler supports infinite iterators. (An illustrative sketch follows the task list below.)

Needs

  • Tests
  • More extensive docs
  • News
  • Thread safety review
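
For illustration, here is a minimal sketch of the pattern described above. This is not the PR's actual implementation; `greedy_foreach` is a hypothetical name, and the zero-capacity Channel as hand-off point is an assumption:

```julia
using Base.Threads

# Sketch only, not the PR's implementation: a producer task feeds the
# iterator into a Channel, and up to threadpoolsize() consumer tasks take
# elements as soon as they become free. Works for infinite iterators since
# elements are handed out one at a time.
function greedy_foreach(f, itr)
    ch = Channel{eltype(itr)}(0) do ch
        for x in itr
            put!(ch, x)
        end
    end
    @sync for _ in 1:Threads.threadpoolsize()
        Threads.@spawn for x in ch
            f(x)
        end
    end
    return nothing
end

greedy_foreach(x -> println(x^2), 1:10)
```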

@giordano added labels: multithreading (Base.Threads and related functionality), needs tests, needs docs, needs news (Nov 9, 2023)
@Seelengrab removed label: needs tests (Nov 9, 2023)
@carstenbauer (Member) commented Nov 10, 2023

Some initial benchmarks: https://github.com/carstenbauer/parallel-julia-zoo/tree/greedy/multithreading

In particular, see the *.greedy.out files.

@threads :greedy looks good (compared to :static/:dynamic) in the juliaset benchmark (non-uniform workload):

benchmark
  serial             2.187 s    (0 allocations: 0 bytes)
  spawn              274.958 ms (40019 allocations: 4.03 MiB)
  threads :dynamic   544.403 ms (42 allocations: 4.25 KiB)
  threads :static    544.766 ms (42 allocations: 4.25 KiB)
  threads :greedy    281.027 ms (64 allocations: 6.23 KiB)
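
For context, the new scheduler is selected the same way as :static and :dynamic. A hedged usage sketch with an intentionally non-uniform workload (the actual juliaset benchmark code lives in the repository linked above):

```julia
using Base.Threads

# Illustrative only: per-iteration work grows with i, the regime where
# :greedy's per-item load balancing pays off.
results = zeros(100)
@threads :greedy for i in 1:100
    acc = 0.0
    for k in 1:(i * 10_000)   # later iterations do far more work
        acc += sin(k)
    end
    results[i] = acc          # distinct indices, so this is race-free
end
```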

The worst result is probably in the mapreduce_small benchmark (small uniform workload per task). Here the overhead (locking of the channel?) shows up very clearly:

# 100*nthreads() many chunks, f = sin
benchmark
  serial             589.951 μs (0 allocations: 0 bytes)
  spawn              430.513 μs (4814 allocations: 437.91 KiB)
  threads :static    76.664 μs  (45 allocations: 10.69 KiB)
  threads :dynamic   92.247 μs  (45 allocations: 10.69 KiB)
  threads :greedy    1.490 ms   (72 allocations: 14.13 KiB)

@Seelengrab (Contributor, Author) commented Nov 10, 2023

> 100*nthreads() many chunks, f = sin benchmark

To reference the conversation on Slack: that benchmark was run with N = 10_000 and 8 threads, so 80_000 elements in the input array, or about 64 KiB of data. With 800 chunks, that's a minuscule amount of data per chunk, which increases the overhead dramatically. I'll keep that in mind for the docs of :greedy, as a sort of "when is this a good idea to use".

In general, I don't expect :greedy to be used for work items that individually take on the order of microseconds or less. In that regime, the task spawning and channel locking overhead becomes very noticeable.
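
One possible mitigation in that regime (my suggestion here, not something this PR does): batch the iterator so each hand-off covers a whole chunk of elements, amortizing the locking cost. The chunk size of 1_000 is an arbitrary assumption:

```julia
using Base.Threads

xs = rand(80_000)
out = similar(xs)
# Each item handed to a task is now a whole chunk of indices, so the
# per-element spawning/locking overhead is amortized over 1_000 elements
# instead of being paid once per element.
@threads :greedy for chunk in Iterators.partition(eachindex(xs), 1_000)
    for i in chunk
        out[i] = sin(xs[i])
    end
end
```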

@simonbyrne (Member) commented

A bit of an aside, but could we actually make @threads extensible somehow? I.e., give it a proper interface, e.g. for partitioning iterators, specifying how many iterator items to assign to each task, etc.
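
Purely as a strawman of what such an interface might look like (hypothetical names throughout; nothing like this exists in Base):

```julia
# Hypothetical sketch only: a scheduler object with overloadable hooks
# instead of a hardcoded Symbol. None of these names exist in Base.
abstract type AbstractScheduler end

struct ChunkedScheduler <: AbstractScheduler
    items_per_task::Int
end

# A hook a user-defined scheduler could overload to control partitioning:
schedule_partition(s::ChunkedScheduler, itr) =
    Iterators.partition(itr, s.items_per_task)

# Usage might then look like:
#   @threads ChunkedScheduler(16) for x in itr ... end
```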

@Seelengrab (Contributor, Author) commented

> A bit of an aside, but could we actually make @threads extensible somehow?

I was thinking about that too while implementing this, but it's likely better suited for a later PR. It came up in particular because I originally wanted this implementation to simply replace :dynamic, but it turns out we actually guarantee that each task processes contiguous regions:

> :dynamic (default)
>
> :dynamic scheduler executes iterations dynamically to available worker threads. Current implementation assumes that the workload for each iteration is uniform. However, this assumption may be removed in the future.
>
> This scheduling option is merely a hint to the underlying execution mechanism. However, a few properties can be expected. The number of Tasks used by :dynamic scheduler is bounded by a small constant multiple of the number of available worker threads (Threads.threadpoolsize()). **Each task processes contiguous regions of the iteration space.**

Emphasis mine. The problem is that we can't then swap the underlying implementation for a greedy, work-stealing one (or anything else that load-balances on a per-item basis), since that would mean the regions/items any given task works on are no longer contiguous. In essence, this makes :dynamic not dynamic at all, and rather more like :static (except for being properly nestable across threaded regions). A sketch of the guaranteed contiguous split follows below.
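
To make the contiguity guarantee concrete, here is a hedged sketch of the kind of split that :dynamic and :static promise. This is not Base's actual partitioning code, just the property it guarantees:

```julia
# Every task gets exactly one contiguous block of the iteration space.
function contiguous_blocks(r::UnitRange, ntasks::Integer)
    len, rem = divrem(length(r), ntasks)
    blocks = UnitRange{Int}[]
    lo = first(r)
    for t in 1:ntasks
        hi = lo + len - 1 + (t <= rem ? 1 : 0)
        push!(blocks, lo:hi)
        lo = hi + 1
    end
    return blocks
end

contiguous_blocks(1:10, 4)  # [1:3, 4:6, 7:8, 9:10]

# A greedy/work-stealing scheduler cannot preserve this property: the items
# any one task ends up processing are interleaved with other tasks' items.
```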

@Seelengrab removed labels: needs docs, needs news (Nov 12, 2023)
@Seelengrab Seelengrab marked this pull request as ready for review November 12, 2023 12:26
@Seelengrab (Contributor, Author) commented

Does this need anything else?

@Seelengrab (Contributor, Author) commented

CI failures seem unrelated. Does this need anything else?

@Seelengrab (Contributor, Author) commented

No idea what the SparseArrays build/test failures mean; this PR shouldn't touch them at all. Is this just master being flaky again? Other than that, @vtjnash, if you agree I think this is good to go.

@Seelengrab (Contributor, Author) commented

What's the status here?

@vtjnash added label: merge me (PR is reviewed; merge when all tests are passing) and removed label: status: waiting for PR reviewer (Feb 5, 2024)
@IanButterworth merged commit 94fd312 into JuliaLang:master on Feb 6, 2024
@Seelengrab (Contributor, Author) commented

Thank you!

@IanButterworth removed label: merge me (Feb 6, 2024)
Labels: multithreading (Base.Threads and related functionality)
8 participants