WIP: Use calloc for zeros #19852

Closed
wants to merge 6 commits

Conversation

@danielmatz (Contributor) commented Jan 4, 2017

This resolves #130. The main goal is to use calloc when creating large arrays of zeros (only opting in to this behavior for appropriate eltypes). But I had to make a number of other changes along the way:

  • Small arrays of zeros are allocated as before, but zeroed using memset
  • Renamed jl_malloc (and friends) to jl_gc_malloc to disambiguate from jl_malloc_aligned, which allocates memory that isn't gc counted
  • Added jl_gc_malloc_aligned (and friends)
  • Removed jl_gc_managed_malloc, the old gc counted, aligned malloc, which didn't have a matching calloc
  • Freed up a flag bit in jl_array_flags_t by tracking whether the memory is aligned in mallocarray_t instead
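The core idea, sketched below in Julia purely for illustration: calloc hands back memory the operating system already guarantees to be zero (typically via lazily mapped zero pages), so a large zeros buffer never needs an explicit memset pass. The actual change lives in Julia's C runtime (the jl_gc_calloc family), not in Julia code, and the zeros_via_calloc helper and eltype restriction here are made up for the example.

```julia
# Sketch only, not the PR's implementation: the PR does this inside the C
# runtime. Restricted to eltypes whose all-zero bit pattern is a valid zero.
function zeros_via_calloc(::Type{T}, n::Integer) where {T<:Union{Float64,Float32,Int}}
    p = Libc.calloc(n, sizeof(T))        # OS returns already-zeroed memory
    p == C_NULL && throw(OutOfMemoryError())
    # own = true hands the buffer to Julia's GC, which frees it with Libc.free
    return unsafe_wrap(Array, Ptr{T}(p), n; own = true)
end

zeros_via_calloc(Float64, 10^6)          # large case: no explicit 8 MB memset
```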

Before I submit this, I'd still like to do the following:

  • Check that these changes do indeed make things faster! (need to check both large and small arrays)
  • Add special casing to jl_gc_malloc_aligned and jl_gc_calloc_aligned
  • Add checks on the alignment argument (?)
  • When using jl_gc_realloc_aligned, be smarter and only memcpy the memory being used by the array
  • Double check the zeros tests
  • Use calloc for more eltypes

@danielmatz force-pushed the use_calloc_for_zeros branch from b0f306e to 04a6f84 on January 5, 2017
@kshyatt added the performance (Must go faster) label on Jan 5, 2017
@danielmatz force-pushed the use_calloc_for_zeros branch 2 times, most recently from b280137 to 984051a on January 23, 2017
@danielmatz (Contributor, Author) commented:

I know I still need to update this branch to resolve the conflicts, but I have also been meaning to post some benchmark results, so let me do that now; I'll find time for the conflicts soon. I also need to figure out why the tests fail on systems other than macOS.

I benchmarked both zeros(Float64, s) and Array(Float64, s) for a variety of sizes. The 2048 and 2049 sizes sit right at the breakpoint in `_new_array_` where it switches from allocating the buffer inline with the array to allocating it separately with malloc or calloc.

I'm comparing three builds: master, the point in my PR where array creation has been switched to jl_gc_malloc, and the full PR, which uses calloc for zeros.
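A rough sketch of how such timings could be collected (assuming BenchmarkTools; the loop, sizes, and printing are illustrative, and `Array{Float64}(undef, s)` is the modern spelling of the `Array(Float64, s)` call used here):

```julia
using BenchmarkTools, Printf

for s in (10, 2048, 2049, 10_000, 100_000, 1_000_000)
    t_alloc = @benchmark Array{Float64}(undef, $s)  # uninitialized allocation
    t_zeros = @benchmark zeros(Float64, $s)         # zero-filled allocation
    @printf("%8d  Array: %10.1f / %10.1f ns   zeros: %10.1f / %10.1f ns\n",
            s, minimum(t_alloc).time, median(t_alloc).time,
            minimum(t_zeros).time, median(t_zeros).time)
end
```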

Array(Float64, s):

| size | master (min / median) | gc malloc change (min / median) | full PR (min / median) |
| ---: | --- | --- | --- |
| 10 | 110 ns / 120 ns | 110 ns / 130 ns | 110 ns / 120 ns |
| 2048 | 240 ns / 1.7 µs | 250 ns / 1.4 µs | 240 ns / 1.3 µs |
| 2049 | 240 ns / 1.5 µs | 270 ns / 2.1 µs | 390 ns / 2.1 µs |
| 10000 | 1.4 µs / 5.5 µs | 600 ns / 2.4 µs | 760 ns / 2.4 µs |
| 100000 | 7.3 µs / 8.7 µs | 3.5 µs / 3.9 µs | 3.6 µs / 3.9 µs |
| 1000000 | 300 µs / 340 µs | 1.8 µs / 4.1 µs | 2.4 µs / 4.1 µs |

zeros(Float64, s):

| size | master (min / median) | gc malloc change (min / median) | full PR (min / median) |
| ---: | --- | --- | --- |
| 10 | 42 ns / 56 ns | 43 ns / 110 ns | 38 ns / 100 ns |
| 2048 | 1.2 µs / 1.9 µs | 1.8 µs / 7.6 µs | 1.9 µs / 7.5 µs |
| 2049 | 1.5 µs / 2.8 µs | 2.2 µs / 7.5 µs | 2.3 µs / 7.5 µs |
| 10000 | 6.1 µs / 13 µs | 10 µs / 34 µs | 9.9 µs / 34 µs |
| 100000 | 91 µs / 230 µs | 100 µs / 370 µs | 3.4 µs / 3.6 µs |
| 1000000 | 1.0 ms / 2.2 ms | 1.1 ms / 3.6 ms | 3.6 µs / 4.0 µs |

Did you all expect the gc malloc change to so drastically improve the times for large arrays?

Is the slight degradation for small malloced arrays (size = 2049) OK?

The good news is that using calloc does help a lot for large arrays of zeros.

@danielmatz force-pushed the use_calloc_for_zeros branch from 984051a to d512d83 on March 11, 2017
@ViralBShah (Member) commented:

Another good thing to check is the performance of the shootout and micro benchmarks. The slight degradation for 2049 seems OK to me, given all the other gains.

@danielmatz (Contributor, Author) commented:

Here are the benchmark results:

(Any["micro", "quicksort"], TrialJudgement(+0.17% => invariant))
(Any["micro", "parseint"], TrialJudgement(-1.27% => invariant))
(Any["micro", "randmatstat"], TrialJudgement(+1.40% => invariant))
(Any["micro", "pisum"], TrialJudgement(+0.02% => invariant))
(Any["micro", "fib"], TrialJudgement(-0.73% => invariant))
(Any["micro", "randmatmul"], TrialJudgement(+9.96% => invariant))
(Any["micro", "mandel"], TrialJudgement(+0.42% => invariant))
(Any["shootout", "k_nucleotide"], TrialJudgement(+4.76% => invariant))
(Any["shootout", "fasta"], TrialJudgement(+1.26% => invariant))
(Any["shootout", "spectralnorm"], TrialJudgement(-0.00% => invariant))
(Any["shootout", "revcomp"], TrialJudgement(+1.61% => invariant))
(Any["shootout", "pidigits"], TrialJudgement(+49.25% => regression))
(Any["shootout", "binary_trees"], TrialJudgement(+11.79% => invariant))
(Any["shootout", "nbody"], TrialJudgement(+0.01% => invariant))
(Any["shootout", "regex_dna"], TrialJudgement(+8.55% => invariant))
(Any["shootout", "meteor_contest"], TrialJudgement(+7.98% => invariant))
(Any["shootout", "mandelbrot"], TrialJudgement(+0.28% => invariant))
(Any["shootout", "fannkuch"], TrialJudgement(-0.63% => invariant))
(Any["shootout", "nbody_vec"], TrialJudgement(+0.10% => invariant))

The pidigits regression is caused by spending a lot more time in the BigInt() constructor, which is getting called by <<(::BigInt, ::UInt64). I'm not sure yet why my changes would cause that...
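The judgements above are BenchmarkTools TrialJudgement values; as a minimal sketch of the mechanism, one can be produced by comparing two trial estimates (both trials below are run in the same session purely for illustration):

```julia
using BenchmarkTools

before = @benchmark zeros(Float64, 10^6)   # e.g. results collected on master
after  = @benchmark zeros(Float64, 10^6)   # e.g. results collected on this PR

# judge classifies the change between two trial estimates as
# :improvement, :regression, or :invariant relative to a tolerance
println(judge(minimum(after), minimum(before)))
```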

@vtjnash added this to the 1.0 milestone on May 1, 2017
@JeffBezanson self-assigned this on Jul 13, 2017
@JeffBezanson (Member) commented:

Looks good. I'm going to rebase and take all but the last two commits, and implement #9147 instead, which will be simpler since we won't need the extra argument to request zeros.

@JeffBezanson (Member) commented:

@danielmatz Please review #22953 if you get a chance. I think that part should be ok to merge and you can submit future PRs to do the follow-up items if you want.
