-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
MAINT: Cleanup for histogram bin estimator selection #7199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm wondering if it would make sense to move all the histogram related functions into a new |
Is there any reason to make these functions public? My bias is to just add an underscore to all of them and leave them be, rather than commit to supporting them as part of the public API. |
@njsmith My thought also. I'm thinking of how we want to handle the documentation, somewhere we should have an explanation of what the different choices do, references are probably not enough. |
I think that long-term it will be useful to have the estimators as public functions. They provide a useful tool that can be used in situations where just knowing the number of bins is desirable (e. g., when calling Short term, I don't think I have the time to write all of the required documentation or do justice to the refactoring, so I will prepend the estimator names with underscores for now. I will keep a branch open in my fork for an eventual PR. |
dd502ca
to
58f1c15
Compare
@charris I was planning to evenually create a page in the docs called |
FYI I think the way calling matplotlib.hist will work is that they'll On Sat, Feb 6, 2016 at 2:45 PM, Mad Physicist [email protected]
Nathaniel J. Smith -- https://vorpus.org http://vorpus.org |
I just figured that out too. It wasn't apparent from the version of the documentation I was looking at. Perhaps it would be worth pinging numpy-discussion to see if anyone besides me would find something like this useful? |
☔ The latest upstream changes (presumably #7181) made this pull request unmergeable. Please resolve the merge conflicts. |
58f1c15
to
d005f60
Compare
Squawk. |
☔ The latest upstream changes (presumably #7090) made this pull request unmergeable. Please resolve the merge conflicts. |
d005f60
to
ecc78ba
Compare
☔ The latest upstream changes (presumably #7241) made this pull request unmergeable. Please resolve the merge conflicts. |
ecc78ba
to
b51a8f1
Compare
b51a8f1
to
43292aa
Compare
☔ The latest upstream changes (presumably #7243) made this pull request unmergeable. Please resolve the merge conflicts. |
639274d
to
e73bc13
Compare
I have fixed up this branch to include the contents of #7243. The main change is to put all the |
data and inversely proportional to the cube root of data size | ||
(asymptotically optimal). | ||
""" | ||
h = (24 * np.pi**0.5 / x.size)**(1.0 / 3) * np.std(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not overly sure where the formula changed from 3.5 to this, but the above evaluates to roughly 3.44 which is good enough.
Since your PR has a really sucky diff, it's far too easy to miss things like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out. I changed it to this and now apparently clobbered it back. And reverted it again now. I traced the origins of the "3.5" from the Wikipedia article in this paper to the formula cbrt(24 * sqrt(pi))
(formula for h-hat-nought on page 5).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a very careful diff because of this and I am pretty sure that I have reverted all of my inadvertent clobberings from before.
Private function with superfluous checks was removed. Estimator Estimator functions are now semi-public and selection is simplified within histogram itself.
e73bc13
to
90adbbf
Compare
I'm glad to see that this has been getting active development. I was actually kind of shocked to see changes made to it, I sort of assumed my contribution would sit in a corner and get used by viz libraries, so thanks for that @madphysicist. Things I like:
Not so sold on:
Things I don't like
Overall, I'd rather leave However, I like the upgraded docs and the cleaner logic flow you've put together, I think that's a very welcome addition. |
I started working on this PR in response to comments made by @charris to #7193. I was not aware of the work you were doing at the time, but I am glad I had a chance to see it. I too would like to see these functions exposed one day, since I agree that they would be useful for someone that knows what they are. That may or may not happen incrementally. Response on the mailing list was overwhelmingly negative the first time I proposed the idea. For now, someone that knows what they really are will have to dig into the code and use the underscored name, since that is accessible in Python, even if it is a hack. Adding the functions to a separate file is not unreasonable. I think having a misc module will cause maintenance problems later, but I like the idea Charles threw in (#7199 (comment)): that we could have a separate module for all the histogramming functions and their infrastructure. It is better to have the dictionary instantiated at the module level because it will only happen for the first call to |
squawk |
So to be clear, at this point this PR has no change in actual behavior, it just moves the internal functions out into global |
Yes. There are no externally visible changes as far as I am aware beyond possibly import/execution times. |
LGTM then, thanks @madphysicist! |
MAINT: Cleanup for histogram bin estimator selection
This is a followup to PR #7193.
The formerly private internal histogram bin estimators are now individual functions with docstrings and all. The functions now all return integers. Selection is done via a module-level dictionary that is only initialized once and used directly by
histogram
. The code is probably more legible because it is easier to trace. No really new code has been added (but PR #7090, which will need to be rebased extensively).At the moment, the newly refactored functions are in limbo: they are not private but are also not in all or in the official docs. I believe that they contain information useful enough to go full-public, but I will leave that up to the reviewers.