Skip to content

MAINT: Cleanup for histogram bin estimator selection #7199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 23, 2016

Conversation

madphysicist
Copy link
Contributor

This is a followup to PR #7193.

The formerly private internal histogram bin estimators are now individual functions with docstrings and all. The functions now all return integers. Selection is done via a module-level dictionary that is only initialized once and used directly by histogram. The code is probably more legible because it is easier to trace. No really new code has been added (but PR #7090, which will need to be rebased extensively).

At the moment, the newly refactored functions are in limbo: they are not private but are also not in all or in the official docs. I believe that they contain information useful enough to go full-public, but I will leave that up to the reviewers.

@charris
Copy link
Member

charris commented Feb 6, 2016

I'm wondering if it would make sense to move all the histogram related functions into a new histogram.py module in numpy/lib/. I think there are now about 700 lines of relevant code and a number of functions, so it might help to group everything together.

@njsmith
Copy link
Member

njsmith commented Feb 6, 2016

Is there any reason to make these functions public? My bias is to just add an underscore to all of them and leave them be, rather than commit to supporting them as part of the public API.

@charris
Copy link
Member

charris commented Feb 6, 2016

@njsmith My thought also. I'm thinking of how we want to handle the documentation, somewhere we should have an explanation of what the different choices do, references are probably not enough.

@madphysicist
Copy link
Contributor Author

I think that long-term it will be useful to have the estimators as public functions. They provide a useful tool that can be used in situations where just knowing the number of bins is desirable (e. g., when calling matplotlib.hist).

Short term, I don't think I have the time to write all of the required documentation or do justice to the refactoring, so I will prepend the estimator names with underscores for now. I will keep a branch open in my fork for an eventual PR.

@madphysicist madphysicist force-pushed the histogram-estimator-dict branch from dd502ca to 58f1c15 Compare February 7, 2016 00:07
@madphysicist
Copy link
Contributor Author

@charris I was planning to evenually create a page in the docs called histogram_bin_estimation.rst or something like that dedicated to explaining the estimators, why they are there and what they do. But as I said, that is probably not going to happen in the next few days.

@njsmith
Copy link
Member

njsmith commented Feb 7, 2016

FYI I think the way calling matplotlib.hist will work is that they'll
support passing the estimator names directly (and pass them onto
np.histogram)

On Sat, Feb 6, 2016 at 2:45 PM, Mad Physicist [email protected]
wrote:

I think that long-term it will be useful to have the estimators as public
functions. They provide a useful tool that can be used in situations where
just knowing the number of bins is desirable (e. g., when calling
matplotlib.hist).

Short term, I don't think I have the time to write all of the required
documentation or do justice to the refactoring, so I will prepend the
estimator names with underscores for now. I will keep a branch open in my
fork for an eventual PR.


Reply to this email directly or view it on GitHub
#7199 (comment).

Nathaniel J. Smith -- https://vorpus.org http://vorpus.org

@madphysicist
Copy link
Contributor Author

I just figured that out too. It wasn't apparent from the version of the documentation I was looking at. Perhaps it would be worth pinging numpy-discussion to see if anyone besides me would find something like this useful?

@homu
Copy link
Contributor

homu commented Feb 7, 2016

☔ The latest upstream changes (presumably #7181) made this pull request unmergeable. Please resolve the merge conflicts.

@madphysicist madphysicist force-pushed the histogram-estimator-dict branch from 58f1c15 to d005f60 Compare February 7, 2016 18:27
@madphysicist
Copy link
Contributor Author

Squawk.

@homu
Copy link
Contributor

homu commented Feb 13, 2016

☔ The latest upstream changes (presumably #7090) made this pull request unmergeable. Please resolve the merge conflicts.

@madphysicist madphysicist force-pushed the histogram-estimator-dict branch from d005f60 to ecc78ba Compare February 13, 2016 19:18
@homu
Copy link
Contributor

homu commented Feb 13, 2016

☔ The latest upstream changes (presumably #7241) made this pull request unmergeable. Please resolve the merge conflicts.

@homu
Copy link
Contributor

homu commented Feb 15, 2016

☔ The latest upstream changes (presumably #7243) made this pull request unmergeable. Please resolve the merge conflicts.

@madphysicist madphysicist force-pushed the histogram-estimator-dict branch 2 times, most recently from 639274d to e73bc13 Compare February 16, 2016 05:19
@madphysicist
Copy link
Contributor Author

I have fixed up this branch to include the contents of #7243. The main change is to put all the range and weights back into histogram itself instead of having them in a separate function, as we discussed. I think the change is simple, but I would still feel better if @nayyarv took a quick glance at this. I will also shoot off an email about adding weights to percentile and median, since I think that is something that really would help round off this task.

data and inversely proportional to the cube root of data size
(asymptotically optimal).
"""
h = (24 * np.pi**0.5 / x.size)**(1.0 / 3) * np.std(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly sure where the formula changed from 3.5 to this, but the above evaluates to roughly 3.44 which is good enough.

Since your PR has a really sucky diff, it's far too easy to miss things like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I changed it to this and now apparently clobbered it back. And reverted it again now. I traced the origins of the "3.5" from the Wikipedia article in this paper to the formula cbrt(24 * sqrt(pi)) (formula for h-hat-nought on page 5).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a very careful diff because of this and I am pretty sure that I have reverted all of my inadvertent clobberings from before.

Private function with superfluous checks was removed. Estimator
Estimator functions are now semi-public and selection is simplified
within histogram itself.
@madphysicist madphysicist force-pushed the histogram-estimator-dict branch from e73bc13 to 90adbbf Compare February 16, 2016 12:51
@nayyarv
Copy link
Contributor

nayyarv commented Feb 16, 2016

I'm glad to see that this has been getting active development. I was actually kind of shocked to see changes made to it, I sort of assumed my contribution would sit in a corner and get used by viz libraries, so thanks for that @madphysicist.

Things I like:

  1. Having the functions exposed (even as private functions should allow for them to be called by anyone who knows what they are and may want to use them) is nice.
  2. The cleaner filtering and decision logic at the end is much cleaner in general, rather than having some at the start and the rest at the end.
  3. The docs are more standardised and clear.

Not so sold on:

  1. Using the if bins not in _hist_bin_selectors: raise ValueError("{0} not a valid estimator..") style of checking is very C like, not very pythonic - exception handling is preferred (ask for forgiveness), and is by far the most common paradigm seen. However, numpy does follow the 'look before you leap' format a lot, so it's not a huge thing

Things I don't like

  1. Exposing the functions in numpy.lib.function_base. There are already ~44 functions in function_base and it's ~4500 lines long or so. These functions are used nowhere else and are too simple to be private functions in function_base. Ideally a hist_optim.py would be the usual design choice, but I don't think that should happen either in this case as numpy already has a fair few files and creating new modules for everything is also bad. I'd actually like to see something like a misc for stuff like this (where we could expose these more publicly), sort of like scipy's catchall, but this also makes arguing for parsimony more difficult in general.
    My original function was written to mimic a full module, since I never wanted the simple one liners in function_base since they served no other purpose, but I didn't want to write extensive if..elif blocks for the switching logic and I wanted the estimators to be self documented as much as possible.
  2. Having the dictionary instantiation in function_base.py. This will be instantiated every time import numpy is run, which is a waste since a majority of sessions aren't going to use np.histogram. I think the dictionary instantiation should only happen when a user requests an auto method - I'd rather have more latency on the few occasions it happens than small latency every time for everyone.

Overall, I'd rather leave _hist_optim_numbins_estimator's structure alone for the time being - i.e. the functions within the function as well as the dictionary instantiation inside the function. I don't think the restructure is an upgrade from the existing and while theoretically sound, isn't overly pragmatic.

However, I like the upgraded docs and the cleaner logic flow you've put together, I think that's a very welcome addition.

@madphysicist
Copy link
Contributor Author

I started working on this PR in response to comments made by @charris to #7193. I was not aware of the work you were doing at the time, but I am glad I had a chance to see it.

I too would like to see these functions exposed one day, since I agree that they would be useful for someone that knows what they are. That may or may not happen incrementally. Response on the mailing list was overwhelmingly negative the first time I proposed the idea. For now, someone that knows what they really are will have to dig into the code and use the underscored name, since that is accessible in Python, even if it is a hack.

Adding the functions to a separate file is not unreasonable. I think having a misc module will cause maintenance problems later, but I like the idea Charles threw in (#7199 (comment)): that we could have a separate module for all the histogramming functions and their infrastructure.

It is better to have the dictionary instantiated at the module level because it will only happen for the first call to import numpy. The only way to get it to execute that code would be to use reload in Py2 and importlib.reload in Py3.

@madphysicist
Copy link
Contributor Author

squawk

@njsmith
Copy link
Member

njsmith commented Feb 23, 2016

So to be clear, at this point this PR has no change in actual behavior, it just moves the internal functions out into global _underscore_prefixed functions?

@madphysicist
Copy link
Contributor Author

Yes. There are no externally visible changes as far as I am aware beyond possibly import/execution times.

@njsmith
Copy link
Member

njsmith commented Feb 23, 2016

LGTM then, thanks @madphysicist!

njsmith added a commit that referenced this pull request Feb 23, 2016
MAINT: Cleanup for histogram bin estimator selection
@njsmith njsmith merged commit 9fe706b into numpy:master Feb 23, 2016
@madphysicist madphysicist deleted the histogram-estimator-dict branch February 23, 2016 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants