-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: groupby.agg returns incorrect results for uint64 cols (#26310) #26359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. this needs a bit of testing; my comments should be straightforward to handle.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -258,6 +258,7 @@ Performance Improvements | |||
Bug Fixes | |||
~~~~~~~~~ | |||
|
|||
- Bug where groupby.agg (first, last, min, etc...) returns incorrect results for uint64 columns. (:issue:`26310`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to grouping use full references (see other issues on how to do this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/dtypes/common.py
Outdated
""" | ||
try: | ||
return arr.astype('int64', copy=copy, casting='safe') | ||
except TypeError: | ||
return arr.astype('float64', copy=copy) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a pass rather than a nested try/except
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/dtypes/common.py
Outdated
@@ -90,9 +90,11 @@ def ensure_categorical(arr): | |||
def ensure_int64_or_float64(arr, copy=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type this (ArrayLike) -> np.array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs(): | |||
expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min']) | |||
|
|||
tm.assert_index_equal(result, expected) | |||
|
|||
|
|||
def test_uint64_type_handling(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a battery of tests in pandas/tests/dtypes/test_common.py I don't think we have this tested at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did add the parameterization as asked below. I wasn't sure if you meant I should add new tests in addition to test_uint64_type_handling
@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs(): | |||
expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min']) | |||
|
|||
tm.assert_index_equal(result, expected) | |||
|
|||
|
|||
def test_uint64_type_handling(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize on first, last. Does this also deal with min, max? if so pls add parameterizeation as well. Anything else?
# GH 26310 | ||
df1 = pd.DataFrame({'x': 6903052872240755750, 'y': [1, 2]}) | ||
df1.groupby('y').agg({'x': 'first'}) | ||
df2 = df1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
result =
expected =
tm.assert_frame_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
df1.groupby('y').agg({'x': 'first'}) | ||
df2 = df1 | ||
df2.x = df2.x.astype(np.uint64) | ||
df2.groupby('y').agg({'x': 'first'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are going to want to parameterize this on an int64 as well (so another later of paramaterization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Codecov Report
@@ Coverage Diff @@
## master #26359 +/- ##
==========================================
- Coverage 92.04% 92.03% -0.01%
==========================================
Files 175 175
Lines 52289 52292 +3
==========================================
- Hits 48130 48129 -1
- Misses 4159 4163 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26359 +/- ##
==========================================
- Coverage 91.73% 91.72% -0.01%
==========================================
Files 174 174
Lines 50741 50746 +5
==========================================
+ Hits 46548 46549 +1
- Misses 4193 4197 +4
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -403,6 +403,7 @@ Groupby/Resample/Rolling | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`) | |||
- Bug in :meth:`pandas.core.groupby.BaseGrouper._cython_operation` where incorrect results are returned for uint64 columns. (:issue:`26310`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatsnew messages should be user-facing, i.e. only references items exposed via the API. This isn't, but I think you can refactor to reference pandas.core.groupby.GroupBy.agg
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks!
expected = df.groupby('y').agg({'x': how}) | ||
df.x = df.x.astype(dtype) | ||
result = df.groupby('y').agg({'x': how}) | ||
result.x = result.x.astype(np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add test coverage for a value that exceeds the upper limit of an int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not immediately clear to me what we would be testing for in that case. Issue #26310 is basically that there is some method f
such that f(x) != f(y)
, even though x == y
, when x
is of type np.int64
and y
is of type np.uint64
.
If the value of y
would exceed the upper limit of np.int64
you couldn't represent it as an np.int64
(?) and I'm not sure how you could pick x
in that case to produce the above situation.
If I have misunderstood, please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is going to be with unsigned integers in the range of 263 through 264-1 (range where uint64 exceeds int64). I'm not sure if that would even work with these agg functions and don't want to open up a Pandora's box here but if that doesn't work we might just need to reword the whatsnew
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest rewording whatsnew here and dealing with that in another pull request. What do you think @WillAyd? How would you like the whatsnew to be worded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.
|
||
|
||
@pytest.mark.parametrize('dtype', [np.int64, np.uint64]) | ||
@pytest.mark.parametrize('how', ['first', 'last', 'min', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity how did you select these methods? The issue would cover more than just these right?
Just asking as at some point we should probably add a shared fixture for the different aggregation / transformation methods. Probably a separate PR but just curious if this subset is intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This subset is not intentional.
I tried to select a set that would satisfy the requests by the other reviewer. Since this PR alters a pretty generic method, I'm sure you could test for a way larger set of methods. It's just that I don't know the codebase well enough to design such tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. pls merge master; there is a conflict in the whatsnew. some doc checks are failing as well. ping on green.
pandas/core/dtypes/common.py
Outdated
@@ -107,9 +108,18 @@ def ensure_int64_or_float64(arr, copy=False): | |||
out_arr : The input array cast as int64 if | |||
possible without overflow. | |||
Otherwise the input array cast to float64. | |||
|
|||
Notes | |||
------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the underline should be the same length as Notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
expected = df.groupby('y').agg({'x': how}) | ||
df.x = df.x.astype(dtype) | ||
result = df.groupby('y').agg({'x': how}) | ||
result.x = result.x.astype(np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.
all green @jreback ! |
thanks @mahepe |
git diff upstream/master -u -- "*.py" | flake8 --diff
A way to avoid incorrect coercion to float64 in the case of uint64 input is to introduce an additional check to
ensure_int64_or_float64