Skip to content

BUG: groupby.agg returns incorrect results for uint64 cols (#26310) #26359

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 18, 2019

Conversation

mahepe
Copy link
Contributor

@mahepe mahepe commented May 12, 2019

A way to avoid incorrect coercion to float64 in the case of uint64 input is to introduce an additional check to ensure_int64_or_float64

@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby labels May 12, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. this needs a bit of testing; my comments should be straightforward to handle.

@@ -258,6 +258,7 @@ Performance Improvements
Bug Fixes
~~~~~~~~~

- Bug where groupby.agg (first, last, min, etc...) returns incorrect results for uint64 columns. (:issue:`26310`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to grouping use full references (see other issues on how to do this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""
try:
return arr.astype('int64', copy=copy, casting='safe')
except TypeError:
return arr.astype('float64', copy=copy)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a pass rather than a nested try/except

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -90,9 +90,11 @@ def ensure_categorical(arr):
def ensure_int64_or_float64(arr, copy=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type this (ArrayLike) -> np.array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs():
expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min'])

tm.assert_index_equal(result, expected)


def test_uint64_type_handling():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a battery of tests in pandas/tests/dtypes/test_common.py I don't think we have this tested at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add the parameterization as asked below. I wasn't sure if you meant I should add new tests in addition to test_uint64_type_handling

@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs():
expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min'])

tm.assert_index_equal(result, expected)


def test_uint64_type_handling():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize on first, last. Does this also deal with min, max? if so pls add parameterizeation as well. Anything else?

# GH 26310
df1 = pd.DataFrame({'x': 6903052872240755750, 'y': [1, 2]})
df1.groupby('y').agg({'x': 'first'})
df2 = df1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

result =
expected =
tm.assert_frame_equal(result, expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

df1.groupby('y').agg({'x': 'first'})
df2 = df1
df2.x = df2.x.astype(np.uint64)
df2.groupby('y').agg({'x': 'first'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are going to want to parameterize this on an int64 as well (so another later of paramaterization)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26359 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26359      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52289    52292       +3     
==========================================
- Hits        48130    48129       -1     
- Misses       4159     4163       +4
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 40.7% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/common.py 97.43% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e4f5bd...2e9ca6b. Read the comment docs.

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26359 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26359      +/-   ##
==========================================
- Coverage   91.73%   91.72%   -0.01%     
==========================================
  Files         174      174              
  Lines       50741    50746       +5     
==========================================
+ Hits        46548    46549       +1     
- Misses       4193     4197       +4
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.7% <28.57%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/common.py 97.45% <100%> (+0.04%) ⬆️
pandas/core/groupby/ops.py 95.96% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1263e1a...ce22d54. Read the comment docs.

@mahepe mahepe force-pushed the groupby-agg-bug branch from 2e9ca6b to 683e5f6 Compare May 13, 2019 19:31
@pep8speaks
Copy link

pep8speaks commented May 13, 2019

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-16 18:40:25 UTC

@mahepe mahepe force-pushed the groupby-agg-bug branch from ef0dd1e to a218943 Compare May 13, 2019 19:34
@@ -403,6 +403,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
- Bug in :meth:`pandas.core.groupby.BaseGrouper._cython_operation` where incorrect results are returned for uint64 columns. (:issue:`26310`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatsnew messages should be user-facing, i.e. only references items exposed via the API. This isn't, but I think you can refactor to reference pandas.core.groupby.GroupBy.agg instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks!

expected = df.groupby('y').agg({'x': how})
df.x = df.x.astype(dtype)
result = df.groupby('y').agg({'x': how})
result.x = result.x.astype(np.int64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add test coverage for a value that exceeds the upper limit of an int64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately clear to me what we would be testing for in that case. Issue #26310 is basically that there is some method f such that f(x) != f(y), even though x == y, when x is of type np.int64 and y is of type np.uint64.

If the value of y would exceed the upper limit of np.int64 you couldn't represent it as an np.int64 (?) and I'm not sure how you could pick x in that case to produce the above situation.

If I have misunderstood, please elaborate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is going to be with unsigned integers in the range of 263 through 264-1 (range where uint64 exceeds int64). I'm not sure if that would even work with these agg functions and don't want to open up a Pandora's box here but if that doesn't work we might just need to reword the whatsnew

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest rewording whatsnew here and dealing with that in another pull request. What do you think @WillAyd? How would you like the whatsnew to be worded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.



@pytest.mark.parametrize('dtype', [np.int64, np.uint64])
@pytest.mark.parametrize('how', ['first', 'last', 'min',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity how did you select these methods? The issue would cover more than just these right?

Just asking as at some point we should probably add a shared fixture for the different aggregation / transformation methods. Probably a separate PR but just curious if this subset is intentional

Copy link
Contributor Author

@mahepe mahepe May 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This subset is not intentional.

I tried to select a set that would satisfy the requests by the other reviewer. Since this PR alters a pretty generic method, I'm sure you could test for a way larger set of methods. It's just that I don't know the codebase well enough to design such tests.

@mahepe mahepe force-pushed the groupby-agg-bug branch from a218943 to ca64ba4 Compare May 14, 2019 18:02
@jreback jreback added this to the 0.25.0 milestone May 16, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. pls merge master; there is a conflict in the whatsnew. some doc checks are failing as well. ping on green.

@@ -107,9 +108,18 @@ def ensure_int64_or_float64(arr, copy=False):
out_arr : The input array cast as int64 if
possible without overflow.
Otherwise the input array cast to float64.

Notes
-------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the underline should be the same length as Notes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

expected = df.groupby('y').agg({'x': how})
df.x = df.x.astype(dtype)
result = df.groupby('y').agg({'x': how})
result.x = result.x.astype(np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.

@mahepe mahepe force-pushed the groupby-agg-bug branch from ca64ba4 to 512ad2c Compare May 16, 2019 16:56
@mahepe mahepe force-pushed the groupby-agg-bug branch from 512ad2c to ce22d54 Compare May 16, 2019 18:40
@mahepe
Copy link
Contributor Author

mahepe commented May 17, 2019

all green @jreback !

@jreback jreback merged commit 07cbadc into pandas-dev:master May 18, 2019
@jreback
Copy link
Contributor

jreback commented May 18, 2019

thanks @mahepe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby.agg (first, last, min, etc...) returns incorrect results for uint64 columns
5 participants