BUG: groupby.agg returns incorrect results for uint64 cols (#26310) #26359

mahepe · 2019-05-12T19:57:36Z

closes groupby.agg (first, last, min, etc...) returns incorrect results for uint64 columns #26310
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

One way to avoid the incorrect coercion to float64 for uint64 input is to add an extra check to ensure_int64_or_float64.
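For context, here is a minimal NumPy-only sketch of why the float64 fallback is lossy. It uses the value from this PR's test and relies only on NumPy's casting rules; it does not call any pandas internals, and the variable names are illustrative.

```python
import numpy as np

# The value used in this PR's test: it fits in int64, but is stored as uint64.
value = 6903052872240755750
arr = np.array([value], dtype=np.uint64)

# A 'safe' cast from uint64 to int64 is refused, because uint64 can hold
# values that int64 cannot represent.
try:
    arr.astype('int64', casting='safe')
    safe_cast_ok = True
except TypeError:
    safe_cast_ok = False

# Falling back to float64 rounds silently: float64 has a 53-bit significand,
# so integers of this magnitude are only representable in steps of 1024.
round_tripped = int(arr.astype('float64')[0])
precision_lost = round_tripped != value
```

Here `safe_cast_ok` ends up False and `precision_lost` ends up True, which is the combination that sends a uint64 column down the float64 path and changes its values.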

jreback

thanks. this needs a bit of testing; my comments should be straightforward to handle.

jreback · 2019-05-12T20:25:40Z

doc/source/whatsnew/v0.25.0.rst

@@ -258,6 +258,7 @@ Performance Improvements
 Bug Fixes
 ~~~~~~~~~

+- Bug where groupby.agg (first, last, min, etc...) returns incorrect results for uint64 columns. (:issue:`26310`)


Move this to the Groupby section and use full references (see other entries for how to do this).

jreback · 2019-05-12T20:28:23Z

pandas/core/dtypes/common.py

    """
    try:
        return arr.astype('int64', copy=copy, casting='safe')
    except TypeError:
-        return arr.astype('float64', copy=copy)
+        try:


use a pass rather than a nested try/except
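A flat version of the fallback chain might look like the sketch below. The name `ensure_int64_or_float64_sketch` is hypothetical, and the body only illustrates the suggested control flow; it is not necessarily the code that was merged.

```python
import numpy as np


def ensure_int64_or_float64_sketch(arr, copy=False):
    """Hypothetical stand-in: try int64, then uint64, then fall back to float64."""
    try:
        return arr.astype('int64', copy=copy, casting='safe')
    except TypeError:
        pass  # not safely castable to int64; try the next candidate
    try:
        return arr.astype('uint64', copy=copy, casting='safe')
    except TypeError:
        # neither integer type is a safe target, so accept float64
        return arr.astype('float64', copy=copy)


uint_out = ensure_int64_or_float64_sketch(np.array([1, 2], dtype=np.uint64))
int_out = ensure_int64_or_float64_sketch(np.array([1, 2], dtype=np.int32))
float_out = ensure_int64_or_float64_sketch(np.array([1.5], dtype=np.float32))
```

With this shape each candidate dtype gets its own try block at the same indentation level, so adding another candidate later would not deepen the nesting.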

jreback · 2019-05-12T20:28:54Z

pandas/core/dtypes/common.py

@@ -90,9 +90,11 @@ def ensure_categorical(arr):
 def ensure_int64_or_float64(arr, copy=False):


can you type this (ArrayLike) -> np.array

jreback · 2019-05-12T20:30:06Z

pandas/tests/groupby/aggregate/test_aggregate.py

@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs():
    expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min'])

    tm.assert_index_equal(result, expected)
+
+
+def test_uint64_type_handling():


Can you add a battery of tests in pandas/tests/dtypes/test_common.py? I don't think we have this tested at all.

I did add the parameterization as asked below. I wasn't sure if you meant I should add new tests in addition to test_uint64_type_handling

jreback · 2019-05-12T20:31:12Z

pandas/tests/groupby/aggregate/test_aggregate.py

@@ -313,3 +313,14 @@ def test_order_aggregate_multiple_funcs():
    expected = pd.Index(['sum', 'max', 'mean', 'ohlc', 'min'])

    tm.assert_index_equal(result, expected)
+
+
+def test_uint64_type_handling():


Can you parameterize on first and last? Does this also deal with min and max? If so, please add parameterization for those as well. Anything else?

jreback · 2019-05-12T20:31:36Z

pandas/tests/groupby/aggregate/test_aggregate.py

+    # GH 26310
+    df1 = pd.DataFrame({'x': 6903052872240755750, 'y': [1, 2]})
+    df1.groupby('y').agg({'x': 'first'})
+    df2 = df1


use

result =
expected =
tm.assert_frame_equal(result, expected)
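Filled in, that pattern could look roughly like the sketch below. It assumes a pandas version that includes this fix, builds `expected` by hand rather than through groupby, and uses the public `pd.testing.assert_frame_equal` in place of the test suite's `tm` alias.

```python
import numpy as np
import pandas as pd

# Data mirroring the PR's test: one large value broadcast over two groups.
value = 6903052872240755750
df = pd.DataFrame({'x': value, 'y': [1, 2]})
df['x'] = df['x'].astype(np.uint64)

result = df.groupby('y').agg({'x': 'first'})

# Build the expected frame explicitly, so the check does not depend on
# groupby itself producing the right answer.
expected = pd.DataFrame(
    {'x': np.array([value, value], dtype=np.uint64)},
    index=pd.Index([1, 2], name='y'),
)

pd.testing.assert_frame_equal(result, expected)
```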

jreback · 2019-05-12T20:32:15Z

pandas/tests/groupby/aggregate/test_aggregate.py

+    df1.groupby('y').agg({'x': 'first'})
+    df2 = df1
+    df2.x = df2.x.astype(np.uint64)
+    df2.groupby('y').agg({'x': 'first'})


We are going to want to parameterize this on int64 as well (so another layer of parameterization).
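Two stacked `pytest.mark.parametrize` decorators give the cross product of dtypes and aggregations. The sketch below is one way to write it; the test name and the exact list of methods are assumptions for illustration, and it presumes a pandas version that includes this fix.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize('dtype', [np.int64, np.uint64])
@pytest.mark.parametrize('how', ['first', 'last', 'min', 'max'])
def test_agg_preserves_large_integers(dtype, how):
    # Hypothetical test name. The value fits in both int64 and uint64,
    # so the same expectation holds for either dtype.
    value = 6903052872240755750
    df = pd.DataFrame({'x': np.array([value, value], dtype=dtype),
                       'y': [1, 2]})

    result = df.groupby('y').agg({'x': how})

    expected = pd.DataFrame(
        {'x': np.array([value, value], dtype=dtype)},
        index=pd.Index([1, 2], name='y'),
    )
    pd.testing.assert_frame_equal(result, expected)
```

Each group holds a single row here, so every listed aggregation should return the input value unchanged, which keeps the expected frame identical across all eight parameter combinations.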

codecov · 2019-05-12T20:36:55Z

Codecov Report

Merging #26359 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26359      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52289    52292       +3     
==========================================
- Hits        48130    48129       -1     
- Misses       4159     4163       +4

Flag	Coverage Δ
#multiple	`90.59% <100%> (ø)`	⬆️
#single	`40.7% <0%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/common.py	`97.43% <100%> (+0.02%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e4f5bd...2e9ca6b. Read the comment docs.

codecov · 2019-05-12T20:36:55Z

Codecov Report

Merging #26359 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26359      +/-   ##
==========================================
- Coverage   91.73%   91.72%   -0.01%     
==========================================
  Files         174      174              
  Lines       50741    50746       +5     
==========================================
+ Hits        46548    46549       +1     
- Misses       4193     4197       +4

Flag	Coverage Δ
#multiple	`90.23% <100%> (ø)`	⬆️
#single	`41.7% <28.57%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/common.py	`97.45% <100%> (+0.04%)`	⬆️
pandas/core/groupby/ops.py	`95.96% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.02% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1263e1a...ce22d54. Read the comment docs.

pep8speaks · 2019-05-13T19:31:24Z

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-16 18:40:25 UTC

WillAyd · 2019-05-13T20:01:50Z

doc/source/whatsnew/v0.25.0.rst

@@ -403,6 +403,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`pandas.core.groupby.GroupBy.idxmax` and :meth:`pandas.core.groupby.GroupBy.idxmin` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)
 - Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
 - Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
+- Bug in :meth:`pandas.core.groupby.BaseGrouper._cython_operation` where incorrect results are returned for uint64 columns. (:issue:`26310`)


Whatsnew messages should be user-facing, i.e. they should only reference items exposed via the API. This isn't, but I think you can refactor to reference pandas.core.groupby.GroupBy.agg instead.

fixed, thanks!

WillAyd · 2019-05-13T20:02:29Z

pandas/tests/groupby/aggregate/test_aggregate.py

+    expected = df.groupby('y').agg({'x': how})
+    df.x = df.x.astype(dtype)
+    result = df.groupby('y').agg({'x': how})
+    result.x = result.x.astype(np.int64)


Can we add test coverage for a value that exceeds the upper limit of an int64?

It's not immediately clear to me what we would be testing for in that case. Issue #26310 is basically that there is some method f such that f(x) != f(y), even though x == y, when x is of type np.int64 and y is of type np.uint64.

If the value of y would exceed the upper limit of np.int64 you couldn't represent it as an np.int64 (?) and I'm not sure how you could pick x in that case to produce the above situation.

If I have misunderstood, please elaborate.

The problem here is going to be with unsigned integers in the range of 2^63 through 2^64-1 (the range where uint64 exceeds int64). I'm not sure if that would even work with these agg functions and don't want to open up a Pandora's box here, but if that doesn't work we might just need to reword the whatsnew.

I suggest rewording whatsnew here and dealing with that in another pull request. What do you think @WillAyd? How would you like the whatsnew to be worded?

hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.
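To illustrate why the range above the int64 maximum needs its own test design, here is a small NumPy-only sketch. It shows that the `astype(np.int64)` step used in the test above cannot be reused for such values, and that float64 cannot hold them exactly either. The wraparound shown is what NumPy's default casting does on common platforms; the variable names are illustrative.

```python
import numpy as np

int64_max = np.iinfo(np.int64).max  # 2**63 - 1
big = np.array([2**63, 2**64 - 1], dtype=np.uint64)

# Every element lies above the int64 range.
above_int64 = bool((big > np.uint64(int64_max)).all())

# A plain astype to int64 uses NumPy's default 'unsafe' casting, so the
# values wrap around to negative numbers instead of raising.
wrapped = big.astype(np.int64)

# float64 is not an exact fallback either: 2**64 - 1 rounds up to 2**64.
as_float = float(big[1])
rounds_up = as_float == 2.0 ** 64
```

A test for this range would therefore have to compare uint64 results against uint64 expectations directly, without going through int64.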

WillAyd · 2019-05-13T20:05:54Z

pandas/tests/groupby/aggregate/test_aggregate.py

+
+
+@pytest.mark.parametrize('dtype', [np.int64, np.uint64])
+@pytest.mark.parametrize('how', ['first', 'last', 'min',


Just out of curiosity how did you select these methods? The issue would cover more than just these right?

Just asking as at some point we should probably add a shared fixture for the different aggregation / transformation methods. Probably a separate PR but just curious if this subset is intentional

This subset is not intentional.

I tried to select a set that would satisfy the requests by the other reviewer. Since this PR alters a pretty generic method, I'm sure you could test for a way larger set of methods. It's just that I don't know the codebase well enough to design such tests.

jreback

lgtm. pls merge master; there is a conflict in the whatsnew. some doc checks are failing as well. ping on green.

jreback · 2019-05-16T00:17:30Z

pandas/core/dtypes/common.py

@@ -107,9 +108,18 @@ def ensure_int64_or_float64(arr, copy=False):
    out_arr : The input array cast as int64 if
              possible without overflow.
              Otherwise the input array cast to float64.
+
+    Notes
+    -------


the underline should be the same length as Notes

jreback · 2019-05-16T00:20:49Z

pandas/tests/groupby/aggregate/test_aggregate.py

+    expected = df.groupby('y').agg({'x': how})
+    df.x = df.x.astype(dtype)
+    result = df.groupby('y').agg({'x': how})
+    result.x = result.x.astype(np.int64)


hmm, input uint64's should work with your code change, but @mahepe ok to make an issue about this.

mahepe · 2019-05-17T05:10:47Z

all green @jreback !

jreback · 2019-05-18T14:45:37Z

thanks @mahepe

gfyoung added the Bug, Dtype Conversions, and Groupby labels May 12, 2019

jreback requested changes May 12, 2019

mahepe force-pushed the groupby-agg-bug branch from 2e9ca6b to 683e5f6 on May 13, 2019 19:31

mahepe force-pushed the groupby-agg-bug branch from ef0dd1e to a218943 on May 13, 2019 19:34

WillAyd requested changes May 13, 2019

mahepe force-pushed the groupby-agg-bug branch from a218943 to ca64ba4 on May 14, 2019 18:02

jreback added this to the 0.25.0 milestone May 16, 2019

jreback requested changes May 16, 2019

mahepe force-pushed the groupby-agg-bug branch from ca64ba4 to 512ad2c on May 16, 2019 16:56

BUG: groupby.agg returns incorrect results for uint64 cols (pandas-de…

ce22d54

…v#26310)

mahepe force-pushed the groupby-agg-bug branch from 512ad2c to ce22d54 on May 16, 2019 18:40

WillAyd approved these changes May 18, 2019

jreback approved these changes May 18, 2019

jreback merged commit 07cbadc into pandas-dev:master May 18, 2019
