BUG: Issues with groupby ewm and times #40951

Closed
@stevenschaerer

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

This refers to the code currently on master at 84d9c5e (2021-04-14). The latest released version of pandas is also affected, but the symptoms there are different.

import pandas as pd

halflife = "23 days"
baseline_df = pd.DataFrame(
    {
        "A": ["a", "b", "a", "b", "a", "b"],
        "B": [0, 0, 1, 1, 2, 2],
        "C": pd.to_datetime(
            [
                "2020-01-01",
                "2020-01-01",
                "2020-01-10",
                "2020-01-02",
                "2020-01-23",
                "2020-01-03",
            ]
        )
    }
)

# Default (cython) engine
cython_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean()
print("cython")
print(cython_result)

# numba engine
numba_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean(engine="numba")
print("numba")
print(numba_result)

# Reference: run ewm on each group's values separately, with that group's own timestamps
expected_result_a = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife, times=pd.to_datetime(["2020-01-01", "2020-01-10", "2020-01-23"])
).mean()
expected_result_b = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife, times=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
).mean()
print("expected")
print("  group a")
print(expected_result_a)
print("  group b")
print(expected_result_b)

Output:

cython
            B
A            
a 0  0.000000
  2  0.500000
  4  1.094088
b 1  0.000000
  3  0.500000
  5  1.094088
numba
            B
A            
a 0  0.000000
  2  0.666667
  4  1.428571
b 1  0.000000
  3  0.666667
  5  1.428571
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088
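
For reference, the "expected" values can be reproduced without pandas. The sketch below assumes the default adjust=True weighting, where each past observation is weighted by 0.5 ** (elapsed_time / halflife) relative to the current row. The helper names time_weighted_ewm_mean and stamps are hypothetical and exist only for this illustration; this is not the pandas implementation.

```python
from datetime import datetime


def time_weighted_ewm_mean(values, times, halflife_days):
    """Adjusted EW mean: an observation's weight halves every halflife_days."""
    out = []
    for t, now in enumerate(times):
        # Weight of each observation up to and including row t,
        # based on how long before `now` it was recorded.
        weights = [
            0.5 ** ((now - past).total_seconds() / 86400.0 / halflife_days)
            for past in times[: t + 1]
        ]
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        out.append(weighted_sum / sum(weights))
    return out


def stamps(*dates):
    # Hypothetical convenience helper: ISO date strings -> datetime objects.
    return [datetime.fromisoformat(d) for d in dates]


group_a = time_weighted_ewm_mean(
    [0, 1, 2], stamps("2020-01-01", "2020-01-10", "2020-01-23"), 23
)
group_b = time_weighted_ewm_mean(
    [0, 1, 2], stamps("2020-01-01", "2020-01-02", "2020-01-03"), 23
)
print("group a", [round(x, 6) for x in group_a])
print("group b", [round(x, 6) for x in group_b])
```

Under that assumption the two lists should agree with the "group a" and "group b" columns above to the printed precision.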

Problem description

There are three problems with the current groupby ewm implementation when times is not None:

  1. The numba implementation ignores the times.
  2. The cython implementation does not use the correct times/deltas in aggregations.pyx when there are multiple groups.
  3. If the groups are non-trivial, the time vector and the values become out of sync.

I have a branch that fixes these issues and will link to it shortly.
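
As a rough check on the first point, the numba figures in the output are what you get if every row is treated as equally spaced and each step applies a decay of 0.5. The sketch below is only an inference from the printed numbers, not a reading of the numba kernel; the function name fixed_decay_ewm_mean and the decay value 0.5 are assumptions made for illustration.

```python
def fixed_decay_ewm_mean(values, decay=0.5):
    """Adjusted EW mean with a constant per-row decay, i.e. timestamps ignored."""
    out = []
    for t in range(len(values)):
        # Row i is (t - i) steps in the past, regardless of its actual timestamp.
        weights = [decay ** (t - i) for i in range(t + 1)]
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        out.append(weighted_sum / sum(weights))
    return out


ignored_times = fixed_decay_ewm_mean([0, 1, 2])
print([round(x, 6) for x in ignored_times])
```

Both groups hold the values 0, 1, 2, so a computation that drops the timestamps yields the same numbers for "a" and "b", which is the pattern seen in the numba output.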

Expected Output

cython
            B
A            
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
numba
            B
A            
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088

Output of pd.show_versions()

Metadata

Labels: Bug, Window (rolling, ewma, expanding)