BUG: {expanding,rolling}_{cov,corr} functions between objects with different index sets #7512

Closed
@seth-p

related #7514

There appears to be a bug in the expanding_{cov,corr} functions when dealing with two objects with different indexes.

First, there is a problem with series. See example below, where I would expect expanding_corr(s1, s2) to produce the result produced by expanding_corr(s1, s2a).

The problem is due to the fact that expanding_corr is implemented in terms of rolling_corr with window = max(len(arg1), len(arg2)), but then rolling_corr resets window to window = min(window, len(arg1), len(arg2)). The end result is that window = min(len(arg1), len(arg2)) -- and these are the raw, unaligned arg1 and arg2. Thus in the expanding_corr(s1, s2) example below, window=2, and so when calculating the third row (index=2) it tries to calculate the correlation between [2, 3] and [NaN, 3], producing NaN -- rather than calculating the correlation between [1, 2, 3] and [1, NaN, 3] and producing 1.

The solution would appear to be simply deleting the window = min(window, len(arg1), len(arg2)) line from rolling_cov and rolling_corr, as I believe the rolling_* functions run fine with a window larger than the data, or at least replacing it with window = min(window, max(len(arg1), len(arg2))).
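The window arithmetic described above can be sketched in plain Python. The function names are hypothetical and only mirror the two expressions quoted in this issue; they are not pandas functions.

```python
def window_as_currently_computed(len1, len2):
    # expanding_* requests a window spanning the longer input ...
    window = max(len1, len2)
    # ... and rolling_cov/rolling_corr then clamp it against both raw lengths.
    return min(window, len1, len2)


def window_with_proposed_clamp(len1, len2):
    window = max(len1, len2)
    # Proposed replacement: clamp only against the longer input.
    return min(window, max(len1, len2))


# s1 has 3 rows and s2 has 2 rows (before any alignment).
print(window_as_currently_computed(3, 2))  # 2
print(window_with_proposed_clamp(3, 2))    # 3
```

With the current clamp the "expanding" window is only 2 rows wide, so the first row has already dropped out by the time index 2 is reached.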

In [1]: from pandas import Series, expanding_corr

In [2]: s1 = Series([1, 2, 3], index=[0, 1, 2])

In [3]: s2 = Series([1, 3], index=[0, 2])

In [4]: expanding_corr(s1, s2)
Out[4]:
0   NaN
1   NaN
2   NaN
dtype: float64

In [5]: s2a = Series([1, None, 3], index=[0, 1, 2])

In [6]: expanding_corr(s1, s2a)
Out[6]:
0   NaN
1   NaN
2     1
dtype: float64
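
As a reference for the result expected here, the following is a minimal pure-Python sketch, not the pandas implementation. It aligns both inputs on the union of their labels, treats missing labels as NaN, and correlates only the jointly valid pairs seen so far. The helper names `pearson` and `expanding_corr_aligned` are hypothetical.

```python
import math


def pearson(xs, ys):
    # Keep only positions where both values are present.
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        return math.nan
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def expanding_corr_aligned(a, b):
    # a and b map index label -> value; align on the union of labels first.
    labels = sorted(set(a) | set(b))
    xs = [float(a.get(k, math.nan)) for k in labels]
    ys = [float(b.get(k, math.nan)) for k in labels]
    # The window at each label is everything up to and including that label.
    return {k: pearson(xs[:i + 1], ys[:i + 1]) for i, k in enumerate(labels)}


s1 = {0: 1, 1: 2, 2: 3}
s2 = {0: 1, 2: 3}
result = expanding_corr_aligned(s1, s2)
print(result[2])  # 1.0
```

Labels 0 and 1 come out as NaN because only one jointly valid pair is available at those points, which matches Out[6].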

Next, there is a problem with data frames. [This was originally reported separately in https://github.com//issues/7512, but I've merged it into this issue.]

The problem is with _flex_binary_moment(). When pairwise=True, it doesn't properly handle two DataFrames with different index sets. In the following example, I believe [6], [7], and [8] should all produce the result in [9].

In [1]: from pandas import DataFrame, expanding_corr

In [2]: df1 = DataFrame([[1,2], [3, 2], [3,4]], columns=['A','B'])

In [3]: df1a = DataFrame([[1,2], [3,4]], columns=['A','B'], index=[0,2])

In [4]: df2 = DataFrame([[5,6], [None,None], [2,1]], columns=['X','Y'])

In [5]: df2a = DataFrame([[5,6], [2,1]], columns=['X','Y'], index=[0,2])

In [6]: expanding_corr(df1, df2, pairwise=True)[2]
Out[6]:
          X         Y
A -1.224745 -1.224745
B -1.224745 -1.224745

In [7]: expanding_corr(df1, df2a, pairwise=True)[2]
Out[7]:
    X   Y
A NaN NaN
B NaN NaN

In [8]: expanding_corr(df1a, df2, pairwise=True)[2]
Out[8]:
    X   Y
A NaN NaN
B NaN NaN

In [9]: expanding_corr(df1a, df2a, pairwise=True)[2]
Out[9]:
   X  Y
A -1 -1
B -1 -1
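
The value -1.224745 in [6] lies outside [-1, 1], so it cannot be a valid correlation. One way to reproduce it, offered as an assumption about the cause rather than something confirmed in this issue, is to take the covariance over the jointly valid rows while taking each standard deviation over that column's own valid rows. The sketch below uses plain Python and hypothetical helper names.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    # Sample covariance (divides by n - 1); sample_cov(v, v) is the variance.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


a = [1.0, 3.0, 3.0]        # df1['A']
x = [5.0, math.nan, 2.0]   # df2['X']

# Rows where both columns have a value.
joint = [(p, q) for p, q in zip(a, x) if not math.isnan(q)]
ja = [p for p, _ in joint]
jx = [q for _, q in joint]

cov = sample_cov(ja, jx)

# Consistent masks: every statistic uses the jointly valid rows.
consistent = cov / math.sqrt(sample_cov(ja, ja) * sample_cov(jx, jx))

# Mismatched masks (assumed cause): variance of A uses all three rows.
mismatched = cov / math.sqrt(sample_cov(a, a) * sample_cov(jx, jx))

print(consistent)  # -1.0
print(mismatched)  # approximately -1.224745
```

Column B of df1 gives the same two numbers, since [2, 2, 4] has the same sample variance as [1, 3, 3].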

And there are similar problems with rolling_cov and rolling_corr. For example, continuing with the previous example, [77], [78], and [79] should give the same result as [80].

In [77]: rolling_corr(df1, df2, window=3, pairwise=True, min_periods=2)[2]
Out[77]:
          X         Y
A -1.224745 -1.224745
B -1.224745 -1.224745

In [78]: rolling_corr(df1, df2a, window=3, pairwise=True, min_periods=2)[2]
Out[78]:
    X   Y
A NaN NaN
B NaN NaN

In [79]: rolling_corr(df1a, df2, window=3, pairwise=True, min_periods=2)[2]
Out[79]:
    X   Y
A NaN NaN
B NaN NaN

In [80]: rolling_corr(df1a, df2a, window=3, pairwise=True, min_periods=2)[2]
Out[80]:
   X  Y
A -1 -1
B -1 -1

Labels: Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations)