Description
related #7514
There appears to be a bug in the expanding_{cov,corr} functions when dealing with two objects with different indexes.
First, there is a problem with series. See example below, where I would expect expanding_corr(s1, s2) to produce the result produced by expanding_corr(s1, s2a).
The problem arises because expanding_corr is implemented in terms of rolling_corr with window = max(len(arg1), len(arg2)), but rolling_corr then resets window to min(window, len(arg1), len(arg2)). The net effect is window = min(len(arg1), len(arg2)), computed from the raw, unaligned arg1 and arg2. Thus in the expanding_corr(s1, s2) example below, window=2, so when calculating the third row (index=2) it computes the correlation between [2, 3] and [NaN, 3], producing NaN, rather than the correlation between [1, 2, 3] and [1, NaN, 3], which is 1.
The solution would appear to be simply deleting the window = min(window, len(arg1), len(arg2)) line from rolling_cov and rolling_corr, as I believe the rolling_* functions run fine with a window larger than the data, or at least replacing it with window = min(window, max(len(arg1), len(arg2))).
In [1]: from pandas import Series, expanding_corr
In [2]: s1 = Series([1, 2, 3], index=[0, 1, 2])
In [3]: s2 = Series([1, 3], index=[0, 2])
In [4]: expanding_corr(s1, s2)
Out[4]:
0 NaN
1 NaN
2 NaN
dtype: float64
In [5]: s2a = Series([1, None, 3], index=[0, 1, 2])
In [6]: expanding_corr(s1, s2a)
Out[6]:
0 NaN
1 NaN
2 1
dtype: float64
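The window arithmetic described above can be reproduced without pandas. The following is only a minimal sketch under stated assumptions: `align`, `corr`, and `rolling_corr_sketch` are hypothetical helpers written for illustration, not the pandas implementation, and they ignore min_periods. Series are modelled as plain dicts mapping index label to value.

```python
import math

NAN = float("nan")


def align(a, b):
    """Align two {label: value} mappings on the union of their labels (hypothetical helper)."""
    index = sorted(set(a) | set(b))
    return index, [a.get(i, NAN) for i in index], [b.get(i, NAN) for i in index]


def corr(xs, ys):
    """Pearson correlation over the positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        return NAN
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return NAN
    return sxy / math.sqrt(sxx * syy)


def rolling_corr_sketch(a, b, window):
    """Correlation over a trailing window of the aligned data (illustration only)."""
    index, xs, ys = align(a, b)
    result = {}
    for pos, label in enumerate(index):
        start = max(0, pos + 1 - window)
        result[label] = corr(xs[start:pos + 1], ys[start:pos + 1])
    return result


s1 = {0: 1.0, 1: 2.0, 2: 3.0}
s2 = {0: 1.0, 2: 3.0}

expanding_window = max(len(s1), len(s2))                  # 3, as set by expanding_corr
clamped_window = min(expanding_window, len(s1), len(s2))  # 2, after the reset in rolling_corr

clamped = rolling_corr_sketch(s1, s2, clamped_window)      # every row is NaN
unclamped = rolling_corr_sketch(s1, s2, expanding_window)  # row 2 is 1.0
```

With the clamped window, the last row sees only [2, 3] and [NaN, 3], which leaves a single complete pair; with the unclamped window it sees the full aligned history and yields 1, matching Out[6].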
Next, there is a problem with data frames. [This was originally reported separately in https://github.com//issues/7512, but I've merged it into this issue.]
The problem is with _flex_binary_moment(). When pairwise=True, it doesn't properly handle two DataFrames with different index sets. In the following example, I believe [6], [7], and [8] should all produce the result in [9].
In [1]: from pandas import DataFrame, expanding_corr, rolling_corr
In [2]: df1 = DataFrame([[1,2], [3, 2], [3,4]], columns=['A','B'])
In [3]: df1a = DataFrame([[1,2], [3,4]], columns=['A','B'], index=[0,2])
In [4]: df2 = DataFrame([[5,6], [None,None], [2,1]], columns=['X','Y'])
In [5]: df2a = DataFrame([[5,6], [2,1]], columns=['X','Y'], index=[0,2])
In [6]: expanding_corr(df1, df2, pairwise=True)[2]
Out[6]:
X Y
A -1.224745 -1.224745
B -1.224745 -1.224745
In [7]: expanding_corr(df1, df2a, pairwise=True)[2]
Out[7]:
X Y
A NaN NaN
B NaN NaN
In [8]: expanding_corr(df1a, df2, pairwise=True)[2]
Out[8]:
X Y
A NaN NaN
B NaN NaN
In [9]: expanding_corr(df1a, df2a, pairwise=True)[2]
Out[9]:
X Y
A -1 -1
B -1 -1
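A correlation of -1.224745 lies outside [-1, 1], so it cannot be a valid result. One calculation that happens to reproduce the figure in Out[6] is sketched below: the covariance is taken over the two rows where both columns are present, while the standard deviation of column A is taken over all three rows. This is only a guess at the mechanism, not something confirmed from the pandas source, and `sample_std` and `sample_cov` are hypothetical helpers written for illustration.

```python
import math


def sample_std(values):
    """Sample standard deviation (ddof=1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_cov(xs, ys):
    """Sample covariance (ddof=1) of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


a = [1.0, 3.0, 3.0]   # df1['A']
x = [5.0, None, 2.0]  # df2['X'], with the missing middle row

# Keep only the rows where both columns have a value.
pairs = [(u, v) for u, v in zip(a, x) if v is not None]
a_joint = [u for u, _ in pairs]
x_joint = [v for _, v in pairs]

cov = sample_cov(a_joint, x_joint)

# All three statistics over the same two rows: a valid correlation of -1.
consistent = cov / (sample_std(a_joint) * sample_std(x_joint))

# Assumed mismatch: std of A over all three rows, everything else over two rows.
mismatched = cov / (sample_std(a) * sample_std(x_joint))
```

Here `consistent` comes out at -1, the value in Out[9], while `mismatched` comes out at approximately -1.224745, the value in Out[6]. The same arithmetic gives -1.224745 for the other three cells as well.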
And there are similar problems with rolling_cov and rolling_corr. For example, continuing with the previous example, [77], [78], and [79] should give the same result as [80].
In [77]: rolling_corr(df1, df2, window=3, pairwise=True, min_periods=2)[2]
Out[77]:
X Y
A -1.224745 -1.224745
B -1.224745 -1.224745
In [78]: rolling_corr(df1, df2a, window=3, pairwise=True, min_periods=2)[2]
Out[78]:
X Y
A NaN NaN
B NaN NaN
In [79]: rolling_corr(df1a, df2, window=3, pairwise=True, min_periods=2)[2]
Out[79]:
X Y
A NaN NaN
B NaN NaN
In [80]: rolling_corr(df1a, df2a, window=3, pairwise=True, min_periods=2)[2]
Out[80]:
X Y
A -1 -1
B -1 -1