DataFrame.corrwith has unintuitive behavior #22328

Closed
@dsaxton

Description

import numpy as np
import pandas as pd

np.random.seed(2357)

df1 = pd.DataFrame(np.random.normal(size=(100, 3)))
df2 = pd.DataFrame(np.random.normal(size=(100, 3)))

# not clear from the docstring what this returns
df1.corrwith(df2)
# simply renaming the columns produces NaNs
df2.columns = ["a", "b", "c"]
df1.corrwith(df2)
# this should arguably be the same as df1.corr()
df1.corrwith(df1)

Problem description

The docstring for corrwith describes its behavior as:

Compute pairwise correlation between rows or columns of two DataFrame
objects.

My interpretation of this would be that corrwith finds all pairwise correlations between the two DataFrames along the given axis. (It's also defined for Series input, which the docstring should probably state more explicitly.) That is, for axis=0, df1.corrwith(df2) would be functionally equivalent to

df2.apply(lambda x: df1.corrwith(x))

However, as implemented, the method returns a single Series that appears to contain only specific pairwise correlations, and how the pairs are chosen does not appear to be documented.
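For anyone trying to reproduce this: as far as I can tell, the pairs are chosen by matching column labels, so each column of df1 is correlated only with the column of df2 that has the same label. The snippet below is a minimal sketch of that reading, not a statement of the intended design. It uses a local RandomState instead of the global seed, and the string labels and the names by_label, matched, unmatched and self_corr are placeholders of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)))

# Column i of df1 is paired only with column i of df2 (same label).
by_label = pd.Series({col: df1[col].corr(df2[col]) for col in df1.columns})
matched = df1.corrwith(df2)

# With no labels in common there is nothing to pair, so every entry is NaN.
# (String labels on both sides are an assumption made for this sketch.)
left = df1.rename(columns={0: "x", 1: "y", 2: "z"})
right = df2.rename(columns={0: "a", 1: "b", 2: "c"})
unmatched = left.corrwith(right)

# Pairing a frame with itself gives only the diagonal of df1.corr().
self_corr = df1.corrwith(df1)
```

Under this reading, matched agrees with by_label, unmatched has one NaN entry per label from either frame, and self_corr is all ones rather than the full correlation matrix.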

Defining corrwith recursively as above seems more useful and more interpretable, and you still retain all the information from the current implementation. It also fixes the apparent bug caused when the method tries to align columns, and it allows for an easy way of closing issue #21925, since the method would ultimately fall back on Series.corr, which permits control of the correlation type. I can start to work on a pull request if others agree that this would be preferred.

Expected Output

# df1.corrwith(df2) should be equal to this in my opinion
df2.apply(lambda x: df1.corrwith(x))
# this yields the ordinary correlation matrix as expected, although it is slower than just df1.corr()
df1.apply(lambda x: df1.corrwith(x))
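To make the proposed result concrete, here is a rough sketch of the all-pairs behaviour written directly in terms of Series.corr. The helper cross_corr is hypothetical (it is not part of pandas), and the method argument is simply passed through to Series.corr to illustrate how the correlation type could be exposed.

```python
import numpy as np
import pandas as pd

def cross_corr(left, right, method="pearson"):
    """Hypothetical helper: correlate every column of `left` with every column of `right`.

    Rows of the result are labelled by left's columns, columns by right's columns.
    """
    return pd.DataFrame(
        {
            rcol: {
                lcol: left[lcol].corr(right[rcol], method=method)
                for lcol in left.columns
            }
            for rcol in right.columns
        }
    )

rng = np.random.RandomState(2357)
df1 = pd.DataFrame(rng.normal(size=(100, 3)))
df2 = pd.DataFrame(rng.normal(size=(100, 3)))

proposed = df2.apply(lambda x: df1.corrwith(x))  # expression from this issue
explicit = cross_corr(df1, df2)                  # same matrix, built pair by pair
```

The diagonal of explicit holds the values that df1.corrwith(df2) currently returns when the labels match, and cross_corr(df1, df1) reproduces df1.corr(), so nothing from the existing output is lost.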

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: 3.7.1
pip: 18.0
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
