Description
Code Sample, a copy-pastable example if possible
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}), how='left', left_on='b', right_on='d')
Problem description
df1:
a | b | |
---|---|---|
0 | 1 | 4.0 |
1 | 2 | 5.0 |
2 | 3 | NaN |
df2:
c | d | |
---|---|---|
0 | 6 | 4.0 |
1 | 7 | NaN |
2 | 8 | NaN |
3 | 9 | 5.0 |
Current output:
a | b | c | d | |
---|---|---|---|---|
0 | 1 | 4.0 | 6 | 4.0 |
1 | 2 | 5.0 | 9 | 5.0 |
2 | 3 | NaN | 7 | NaN |
3 | 3 | NaN | 8 | NaN |
Expected Output
a | b | c | d | |
---|---|---|---|---|
0 | 1 | 4.0 | 6 | 4.0 |
1 | 2 | 5.0 | 9 | 5.0 |
What's happening is the NaN is df1.b
is matching the NaNs in df2.d
.
I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be some documented option in DataFrame.merge which accomplishes this.
What do you think?
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None