DataFrame.iterrows() breaks timezone on index #8951

Closed
@JackKelly

Duplicate of #8890.

As far as I can tell, the Timestamps for the index generated by iterrows() are 5 hours behind where they should be in this example:

In [33]: idx = pd.date_range("2010-01-01 00:00:00-0500", freq='D', periods=3)

In [34]: df = pd.DataFrame([1,2,3], index=[idx])

In [35]: df # this looks correct
Out[35]: 
                           0
2010-01-01 00:00:00-05:00  1
2010-01-02 00:00:00-05:00  2
2010-01-03 00:00:00-05:00  3

In [36]: [index for index, row in df.iterrows()] # but this looks wrong:
Out[36]: 
[Timestamp('2009-12-31 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-01 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]

I would have expected iterrows() to yield the same index labels as this code:

In [38]: [df.index[i] for i in range(len(df))]
Out[38]: 
[Timestamp('2010-01-01 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-03 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]
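
The expectation above can be written as a small check. This is only a minimal sketch and not part of the original report: it passes the index directly (`index=idx`) rather than wrapping it in a list, because newer pandas versions may treat a list of arrays as a MultiIndex (that choice is an assumption). Judging by the outputs shown here, the comparison would be expected to print False on an affected version such as 0.15.1.

```python
import pandas as pd

# Fixed-offset index from the report. Assumption: the index is passed
# directly (index=idx), not wrapped in a list as in the original session.
idx = pd.date_range("2010-01-01 00:00:00-0500", freq="D", periods=3)
df = pd.DataFrame([1, 2, 3], index=idx)

# Labels as yielded by iterrows() versus labels read straight from the index.
labels_from_iterrows = [label for label, _ in df.iterrows()]
labels_from_index = [df.index[i] for i in range(len(df))]

# True when iterrows() preserves the index timestamps.
labels_match = labels_from_iterrows == labels_from_index
print(labels_match)
```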

The row.name is also incorrect:

In [37]: [row.name for index, row in df.iterrows()]
Out[37]: 
[Timestamp('2009-12-31 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-01 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]
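
Since reading labels straight from `df.index` gives the expected timestamps in this report, one possible way to sidestep the problem is to pair each index label with its row by position. This is a hedged sketch, not a confirmed fix: it again assumes the index is passed directly (`index=idx`), and it takes the label from `df.index` rather than relying on `row.name`.

```python
import pandas as pd

# Same fixed-offset index as in the report (assumption: index=idx, not [idx]).
idx = pd.date_range("2010-01-01 00:00:00-0500", freq="D", periods=3)
df = pd.DataFrame([1, 2, 3], index=idx)

# Pair each label from df.index with the row at the same position,
# so the label never passes through iterrows().
rows = [(df.index[i], df.iloc[i]) for i in range(len(df))]

for label, row in rows:
    # row.iloc[0] is the value in the single column of this frame.
    print(label, row.iloc[0])
```

This costs one positional lookup per row, so it is best suited to small frames like the one in this example.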

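But the times are preserved if we use a geographical timezone instead of a pytz.FixedOffset (although the offset is displayed as -04:56 rather than the -05:00 I would expect for New York in January):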

In [47]: idx = pd.date_range("2010-01-01 00:00:00", freq='D', periods=3, tz="America/New_York")

In [48]: df = pd.DataFrame([1,2,3], index=[idx])

In [49]: [index for index, row in df.iterrows()]
Out[49]: 
[Timestamp('2010-01-01 00:00:00-0456', tz='America/New_York', offset='D'),
 Timestamp('2010-01-02 00:00:00-0456', tz='America/New_York', offset='D'),
 Timestamp('2010-01-03 00:00:00-0456', tz='America/New_York', offset='D')]

In [50]: df
Out[50]: 
                           0
2010-01-01 00:00:00-04:56  1
2010-01-02 00:00:00-04:56  2
2010-01-03 00:00:00-04:56  3
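
For reference, New York's standard-time offset in January is -05:00, so the -04:56 shown above is worth a second look. A minimal sketch for comparing the UTC offset reported by the index with the one reported by iterrows(), assuming the index is passed directly (`index=idx`) instead of as a list:

```python
import pandas as pd

# Geographical timezone, as in the second example of the report
# (assumption: index=idx rather than index=[idx]).
idx = pd.date_range("2010-01-01 00:00:00", freq="D", periods=3,
                    tz="America/New_York")
df = pd.DataFrame([1, 2, 3], index=idx)

# UTC offset of each label, read from the index and from iterrows().
index_offsets = [ts.utcoffset() for ts in df.index]
iterrows_offsets = [label.utcoffset() for label, _ in df.iterrows()]

print(index_offsets)
print(iterrows_offsets)
```

If the two lists differ, or if either shows an offset other than five hours behind UTC, the timezone is being mishandled somewhere along that path.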

Forgive me if I am using Pandas incorrectly!

Versions:

In [51]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-25-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: None
dateutil: 2.2
pytz: 2014.10
bottleneck: 0.6.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 1.8.6
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.6
bs4: None
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: 2.3.8
sqlalchemy: None
pymysql: None
psycopg2: None

(it goes without saying that I'm a huge fan of Pandas!)
