PERF: pandas 0.15.2 multi-indexed DataFrame sum  #9049

Closed
@xdliao

Description

Problem:
For a multi-indexed table, data.sum(level=...) produces a different result (with lots of NAs) than
data.groupby(level=...).sum() in certain cases. It is also much slower than groupby. It seems that
the new version produces a cross join of the keys and fills in NAs for every pair of keys with no
data, which makes the result bigger and the computation significantly slower.
This happens in the following example:

Code:

import pandas as pd
print "-------------- pandas version: ", pd.__version__
max_num_of_syms = 4000
list_of_df = []
for i, a in enumerate(pd.Series(range(100)).astype(str)):
    # Each 'A' has a different number of 'B' entries, which is what triggers the problem
    num_of_syms = int(i*max_num_of_syms/100.0)  # if i<3 else max_num_of_syms
    #print num_of_syms
    d = pd.DataFrame({'A': [a]*num_of_syms, 'B': pd.Series(range(num_of_syms)).astype(str), 'C': 1})
    list_of_df.append(d)
data = pd.concat(list_of_df).set_index(['A','B'])


%time a= data.sum(level=['A','B'])
print a.shape
#This is a lot faster
%time a= data.reset_index().groupby(['A','B']).sum()
print a.shape

-------------- pandas version: 0.15.1.dev
CPU times: user 876 ms, sys: 17 ms, total: 893 ms
Wall time: 894 ms
(392040, 1)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 108 ms
(198000, 1)

-------------- pandas version: 0.14.1
CPU times: user 94 ms, sys: 0 ns, total: 94 ms
Wall time: 94.2 ms
(198000, 1)
CPU times: user 120 ms, sys: 0 ns, total: 120 ms
Wall time: 120 ms
(198000, 1)
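
As a rough sanity check on the cross-join guess, the two row counts reported above can be reproduced with plain arithmetic, without pandas. This is only a sketch: the variable names (counts, observed_pairs, distinct_a, distinct_b, cross_join_rows) are illustrative and not part of the original script, and it assumes the larger result is exactly the product of the distinct 'A' keys and the distinct 'B' keys.

```python
# Sizes taken from the reproduction script in this report.
max_num_of_syms = 4000

# Number of 'B' entries generated for each of the 100 'A' values.
counts = [int(i * max_num_of_syms / 100.0) for i in range(100)]

# Rows that actually exist in `data` (one per observed (A, B) pair).
observed_pairs = sum(counts)

# 'A' == "0" contributes no rows, so it never appears in the index.
distinct_a = sum(1 for n in counts if n > 0)

# 'B' values run from "0" up to str(max(counts) - 1).
distinct_b = max(counts)

# Assumption: sum(level=...) builds the full A x B product of keys.
cross_join_rows = distinct_a * distinct_b

print(observed_pairs)   # 198000
print(cross_join_rows)  # 392040
```

Both numbers match the shapes printed above: 198000 rows for groupby, and 392040 rows for sum(level=...) on 0.15.1.dev, which is consistent with 99 x 3960 key combinations being materialized.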
