Description
Problem:
data.sum(level=...) on a MultiIndex DataFrame produces a different result (with lots of NAs) than
data.groupby(level=...).sum() in certain cases. It is also much slower than groupby. It seems that
the new version builds a cross join of the index keys and produces NAs for key pairs with no data,
which makes the result bigger and significantly slower. This happens in the following example:
Code:
import pandas as pd

print("-------------- pandas version: ", pd.__version__)
max_num_of_syms = 4000
list_of_df = []
for i, a in enumerate(pd.Series(range(100)).astype(str)):
    # Each 'A' has a different number of 'B' entries in order to trigger the problem
    num_of_syms = int(i * max_num_of_syms / 100.0)
    d = pd.DataFrame({'A': [a] * num_of_syms,
                      'B': pd.Series(range(num_of_syms)).astype(str),
                      'C': 1})
    list_of_df.append(d)
data = pd.concat(list_of_df).set_index(['A', 'B'])

%time a = data.sum(level=['A', 'B'])
print(a.shape)
# This is a lot faster
%time a = data.reset_index().groupby(['A', 'B']).sum()
print(a.shape)
-------------- pandas version: 0.15.1.dev
CPU times: user 876 ms, sys: 17 ms, total: 893 ms
Wall time: 894 ms
(392040, 1)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 108 ms
(198000, 1)
-------------- pandas version: 0.14.1
CPU times: user 94 ms, sys: 0 ns, total: 94 ms
Wall time: 94.2 ms
(198000, 1)
CPU times: user 120 ms, sys: 0 ns, total: 120 ms
Wall time: 120 ms
(198000, 1)
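The shape difference above is consistent with a cross join of the level values: in the example, 99 non-empty 'A' groups times 3960 distinct 'B' values gives the 392040 rows, versus 198000 actually observed (A, B) pairs. A minimal sketch of the two shapes (written for current pandas, where `sum(level=...)` is no longer available, so `groupby` plus an explicit `reindex` over the full cross product is used to emulate the NA-filled result; the data here is illustrative, not from the report):

```python
import pandas as pd

# A tiny MultiIndex where each 'A' has a different number of 'B' entries,
# so the set of observed (A, B) pairs is smaller than the cross product.
df = pd.concat([
    pd.DataFrame({'A': a, 'B': [str(b) for b in range(n)], 'C': 1})
    for a, n in [('0', 1), ('1', 2), ('2', 3)]
]).set_index(['A', 'B'])

# One row per observed (A, B) pair -- what groupby(...).sum() returns.
observed = df.groupby(level=['A', 'B']).sum()
print(observed.shape)        # (6, 1): six observed pairs

# A cross join of the level values has 3 * 3 = 9 pairs; the extra three
# rows are the NA-filled combinations that never occur in the data.
full = observed.reindex(
    pd.MultiIndex.from_product([['0', '1', '2'],
                                ['0', '1', '2']], names=['A', 'B'])
)
print(full.shape)            # (9, 1)
print(full['C'].isna().sum())  # 3 NA rows
```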