Description
Problem:
data.sum(level=...) on a MultiIndex DataFrame produces a different result (with lots of NAs) than
data.groupby(level=...).sum() in certain cases. It is also much slower than groupby. It seems that
the new version builds a cross join of the index keys and produces NAs for key pairs with no data,
which makes the result bigger and significantly slower. This happens in the following example:
Code:
import pandas as pd

print("-------------- pandas version: ", pd.__version__)
max_num_of_syms = 4000
list_of_df = []
for i, a in enumerate(pd.Series(range(100)).astype(str)):
    # Each 'A' has a different number of 'B' entries in order to trigger the problem
    num_of_syms = int(i * max_num_of_syms / 100.0)
    d = pd.DataFrame({'A': [a] * num_of_syms,
                      'B': pd.Series(range(num_of_syms)).astype(str),
                      'C': 1})
    list_of_df.append(d)
data = pd.concat(list_of_df).set_index(['A', 'B'])

%time a = data.sum(level=['A', 'B'])
print(a.shape)
# This is a lot faster
%time a = data.reset_index().groupby(['A', 'B']).sum()
print(a.shape)
-------------- pandas version: 0.15.1.dev
CPU times: user 876 ms, sys: 17 ms, total: 893 ms
Wall time: 894 ms
(392040, 1)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 108 ms
(198000, 1)
-------------- pandas version: 0.14.1
CPU times: user 94 ms, sys: 0 ns, total: 94 ms
Wall time: 94.2 ms
(198000, 1)
CPU times: user 120 ms, sys: 0 ns, total: 120 ms
Wall time: 120 ms
(198000, 1)
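The shape difference above is consistent with a cross join of the level values: in the example, 99 non-empty 'A' groups times 3960 distinct 'B' values gives the 392040 rows, versus 198000 actually observed (A, B) pairs. A minimal sketch of the two shapes (written for current pandas, where `sum(level=...)` is no longer available, so `groupby` plus an explicit `reindex` over the full cross product is used to emulate the NA-filled result; the data here is illustrative, not from the report):

```python
import pandas as pd

# A tiny MultiIndex where each 'A' has a different number of 'B' entries,
# so the set of observed (A, B) pairs is smaller than the cross product.
df = pd.concat([
    pd.DataFrame({'A': a, 'B': [str(b) for b in range(n)], 'C': 1})
    for a, n in [('0', 1), ('1', 2), ('2', 3)]
]).set_index(['A', 'B'])

# One row per observed (A, B) pair -- what groupby(...).sum() returns.
observed = df.groupby(level=['A', 'B']).sum()
print(observed.shape)        # (6, 1): six observed pairs

# A cross join of the level values has 3 * 3 = 9 pairs; the extra three
# rows are the NA-filled combinations that never occur in the data.
full = observed.reindex(
    pd.MultiIndex.from_product([['0', '1', '2'],
                                ['0', '1', '2']], names=['A', 'B'])
)
print(full.shape)            # (9, 1)
print(full['C'].isna().sum())  # 3 NA rows
```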