Description
Hi,
Building a MultiIndex from a categorical index should be instantaneous: the labels are simply the codes, and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1], as intended in
pandas/pandas/core/arrays/categorical.py
Line 2510 in edb71fd
Unfortunately, that call is painfully slow:
import pandas as pd
import numpy as np
pd.__version__
# 0.23.3
x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')
%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
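To make the expectation concrete, here is a small sketch (on a shorter array) of the structure I would expect the resulting MultiIndex to have: labels equal to the categorical codes, and a single level whose own codes are just 0 ... N-1. The `hasattr` check is only there because, as far as I know, the attribute is called `labels` on 0.23.x and `codes` on later releases; the variable names are mine.

```python
import numpy as np
import pandas as pd

# Smaller array than in the benchmark, just to inspect the structure.
x = np.linspace(-1, 1, 1_000)
j = pd.Index(x).astype("category")
mi = pd.MultiIndex.from_arrays([j])

# `codes` on recent pandas, `labels` on 0.23.x (assumption noted above).
mi_codes = mi.codes[0] if hasattr(mi, "codes") else mi.labels[0]
level = mi.levels[0]

# The MultiIndex labels should match the categorical codes.
print(np.array_equal(np.asarray(mi_codes), j.codes))
# The level should keep the same categories...
print(level.categories.equals(j.categories))
# ...and its own codes should be the identity mapping 0 ... N-1.
print(np.array_equal(level.codes, np.arange(len(j.categories))))
```

Nothing here requires re-encoding the values, which is why the construction looks like it could be nearly free.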
The line referenced above from categorical.py is the first bottleneck: it takes 130 ms in this example.
We could work around this and replace that line with something along these lines:
indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()
On my machine, this takes ~2 ms.
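The same idea can probably be written with public API only, which avoids poking at `_data._codes`. This is a minimal sketch, not a tested patch: it assumes `pd.Categorical.from_codes` is an acceptable way to attach identity codes to the existing categories, and `identity_codes` and `level` are names I made up for illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 1_000)
j = pd.Index(x).astype("category")

cats = j.categories
# Identity mapping: the i-th entry of the level points at the i-th category.
identity_codes = np.arange(len(cats), dtype=j.codes.dtype)

# Build the level directly from codes, so the values are not re-encoded.
level = pd.CategoricalIndex(
    pd.Categorical.from_codes(identity_codes, categories=cats, ordered=j.ordered)
)

print(level.categories.equals(cats))
print(np.array_equal(level.codes, identity_codes))
```

Whether this is as fast as the private-attribute version would need to be measured, since `from_codes` may still validate the categories.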
The remaining 50 ms are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be that shallow: it looks like it boils down to yet another variation on the CategoricalIndex constructor:
%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note, however, that constructing from the index alone is essentially free:
%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
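For anyone who wants to reproduce the comparison outside IPython, here is a rough sketch using the standard `timeit` module. It only covers the two public constructor calls (not `_shallow_copy`, which is private), uses a smaller array to keep it quick, and rebuilds `categories` with `astype` as a stand-in for the index from the workaround above. The function names are mine, and the timings will of course depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000
# Stand-in for the `categories` index used in the timings above.
categories = pd.Index(np.linspace(-1, 1, n)).astype("category")


def rebuild_with_explicit_categories():
    # Passes categories and ordered explicitly, like the slow call above.
    return pd.CategoricalIndex(
        categories,
        categories=categories.categories,
        ordered=categories.ordered,
    )


def rebuild_from_index_only():
    # Passes only the existing CategoricalIndex, like the fast call above.
    return pd.CategoricalIndex(categories)


for fn in (rebuild_with_explicit_categories, rebuild_from_index_only):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```

Both calls should return an equal index, so any gap between them is overhead rather than extra useful work.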
I would have liked to make a PR, but I don't fully understand the code and all that class overloading.