Multi-index and CategoricalIndex performance #22044

Hi,

Building a multi-index from a categorical index should be instantaneous: the labels are the codes, and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1], as intended in

https://github.com/pandas-dev/pandas/blob/edb71fda022c6a155717e7a25679040ee0476639/pandas/core/arrays/categorical.py#L2510

Unfortunately, that call is painfully slow:

```python
import pandas as pd
import numpy as np

pd.__version__
# 0.23.3

x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')

%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
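To make the expected structure concrete, here is a minimal sketch on a tiny index. It assumes a recent pandas where the attribute is named `codes` (on 0.23.x it is `labels`) and where the level built from a categorical is still a `CategoricalIndex`; the toy values are mine, not from the benchmark above.

```python
import numpy as np
import pandas as pd

# Tiny categorical index; every category is used at least once.
j = pd.Index(["b", "a", "b", "c"]).astype("category")
mi = pd.MultiIndex.from_arrays([j])

level = mi.levels[0]
# `codes` is the pandas >= 0.24 name; on 0.23.x the attribute is `labels`.
mi_codes = np.asarray(mi.codes[0])

# The MultiIndex codes are the categorical codes, unchanged.
assert (mi_codes == j.codes).all()
# The level carries the same categories, coded 0 ... N-1.
assert list(level.categories) == list(j.categories)
assert (np.asarray(level.codes) == np.arange(len(j.categories))).all()
```

Nothing here needs the values to be hashed or sorted again, which is why the construction looks like it should be nearly free.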

The `categorical.py` line linked above is the first bottleneck: it takes 130 ms in this example.
We could work around it by replacing that line with something along these lines:

```python
indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()
```

On my machine, this takes ~2 ms.
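The same idea can probably be expressed without touching private attributes. The sketch below is an assumption on my side, not the pandas implementation: `level_from_categorical` is a hypothetical helper name, and it relies on `pd.Categorical.from_codes` accepting a `dtype` argument (pandas 0.24 or later). I have not timed it against the snippet above.

```python
import numpy as np
import pandas as pd


def level_from_categorical(values):
    """Hypothetical helper: build the level for `values` from its dtype alone.

    The categories are reused as-is and the codes are simply 0 ... N-1,
    so the category values themselves are never looked up again.
    """
    n = len(values.categories)
    codes = np.arange(n, dtype=values.codes.dtype)
    # Reuse the existing CategoricalDtype (categories + ordered flag).
    cat = pd.Categorical.from_codes(codes, dtype=values.dtype)
    return pd.CategoricalIndex(cat)


values = pd.Categorical(["b", "a", "b", "c"])
level = level_from_categorical(values)
print(level)
```

`from_codes` still validates that the codes are in range, so this is not entirely free, but that check is a linear pass over integers rather than a lookup of the category values.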

The remaining 50 ms are spent calling `_shallow_copy` in `MultiIndex._set_levels`. That copy may not be that shallow: it looks like it boils down to yet another variation on the `CategoricalIndex` constructor:

```python
%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Note, however, that

```python
%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
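For anyone who wants to reproduce the comparison outside IPython, here is a rough sketch using the standard-library `timeit` module. The `best_ms` helper and the smaller size `n` are my own choices, and `categories` here is a stand-in built directly from the data rather than the object from the snippet above, so the absolute numbers will differ from the ones quoted.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # smaller than the 1_000_000 used above, to keep the script quick
categories = pd.Index(np.linspace(-1, 1, n)).astype("category")


def best_ms(fn, repeat=3, number=3):
    """Best per-call time in milliseconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number * 1e3


# Constructor called with explicit categories and ordered flag.
explicit = best_ms(
    lambda: pd.CategoricalIndex(categories, categories.categories, categories.ordered)
)
# Constructor called with the data only.
implicit = best_ms(lambda: pd.CategoricalIndex(categories))

print(f"explicit categories/ordered: {explicit:.3f} ms")
print(f"data only:                   {implicit:.3f} ms")
```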

I would have liked to make a PR, but I don't fully understand the code and all that class overloading.
