Multi-index and CategoricalIndex performance #22044

Closed
@0x0L

Hi,

Building a MultiIndex from a CategoricalIndex should be nearly instantaneous: the labels are the codes, and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1], as intended in

categories = CategoricalIndex(values.categories,

Unfortunately, that call is painfully slow:

import pandas as pd
import numpy as np

pd.__version__
# 0.23.3

x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')

%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The line quoted above from categorical.py is the first bottleneck: it takes 130 ms in this example.
We could work around it by replacing that line with something along these lines:

indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()

On my machine, this takes ~2 ms.
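For comparison, here is a minimal sketch of the same idea using only public API, so it does not depend on `fastpath`, `_data._codes` or `_cleanup()`. The helper name `categories_as_level` is hypothetical, and `values` is assumed to be a `Categorical` or `CategoricalIndex`. `Categorical.from_codes` validates the codes, so this is not guaranteed to match the ~2 ms figure; I have not timed it against 0.23.3.

```python
import numpy as np
import pandas as pd


def categories_as_level(values):
    """Return a CategoricalIndex whose i-th entry is the i-th category of `values`.

    Hypothetical helper: `values` is assumed to be a pandas Categorical
    or CategoricalIndex.
    """
    n = len(values.categories)
    # Each category appears exactly once, in order, so the codes are 0..n-1.
    codes = np.arange(n, dtype=values.codes.dtype)
    cat = pd.Categorical.from_codes(
        codes, categories=values.categories, ordered=values.ordered
    )
    return pd.CategoricalIndex(cat)


if __name__ == "__main__":
    j = pd.Index(np.linspace(-1, 1, 1000)).astype("category")
    level = categories_as_level(j)
    print(len(level), level.categories.equals(j.categories))
```

The codes are created directly as `arange(N)` rather than being recomputed by looking each category up in the category list, which is the point of the workaround above.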

The remaining 50 ms are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be that shallow: it looks like it boils down to yet another variation on the CategoricalIndex constructor call.

%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note, however, that passing only the index is fast:

%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
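The two constructor calls above can also be compared outside IPython. This is a rough sketch with plain `timeit`; the helper names `time_ms` and `compare_constructors` are made up for illustration, and the timings will depend on the pandas version, so no expected numbers are given. It also checks that both calls produce an equal index, which is what makes the slow path look avoidable.

```python
import timeit

import numpy as np
import pandas as pd


def time_ms(func, number=5):
    """Average wall-clock time of func() in milliseconds over `number` calls."""
    return timeit.timeit(func, number=number) / number * 1e3


def compare_constructors(level):
    """Compare re-passing categories/ordered explicitly with passing only the index."""

    def explicit():
        return pd.CategoricalIndex(
            level, categories=level.categories, ordered=level.ordered
        )

    def implicit():
        return pd.CategoricalIndex(level)

    same = explicit().equals(implicit())
    return same, time_ms(explicit), time_ms(implicit)


if __name__ == "__main__":
    level = pd.CategoricalIndex(np.linspace(-1, 1, 1_000_000))
    same, explicit_ms, implicit_ms = compare_constructors(level)
    print(f"equal result: {same}")
    print(f"explicit categories/ordered: {explicit_ms:.3f} ms")
    print(f"index only: {implicit_ms:.3f} ms")
```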

I would have liked to make a PR, but I don't fully understand the code and all that class overloading.
