PERF: dataframe construction from recarray is slow #44826

Closed
@GYHHAHA

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

Constructing a DataFrame directly from a recarray with pd.DataFrame(arr) is slow, while DataFrame.from_records(arr) on the same input is fast. The gap is unexpected, since both calls receive identical data.

Suppose the recarray is generated as follows:

import numpy as np
import pandas as pd

n = 7
size = 10**n  # 10 million rows
df = pd.DataFrame(
    {
        "A": np.random.rand(size),
        "B": np.random.rand(size),
        "C": ["a"] * size,
    }
)
arr = df.to_records(index=False)  # recarray without the index column

Time comparison:

# Nearly 1 minute
>>> df_new = pd.DataFrame(arr)
# Less than 0.1 second
>>> df_new = pd.DataFrame.from_records(arr)
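The comparison above can be reproduced with a small timing harness. This is a minimal sketch, not the measurement used in the report: the helper name time_call is hypothetical, and the row count is reduced to 10**5 (an assumption, to keep the run short), so the absolute numbers will differ from the ones quoted and depend on the pandas version installed.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn, *args):
    """Hypothetical helper: run fn(*args) once, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


n = 10**5  # assumption: far smaller than the 10**7 rows in the report
small = pd.DataFrame(
    {
        "A": np.random.rand(n),
        "B": np.random.rand(n),
        "C": ["a"] * n,
    }
)
small_arr = small.to_records(index=False)

# Same input, two construction paths.
via_ctor, t_ctor = time_call(pd.DataFrame, small_arr)
via_from_records, t_from_records = time_call(pd.DataFrame.from_records, small_arr)

print(f"pd.DataFrame(arr):              {t_ctor:.4f} s")
print(f"pd.DataFrame.from_records(arr): {t_from_records:.4f} s")
```

Both calls produce frames with the same shape and column names, so on affected versions from_records can serve as a workaround.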

The slowdown comes from pandas.core.internals.construction.rec_array_to_mgr, which passes the recarray to _get_names_from_index. That function iterates over every record of the array in a Python-level loop, which is unnecessary for a recarray. The relevant lines are the call and the loop:

index = _get_names_from_index(fdata)

has_some_name = any(getattr(s, "name", None) is not None for s in data)
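To illustrate why that check is expensive on a recarray, here is a minimal sketch that mirrors the quoted generator expression and counts how many records it inspects. The function count_name_probe and the 1000-row test array are assumptions made for illustration; they are not pandas code. Since the records here carry no name attribute, any() never short-circuits and every element is visited.

```python
import numpy as np


def count_name_probe(data):
    """Hypothetical mirror of the quoted check that also counts visits."""
    visited = 0
    found = False
    for s in data:  # iterating a recarray yields one record object per row
        visited += 1
        if getattr(s, "name", None) is not None:
            found = True
            break
    return found, visited


# Assumed small recarray with fields "A" and "B" (no field called "name").
records = np.rec.fromarrays(
    [np.arange(1000.0), np.arange(1000.0)], names=["A", "B"]
)

found, visited = count_name_probe(records)
print(found, visited)
```

The cost is linear in the number of rows, and each step creates a Python-level record object, which is why it dominates at 10**7 rows.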

I believe this function was designed for nested data, since it is also called in nested_data_to_arrays:

index = _get_names_from_index(data)

A possible fix is therefore to use default_index(len(data)) directly for the recarray case.
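The idea behind the proposed fix can be sketched with public API only. This is an illustration under assumptions, not the actual patch: default_index is a pandas-internal helper, so the sketch substitutes pd.RangeIndex, and the function name default_index_for is hypothetical. It builds the index from the length alone, without inspecting any record, and compares it with the index that from_records produces.

```python
import numpy as np
import pandas as pd


def default_index_for(data):
    """Hypothetical helper: default index from the length, no per-record scan."""
    return pd.RangeIndex(len(data))


# Assumed small recarray for illustration.
records = np.rec.fromarrays(
    [np.arange(5.0), np.arange(5.0)], names=["A", "B"]
)

idx = default_index_for(records)
reference = pd.DataFrame.from_records(records).index

print(idx.equals(reference))
```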

Installed Versions

1.3.4

Prior Performance

No response


Labels: Needs Triage (issue that has not been reviewed by a pandas team member), Performance (memory or execution speed performance)
