Description
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
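Constructing a DataFrame directly from a recarray with pd.DataFrame() is very slow, while DataFrame.from_records() on the same array is fast. Both take the same input and produce the same frame, so the gap is unexpected.
Suppose the recarray is generated as follows: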
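import numpy as np
import pandas as pd

n = 7
size = 10**n  # number of rows
df = pd.DataFrame(
    {
        "A": np.random.rand(size),
        "B": np.random.rand(size),
        "C": ["a"] * size,
    }
)
# Convert to a NumPy recarray, dropping the index
arr = df.to_records(index=False)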
Time comparison:
# Nearly 1 minute
>>> df_new = pd.DataFrame(arr)
# Less than 0.1 second
>>> df_new = pd.DataFrame.from_records(arr)
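For anyone who wants to reproduce the gap without waiting a minute, a smaller self-contained version of the comparison might look like the sketch below. The reduced `n = 5` and the `timed` helper are my own additions for illustration, not part of the original measurement, and the absolute timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Assumption: n is reduced from 7 to 5 only to keep the run short.
n = 5
size = 10**n
df = pd.DataFrame(
    {
        "A": np.random.rand(size),
        "B": np.random.rand(size),
        "C": ["a"] * size,
    }
)
arr = df.to_records(index=False)


def timed(constructor, data):
    """Hypothetical helper: call constructor(data) and return (result, seconds)."""
    start = time.perf_counter()
    result = constructor(data)
    return result, time.perf_counter() - start


via_constructor, constructor_seconds = timed(pd.DataFrame, arr)
via_from_records, from_records_seconds = timed(pd.DataFrame.from_records, arr)

print(f"pd.DataFrame(arr):              {constructor_seconds:.4f} s")
print(f"pd.DataFrame.from_records(arr): {from_records_seconds:.4f} s")
```

Both calls should yield frames with the same columns and shape, so on an affected version only the elapsed time should differ.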
The slowdown comes from pandas.core.internals.construction.rec_array_to_mgr. The following code passes the recarray into _get_names_from_index, which loops over every row of the array in Python. For a recarray this loop is large and completely unnecessary.
pandas/pandas/core/internals/construction.py
Line 179 in a3702e2
pandas/pandas/core/internals/construction.py
Line 724 in a3702e2
I also believe _get_names_from_index was designed for nested data, since it is called in nested_data_to_arrays:
pandas/pandas/core/internals/construction.py
Line 518 in a3702e2
So a possible fix is to use default_index(len(data)) directly instead of calling _get_names_from_index for recarrays.
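To illustrate why a per-row scan is wasteful here, the sketch below uses a hypothetical stand-in, `names_by_row_scan`, that checks each record for a `name` attribute before falling back to a default index. It is a simplified illustration and not pandas' actual source. `pd.RangeIndex(len(arr))` stands in for the internal `default_index(len(data))`, on the assumption that the two produce equivalent indexes.

```python
import time

import numpy as np
import pandas as pd


def names_by_row_scan(data):
    """Hypothetical stand-in for a per-row name scan (not pandas' real code)."""
    # Records in a recarray carry no `name`, so this visits every row for nothing.
    if not any(getattr(row, "name", None) is not None for row in data):
        return pd.RangeIndex(len(data))
    return pd.Index([getattr(row, "name", None) for row in data])


size = 10**5  # assumption: kept small so the scan finishes quickly
arr = pd.DataFrame(
    {"A": np.random.rand(size), "B": np.random.rand(size)}
).to_records(index=False)

start = time.perf_counter()
scanned = names_by_row_scan(arr)
scan_seconds = time.perf_counter() - start

start = time.perf_counter()
direct = pd.RangeIndex(len(arr))
direct_seconds = time.perf_counter() - start

print(f"row scan:     {scan_seconds:.4f} s")
print(f"direct index: {direct_seconds:.6f} s")
```

Both paths end up with the same index for a recarray; the scan only adds time proportional to the number of rows.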
Installed Versions
1.3.4
Prior Performance
No response