
Using pd.DataFrame(tensor) is abnormally slow; you can make the following modifications #44616

Closed
@YeahNew

Description


  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import time

import numpy as np
import pandas as pd
import torch

row = 700000
col = 64
val_numpy = np.random.rand(row, col)
val_tensor = torch.randn(row, col)

numpy_pd_start_time = time.time()
va_numpy_pd = pd.DataFrame(val_numpy)
numpy_pd_end_time = time.time()
print("numpy to pd time:{:.4f}s".
      format(numpy_pd_end_time - numpy_pd_start_time))

tensor_numpy_pd_start_time = time.time()
val_tensor_pd1 = pd.DataFrame(val_tensor.numpy())
tensor_numpy_pd_end_time = time.time()
print("tensor to numpy to pd time:{:.4f} s".
      format(tensor_numpy_pd_end_time - tensor_numpy_pd_start_time))

tensor_pd_start_time = time.time()
val_tensor_pd2 = pd.DataFrame(val_tensor)
tensor_pd_end_time = time.time()
print("tensor to pd time:{:.4f} s".
      format(tensor_pd_end_time - tensor_pd_start_time))

Issue Description

Recently I noticed that using pd.DataFrame() to convert a torch.Tensor to a pandas DataFrame is very slow, while converting the tensor to NumPy first and then to a DataFrame is very fast. The test code is shown in the Reproducible Example.
The code prints the following:

numpy to pd time: 0.0013s
tensor to numpy to pd time:0.0005s
tensor to pd time:220.5251s

Then I read the source code and found that if the data passed to pd.DataFrame() is a tensor, it is processed as list-like (line 682 in https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py).
Most of the time is spent in the following three stages:

data = list(data): 2.5952s
nested_data_to_arrays: 214.7532s
arrays_to_mgr: 2.5987s

In the nested_data_to_arrays stage, a large number of data type conversions are involved: the row list is converted to a column list, and the data is read row by row, which takes a long time.
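
For anyone who wants to confirm this breakdown locally, profiling the constructor call works (assuming the variables from the Reproducible Example above are defined at module level in a script; a smaller row count is enough for a sanity check):

import cProfile

# Profile the slow constructor call; the cumulative-time column should be
# dominated by nested_data_to_arrays, matching the stage breakdown above.
cProfile.run("pd.DataFrame(val_tensor)", sort="cumtime")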

Sure, this usage may not be appropriate, but torch.Tensor is now widely used, and it is inevitable that it will be passed to pd.DataFrame() directly like this, resulting in low efficiency. So could you add a comment at line 467 in frame.py, something like: if data is a torch.Tensor, convert it to NumPy first (tensor.numpy())?
Or could I submit a PR? When the input is detected to be a tensor, perform the conversion first, and then fall through to the "elif isinstance(data, (np.ndarray, Series, Index))" branch; a rough sketch follows below.
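
A rough sketch of the idea (just an illustration, not the actual pandas code; the helper name here is made up): a CPU tensor without gradients already exposes the __array__ protocol, so it could be coerced to an ndarray up front and then take the existing fast np.ndarray branch.

import numpy as np

def _coerce_array_like(data):
    # Hypothetical helper: objects that are not ndarrays but expose the
    # __array__ protocol (e.g. a CPU torch.Tensor without grad) are
    # converted once, so pd.DataFrame() then takes the fast ndarray path.
    if not isinstance(data, np.ndarray) and hasattr(data, "__array__"):
        return np.asarray(data)
    return data

# pd.DataFrame(_coerce_array_like(val_tensor)) is then about as fast as
# pd.DataFrame(val_tensor.numpy()).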

Looking forward to your reply ~

Installed Versions

pandas.__version__ == 1.3.4


Labels

Bug, Compat (pandas objects compatibility with Numpy or Python functions), Constructors (Series/DataFrame/Index/pd.array constructors), Performance (memory or execution speed performance)
