How to Speed Up Pandas with a One-Line Change Using Modin?

Last Updated : 23 May, 2024

In this article, we are going to see how to speed up pandas computations using the Modin library. Modin is a Python library whose syntax is almost identical to that of pandas and which is designed to scale to large datasets, including, depending on the engine and configuration, datasets that cannot fit into RAM in one go. pandas is fast enough for datasets ranging from megabytes to a few gigabytes, but with really large datasets the time needed to process the data becomes the bottleneck.

pandas was designed to run on a single core, whereas almost every personal laptop today comes with at least 2 cores. Modin exploits this by executing operations on all available cores, which speeds up the whole process.

To install Modin and all its dependencies, use any of the pip commands below.

pip install "modin[ray]"

Or,

pip install "modin[dask]"

Or,

pip install "modin[all]"
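Once Modin is installed, the one-line change from the title is the import statement. The snippet below is a minimal sketch of that swap on a small made-up DataFrame. The fallback to plain pandas when Modin is missing is an assumption added for illustration, not something Modin requires.

```python
# The "one-line change": import Modin's drop-in module in place of pandas.
# The try/except fallback is an illustrative addition, so the script still
# runs on a machine where Modin is not installed.
try:
    import modin.pandas as pd  # instead of: import pandas as pd
except ImportError:
    import pandas as pd

# Small made-up data with one missing value.
df = pd.DataFrame({"Name": ["Tom", "nick", "krish"],
                   "Age": [20, None, 19]})

# The rest of the code is written exactly as it would be for pandas.
filled = df.fillna(0)
mean_age = filled["Age"].mean()
print(mean_age)
```

Everything after the import stays the same, which is why existing pandas scripts can usually be tried with Modin without further edits.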

To limit the number of CPUs Modin uses, add the two lines of code below to your script, before importing Modin.

import os

# this specifies the number of
# CPUs to use. 
os.environ["MODIN_CPUS"] = "2"
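As a small extension of the two lines above, the sketch below caps the requested value at the number of cores the operating system reports. The helper name limit_modin_cpus is hypothetical, and the advice to set the variable before importing Modin is a cautious assumption: setting it first guarantees the value is in place when Modin starts its engine.

```python
import os


def limit_modin_cpus(requested: int) -> int:
    """Set MODIN_CPUS, capped at the available core count.

    Hypothetical helper for illustration; returns the value actually used.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    cpus = max(1, min(requested, available))
    os.environ["MODIN_CPUS"] = str(cpus)
    return cpus


used = limit_modin_cpus(2)
print(f"MODIN_CPUS set to {used}")

# Import Modin only after the variable is set, for example:
# import modin.pandas as pd
```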

Example 1: DataFrame Append Operation

Append operations are very common in pandas. The code below appends a DataFrame to itself 10 times, first with pandas and then with Modin, and uses the time module to measure both so the speedup can be compared. In this run Modin turns out to be about 25x faster than pandas, as it uses all the cores available on the system. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, concat() is the equivalent operation.

Code:

Python3

import pandas as pd
import modin.pandas as mpd
import time

start = time.time()

# Creating a Custom Dataframe
data = {'Name': ['Tom', 'nick', 'krish', 'jack',
                 'ash', 'singh', 'shilpa', 'nav'],
        'Age': [20, 21, 19, 18, 6, 12, 18, 20]}

df = pd.DataFrame(data)

# Appending the dataframe to itself 10 times.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
for _ in range(10):
    df = pd.concat([df, df])
    
end = time.time()
print(f"Pandas Appending Time :{end-start}")

start = time.time()
modin_df = mpd.DataFrame(data)

# Appending the dataframe to itself 10 times.
# mpd.concat replaces DataFrame.append, which was removed in pandas 2.0.
for _ in range(10):
    modin_df = mpd.concat([modin_df, modin_df])
    
end = time.time()
print(f"Modin Appending Time :{end-start}")

Output:

Pandas Appending Time :0.682852745056152
Modin Appending Time :0.027661800384521484
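A single measurement with time.time() on a small DataFrame can be noisy, so the ratio above may vary between runs and machines. The sketch below shows one way to make such timings more repeatable using time.perf_counter() and the best of several runs. The helper time_call and the stand-in workload are hypothetical names introduced for illustration; replace the workload with the pandas or Modin operation being measured.

```python
import time


def time_call(func, repeats=5):
    """Run func() `repeats` times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def workload():
    # Hypothetical stand-in for the operation being benchmarked.
    return sum(i * i for i in range(100_000))


elapsed = time_call(workload)
print(f"Best of 5 runs: {elapsed:.6f} s")
```

Taking the minimum rather than a single reading reduces the influence of one-off delays such as other processes competing for the CPU.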

Example 2: Modin is 4.4x faster than pandas

Here we are using a CSV file of size 602 MB, which can be downloaded from this link. The file has been renamed to demo.csv to keep the name short. The code below uses the fillna() method, which goes through the entire DataFrame and fills all NaN values with the desired value, which is 0 in this example.

Code: