How to Speed Up Pandas with a One-Line Change Using Modin?
Last Updated :
23 May, 2024
In this article, we will see how to speed up pandas computations using the Modin library. Modin is a Python library whose API is almost identical to that of pandas, and it is capable of handling huge datasets that cannot fit into RAM in one go. pandas is fast enough for datasets ranging from a few MB to a few GB, but with really large datasets the time taken to process the data becomes the bottleneck.
pandas was designed to run on a single core, yet nearly every personal laptop now comes with at least 2 cores. Modin exploits this by executing operations on all available cores, which speeds up the whole process. In most scripts the only change needed is the import line: replace import pandas as pd with import modin.pandas as pd.
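To install Modin and all its dependencies, use any of the pip commands below. The extra in square brackets selects the execution engine (Ray or Dask), and all installs every supported engine.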
pip install modin[ray]
Or,
pip install modin[dask]
Or,
pip install modin[all]
To limit the number of CPUs Modin uses, add the two lines below near the top of your script, before Modin is imported.
import os
# this specifies the number of
# CPUs to use.
os.environ["MODIN_CPUS"] = "2"
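A slightly more defensive variant is sketched below. The helper name limit_modin_cpus is hypothetical and not part of Modin; it only clamps the requested count to what os.cpu_count() reports and then writes the MODIN_CPUS environment variable shown above.

```python
import os


def limit_modin_cpus(requested: int) -> int:
    """Hypothetical helper (not part of Modin): clamp and export MODIN_CPUS."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    cpus = max(1, min(requested, available))
    # Environment variable values must be strings.
    os.environ["MODIN_CPUS"] = str(cpus)
    return cpus


# Call this before `import modin.pandas`, so that Modin sees the setting.
chosen = limit_modin_cpus(2)
print(f"MODIN_CPUS set to {chosen}")
```

Clamping avoids asking for more cores than the machine has, which matters when the same script runs on laptops with different core counts.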
Example 1: DataFrame Append Operation
append() operations are very common in pandas. The code below appends a DataFrame to itself 10 times, first with pandas and then with Modin, and uses the time module to time both runs for comparison. On the author's system, where Modin uses all available cores, Modin comes out roughly 25x faster than pandas in this case. Note that DataFrame.append() was removed in pandas 2.0, so the code as written needs an older pandas release.
Code:
Python3
import pandas as pd
import modin.pandas as mpd
import time

start = time.time()
# Creating a Custom Dataframe
data = {'Name': ['Tom', 'nick', 'krish', 'jack',
                 'ash', 'singh', 'shilpa', 'nav'],
        'Age': [20, 21, 19, 18, 6, 12, 18, 20]}
df = pd.DataFrame(data)
# Appending the dataframe to itself 10 times.
for _ in range(10):
    df = df.append(df)
end = time.time()
print(f"Pandas Appending Time :{end-start}")

start = time.time()
modin_df = mpd.DataFrame(data)
# Appending the dataframe to itself 10 times.
for _ in range(10):
    modin_df = modin_df.append(modin_df)
end = time.time()
print(f"Modin Appending Time :{end-start}")
Output:
Pandas Appending Time :0.682852745056152
Modin Appending Time :0.027661800384521484
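On pandas 2.0 or later, the same doubling loop can be written with pd.concat. The following is a minimal sketch using plain pandas only; the assumption is that swapping the import for modin.pandas works the same way, since Modin also exposes concat. Timings from this variant are not comparable with the output shown above.

```python
import time

import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack',
                 'ash', 'singh', 'shilpa', 'nav'],
        'Age': [20, 21, 19, 18, 6, 12, 18, 20]}

start = time.perf_counter()
df = pd.DataFrame(data)
# Doubling the DataFrame 10 times with concat instead of append.
# Like append's default, this keeps the original (repeated) index labels.
for _ in range(10):
    df = pd.concat([df, df])
elapsed = time.perf_counter() - start

print(f"pandas concat time: {elapsed:.4f} s, rows: {len(df)}")
```

Each pass doubles the row count, so the 8 starting rows grow to 8 * 2**10 rows after the loop.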
Example 2: fillna() Operation
Here we use a 602 MB CSV file, which can be downloaded from this link and has been renamed to demo.csv to keep the name short. The code below uses the fillna() method, which goes through the entire DataFrame and replaces every NaN value with the desired value, 0 in this example. On the author's system, Modin is about 4.4x faster than pandas here.
Code:
Python3
import pandas as pd
import modin.pandas as mpd
import time

# Reading demo.csv file into pandas df
df = pd.read_csv("demo.csv")
# Timing only the fillna call, not the file read
s = time.time()
df = df.fillna(value=0)
e = time.time()
print(f"Pandas fillna Time: {e - s}")

# Reading demo.csv file into modin df
modin_df = mpd.read_csv("demo.csv")
s = time.time()
modin_df = modin_df.fillna(value=0)
e = time.time()
print(f"Modin fillna Time: {e - s}")
Output:
Pandas fillna Time: 1.2 sec
Modin fillna Time: 0.27 sec
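The speedup figures quoted in both examples are simply the ratio of the two elapsed times. The sketch below reproduces that arithmetic from the timings printed above and adds a small context manager for timing a block of code. The names timed and speedup are hypothetical helpers, and time.perf_counter() is swapped in for time.time() because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Hypothetical helper: store the elapsed seconds of a block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Timings copied from the outputs of Example 1 and Example 2.
append_speedup = speedup(0.682852745056152, 0.027661800384521484)
fillna_speedup = speedup(1.2, 0.27)
print(f"append: {append_speedup:.1f}x, fillna: {fillna_speedup:.1f}x")

# Usage: time any block of code, here a stand-in workload.
results = {}
with timed("sum", results):
    total = sum(range(1_000_000))
print(f"sum took {results['sum']:.4f} s")
```

Single runs like these are noisy, so repeating each measurement a few times and comparing the best or median time gives a more trustworthy ratio.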