Removing Blank Strings from a PySpark Dataframe
Last Updated: 28 Apr, 2025
Cleaning and preprocessing data is a crucial step before it can be used for analysis or modeling. One common task in data preparation is removing empty strings from a Spark dataframe. A Spark dataframe is a distributed collection of data organized into rows and columns. It can be processed with parallel and distributed algorithms, making it an efficient and powerful tool for large-scale data processing and analysis. Dataframes are a fundamental part of the Apache Spark ecosystem and are widely used in big data processing and analytics. Removing empty strings ensures the data is accurate, consistent, and ready for downstream tasks.
Procedure to Remove Blank Strings from a Spark Dataframe using Python
To remove blank strings from a Spark DataFrame, follow these steps:
- Load the data into a Spark dataframe, either with the spark.read.csv() method or by creating an RDD and converting it to a dataframe with the toDF() method.
- Once the data is loaded, identify the columns that contain empty strings using the df.columns attribute and the df.select() method.
- Then, use the df.filter() method to remove rows that have empty strings in the relevant columns. For example, df.filter(df.Name != '') filters out rows that have empty strings in the "Name" column.
- Finally, use the df.show() method to view the resulting dataframe and confirm that it no longer contains empty strings (a consolidated sketch of these steps follows this list).
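Putting the steps together, a minimal sketch of the whole procedure might look like the following. The file name people.csv is a hypothetical placeholder, assuming a CSV file with a header row and a "Name" column:
Python3
# minimal sketch of the full procedure; 'people.csv' and the
# column name 'Name' are hypothetical placeholders
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_app').getOrCreate()

# step 1: load the data into a dataframe
df = spark.read.csv('people.csv', header=True, inferSchema=True)

# step 2: inspect the columns
print(df.columns)

# step 3: drop rows whose 'Name' column is an empty string
df = df.filter(df.Name != '')

# step 4: confirm the result
df.show()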
Example 1.
Creating a dataframe for demonstration.
Python3
# import the necessary libraries
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()

# create the dataframe
df = spark.createDataFrame([
    ('John', 23, 'Male'),
    ('', 25, 'Female'),
    ('Jane', 28, 'Female'),
    ('', 30, 'Male')
], ['Name', 'Age', 'Gender'])

# examine the dataframe
df.show()
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 23|  Male|
|    | 25|Female|
|Jane| 28|Female|
|    | 30|  Male|
+----+---+------+
To remove rows that contain blank strings in the "Name" column, you can use the following code:
Python3
# filter out rows with a blank string
# in the 'Name' column
df = df.filter(df.Name != '')
# Examine df
df.show()
Output:
+----+---+------+
|Name|Age|Gender|
+----+---+------+
|John| 23|  Male|
|Jane| 28|Female|
+----+---+------+
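Note that df.Name != '' only removes exact empty strings. If whitespace-only values (for example, a name of ' ') should also be treated as blank, one possible variant is to trim the column before comparing; this assumes whitespace-only names should be removed too, which goes beyond the original example:
Python3
# variant: also treat whitespace-only values as blank by
# trimming the column before the comparison (an assumption
# about the data, not part of the original example)
from pyspark.sql.functions import trim

df = df.filter(trim(df.Name) != '')
df.show()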
Example 2.
Creating a dataframe for demonstration.
Python3
# import the necessary libraries
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()

# create the dataframe
df = spark.createDataFrame([
    ('John', 23, 'Male', '123 Main St.'),
    ('', 25, 'Female', '456 Market St.'),
    ('Jane', 28, 'Female', '789 Park Ave.'),
    ('', 30, 'Male', '')
], ['Name', 'Age', 'Gender', 'Address'])

# examine the dataframe
df.show()
Output:
+----+---+------+--------------+
|Name|Age|Gender|       Address|
+----+---+------+--------------+
|John| 23|  Male|  123 Main St.|
|    | 25|Female|456 Market St.|
|Jane| 28|Female| 789 Park Ave.|
|    | 30|  Male|              |
+----+---+------+--------------+
To remove rows that contain blank strings in the "Name" and "Address" columns, you can use the following code:
Python3
# filter out rows with blank strings
# in the "Name" and "Address" columns
df = df.filter((df.Name != '') & (df.Address != ''))
# examine the dataframe
df.show()
Output:
+----+---+------+--------------+
|Name|Age|Gender|       Address|
+----+---+------+--------------+
|John| 23|  Male|  123 Main St.|
|Jane| 28|Female| 789 Park Ave.|
+----+---+------+--------------+
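A related point: because a comparison against NULL evaluates to NULL in Spark SQL, and filter() keeps only rows where the predicate is true, rows with NULL in either column are also dropped by the filter above. To spell that intent out explicitly, an equivalent version of the filter might look like this:
Python3
# an equivalent, more explicit version of the filter above:
# isNotNull() names the null case directly, although rows with
# NULL are already dropped by the != '' comparison alone
from pyspark.sql.functions import col

df = df.filter(
    col('Name').isNotNull() & (col('Name') != '') &
    col('Address').isNotNull() & (col('Address') != '')
)
df.show()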
Example 3.
Creating a dataframe for demonstration.
Python3
# import the necessary libraries
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# create a SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()

# create the dataframe
df = spark.createDataFrame([
    ('John', 23, 'Male', '123 Main St.', '555-1234'),
    ('', 25, 'Female', '456 Market St.', ''),
    ('Jane', 28, 'Female', '789 Park Ave.', '555-9876'),
    ('', 30, 'Male', '', '555-4321')
], ['Name', 'Age', 'Gender', 'Address', 'Phone'])

# examine the dataframe
df.show()
Output:
+----+---+------+--------------+--------+
|Name|Age|Gender|       Address|   Phone|
+----+---+------+--------------+--------+
|John| 23|  Male|  123 Main St.|555-1234|
|    | 25|Female|456 Market St.|        |
|Jane| 28|Female| 789 Park Ave.|555-9876|
|    | 30|  Male|              |555-4321|
+----+---+------+--------------+--------+
All the rows with empty strings may be filtered out as follows:
Python3
# filter out rows with blank strings in all the columns
df = df.filter(reduce(lambda x, y: x & y,
                      [col(c) != '' for c in df.columns]))
# examine the dataframe
df.show()
Output:
+----+---+------+--------------+--------+
|Name|Age|Gender|       Address|   Phone|
+----+---+------+--------------+--------+
|John| 23|  Male|  123 Main St.|555-1234|
|Jane| 28|Female| 789 Park Ave.|555-9876|
+----+---+------+--------------+--------+
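The reduce() call builds one combined predicate over all columns, but the same result can be reached by applying a single-column filter per column in a loop, which some readers may find easier to follow. This is an equivalent alternative, not the article's original approach:
Python3
# equivalent alternative: apply one single-column filter per
# column in a loop; each filter() returns a new dataframe,
# so this chains the same predicates as the reduce() version
for c in df.columns:
    df = df.filter(col(c) != '')
df.show()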
In conclusion, it is often necessary to remove rows that contain blank or empty strings from a Spark dataframe. This can be done with the df.filter() method, as illustrated in the examples above.