PySpark Row-Wise Function Composition
Last Updated: 28 Apr, 2025
PySpark is the Python interface for Apache Spark. While coding in PySpark, have you ever needed to apply a function row-wise across a data frame and store the result in a new column? In this article, we will discuss how to apply row-wise function composition to a PySpark data frame in Python.
PySpark Row-Wise Function Composition
The udf() method turns a Python lambda function into a user-defined function (UDF) that Spark applies to each row. Its first argument is the lambda to run, the column value the UDF is called on for a given row becomes the lambda's argument, and the second argument is the UDF's return type.
Syntax: udf(lambda #parameters: #action_to_perform_on_parameters, IntegerType())
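For example, a minimal sketch of this pattern (the row-sum operation and the column names in the commented usage line are only illustrative):
Python3
# Wrap a lambda as a UDF; when applied to a struct of columns, the lambda
# receives all of that row's values and returns an integer result
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

row_sum = udf(lambda row: sum(row), IntegerType())

# Hypothetical usage on a data frame df with columns "A" and "B":
# df.withColumn("Row sum", row_sum(struct(df["A"], df["B"])))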
First, import the required libraries, i.e. SparkSession, SQLContext, udf, struct, and IntegerType. The SparkSession library is used to create the session, while SQLContext is used to create the main entry point for the data frame. The udf function lets you write Python code and call it as though it were a SQL function, struct combines several columns into a single struct column so that the UDF can see the whole row at once, and IntegerType is the Spark SQL data type used as the UDF's return type.
Now, create a spark session using the getOrCreate function. Then, create a main entry point for the data frame using the SQLContext function. Next, either create a data frame using the createDataFrame function or read a CSV file using the read.csv function. Later on, define the functions that will be applied row-wise. Further, call each function on a struct of the data frame's columns with withColumn to create a new column with the desired heading. Finally, display the updated data frame.
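If the data lives in a CSV file rather than a Python list, a sketch of the read.csv route looks like this (the file name is only a placeholder, and spark_session is the session created in the first step):
Python3
# Read rows from a CSV file; header reads the first line as column names
# and inferSchema detects the column types
data_frame = spark_session.read.csv("rows.csv", header=True, inferSchema=True)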
Implementation:
In this example, we have created a data frame of 4 rows and 3 columns filled with the values 0 and 1. Then, we created two functions: count_zeros, which counts the number of zeros in each row, and count_ones, which counts the number of ones in each row. Finally, we created two new columns, 'Zero count' and 'One count', by calling the respective functions on a struct of the data frame's columns.
Python3
# Python program to implement PySpark
# row-wise function composition

# Import the SparkSession, SQLContext,
# udf, struct and IntegerType libraries
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a main entry point for the data
# frame using the SQLContext function
sqlContext = SQLContext(spark_session)

# Create a data frame using the createDataFrame function
data_frame = sqlContext.createDataFrame(
    [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)],
    ("X", "Y", "Z"))

# Create a UDF that counts the zeros in a row
count_zeros = udf(lambda row: len([i for i in row if i == 0]),
                  IntegerType())

# Create a UDF that counts the ones in a row
count_ones = udf(lambda row: len([j for j in row if j == 1]),
                 IntegerType())

# Apply count_zeros to a struct of the data frame's
# columns and create the column 'Zero count'
updated_data_frame_1 = data_frame.withColumn(
    "Zero count",
    count_zeros(struct([data_frame[x] for x in data_frame.columns])))

# Apply count_ones to a struct of the original columns only
# (excluding the new 'Zero count' column) and create 'One count'
updated_data_frame_2 = updated_data_frame_1.withColumn(
    "One count",
    count_ones(struct([updated_data_frame_1[x] for x in data_frame.columns])))

# Show the updated data frame
updated_data_frame_2.show()
Output:
+---+---+---+----------+---------+
|  X|  Y|  Z|Zero count|One count|
+---+---+---+----------+---------+
|  0|  0|  0|         3|        0|
|  1|  0|  0|         2|        1|
|  0|  1|  1|         1|        2|
|  1|  1|  1|         0|        3|
+---+---+---+----------+---------+
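Since a Python UDF sends every row to the Python interpreter, the same row-wise counts can also be written with built-in column expressions, which Spark can optimize. Below is a sketch of that alternative for the data frame created above (reduce comes from functools, when and lit from pyspark.sql.functions; the variable names are ours):
Python3
# Build the same counts with built-in expressions instead of a UDF
from functools import reduce
from pyspark.sql.functions import when, lit

columns = ["X", "Y", "Z"]

# For each row, add 1 for every column that equals 0 (or equals 1)
zeros_expr = reduce(lambda acc, c: acc + when(data_frame[c] == 0, 1).otherwise(0),
                    columns, lit(0))
ones_expr = reduce(lambda acc, c: acc + when(data_frame[c] == 1, 1).otherwise(0),
                   columns, lit(0))

no_udf_frame = (data_frame
                .withColumn("Zero count", zeros_expr)
                .withColumn("One count", ones_expr))
no_udf_frame.show()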