Append data to an empty dataframe in PySpark
Last Updated : 23 Jul, 2025
In this article, we are going to see how to append data to an empty DataFrame in PySpark in the Python programming language.
Method 1: Make an empty DataFrame and make a union with a non-empty DataFrame with the same schema
The union() function does most of the work in this approach. It combines two DataFrames that share the same column schema.
Syntax : FirstDataFrame.union(SecondDataFrame)
Returns : A DataFrame containing the rows of both DataFrames.
Example:
In this example, we create a DataFrame with a particular schema and data, create an EMPTY DataFrame with the same schema, and take the union of the two DataFrames using the union() function.
Python
# Importing PySpark and the SparkSession
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a
# DataFrame with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns1 = StructType([StructField('Name', StringType(), False),
                       StructField('Salary', IntegerType(), False)])

# Creating an empty DataFrame
first_df = spark_session.createDataFrame(data=emp_RDD,
                                         schema=columns1)

# Printing the DataFrame with no data
first_df.show()

# Hardcoded data for the second DataFrame
rows = [['Ajay', 56000], ['Srikanth', 89078],
        ['Reddy', 76890], ['Gursaidutt', 98023]]
columns = ['Name', 'Salary']

# Creating the DataFrame
second_df = spark_session.createDataFrame(rows, columns)

# Printing the non-empty DataFrame
second_df.show()

# Storing the union of first_df and
# second_df in first_df
first_df = first_df.union(second_df)

# Our first DataFrame that was empty,
# now has data
first_df.show()
Output :
+----+------+
|Name|Salary|
+----+------+
+----+------+
+----------+------+
| Name|Salary|
+----------+------+
| Ajay| 56000|
| Srikanth| 89078|
| Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
+----------+------+
| Name|Salary|
+----------+------+
| Ajay| 56000|
| Srikanth| 89078|
| Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
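Note that union() resolves columns by position, not by name. If the second DataFrame might list its columns in a different order, unionByName() (available since Spark 2.3) is the safer choice. A minimal sketch, reusing emp_RDD, columns1 and second_df from the example above:
Python
# unionByName() matches columns by name rather than by position,
# so second_df's columns may appear in any order
empty_df = spark_session.createDataFrame(data=emp_RDD, schema=columns1)
reordered_df = second_df.select('Salary', 'Name')
empty_df.unionByName(reordered_df).show()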
Method 2: Add a single row to an empty DataFrame by converting the row into a DataFrame
We can use createDataFrame() to turn a single row, supplied as a Python list, into a DataFrame. The details of createDataFrame() are:
Syntax : CurrentSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters :
- data : RDD, list, or pandas.DataFrame: The data from which the DataFrame is created.
- schema : str/list, optional: Contains a string or list of column names.
- samplingRatio : float, optional: The ratio of rows sampled when inferring the schema.
- verifySchema : bool, optional: Verify the data types of every row against the specified schema. The value is True by default.
Example:
In this example, we create a single-row DataFrame with a particular schema, create an EMPTY DataFrame with the same schema using createDataFrame(), take the union of the two DataFrames with the union() function, store the result in the formerly empty DataFrame, and use show() to see the changes.
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the DataFrame
added_df = spark_session.createDataFrame(added_row, columns)

# Storing the union of df and added_df
# back in df
df = df.union(added_df)

# Our DataFrame that was empty,
# now has data
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+
+--------------+--------+
| Stadium|Capacity|
+--------------+--------+
|Motera Stadium| 132000|
+--------------+--------+
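As an aside, the row to be appended can also be expressed as a pyspark.sql.Row object instead of a nested list. A minimal sketch, assuming Spark 3.0+ (where Row preserves the keyword order) and reusing the empty df created earlier in the example:
Python
from pyspark.sql import Row

# Building the same single-row DataFrame from a Row object;
# the schema is inferred from the Row's field names and values
row = Row(Stadium='Motera Stadium', Capacity=132000)
added_df = spark_session.createDataFrame([row])
df.unionByName(added_df).show()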
Method 3: Convert the empty DataFrame into a Pandas DataFrame and use the append() function
We will use toPandas() to convert the PySpark DataFrame to a Pandas DataFrame. Its syntax is :
Syntax : PySparkDataFrame.toPandas()
Returns : Corresponding Pandas DataFrame
We will then use the Pandas append() function. Its syntax is :
Syntax : PandasDataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
Parameters :
- other : Pandas DataFrame, Series, dict, or a list of these: The data to be appended.
- ignore_index : bool: If True, the result gets a fresh index with no relation to the indexes of the original DataFrames.
- sort : bool: Sort the columns if the column alignment of other and PandasDataFrame differs.
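One caveat before the example: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the code below only runs on older pandas versions. On current pandas, pd.concat() achieves the same result; a minimal sketch using the variable names from the example that follows:
Python
import pandas as pd

# pd.concat() is the replacement for the removed DataFrame.append();
# ignore_index=True gives the combined result a fresh index
df = pd.concat([df, pandas_added], ignore_index=True)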
Example:
Here we create an empty DataFrame where the data is to be added. We then convert the data to be added into a Spark DataFrame using createDataFrame(), convert both DataFrames to Pandas DataFrames using toPandas(), and use the append() function to add the non-empty DataFrame to the empty one, ignoring the indexes since we are building a new DataFrame. Finally, we convert the resulting Pandas DataFrame back to a Spark DataFrame using createDataFrame().
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the DataFrame whose data
# needs to be added
added_df = spark_session.createDataFrame(added_row,
                                         columns)

# Converting our PySpark DataFrames to
# Pandas DataFrames
pandas_added = added_df.toPandas()
df = df.toPandas()

# Using append() to add the data
df = df.append(pandas_added, ignore_index=True)

# Reconverting our DataFrame back
# to a PySpark DataFrame
df = spark_session.createDataFrame(df)

# Printing resultant DataFrame
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+
+--------------+--------+
| Stadium|Capacity|
+--------------+--------+
|Motera Stadium| 132000|
+--------------+--------+