
Create Your First DataFrame in PySpark

PySpark allows users to handle large datasets efficiently through distributed computing. Whether you’re new to Spark or looking to sharpen your skills, this article walks through how to create DataFrames and manipulate data effectively, unlocking the power of big data analytics with PySpark.

1. Introduction

1.1 What is PySpark?

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing system. It enables developers to perform large-scale data processing and analytics in a parallel and fault-tolerant manner. PySpark integrates seamlessly with Python, allowing data engineers and data scientists to leverage the power of Spark using a familiar programming language.

Apache Spark is renowned for its speed and scalability, making it a go-to framework for big data processing. It supports diverse workloads, such as batch processing, real-time streaming, machine learning, and graph processing. With PySpark, Python users can tap into Spark’s distributed computing capabilities without delving into the complexities of Java or Scala.

PySpark’s ecosystem is rich, offering tools for handling structured, semi-structured, and unstructured data. This makes it ideal for modern data workflows, including ETL processes, data warehousing, and advanced analytics. One of its standout features is the DataFrame, which brings a tabular data abstraction similar to pandas but optimized for distributed environments.

1.2 What is a DataFrame?

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in pandas but designed for large-scale data processing. DataFrames provide an easy-to-use API for performing common data manipulation tasks, such as filtering, grouping, and aggregations, with built-in optimizations for distributed computing.

In PySpark, DataFrames are immutable and distributed across a cluster. This allows operations on the data to be executed in parallel, resulting in high performance even for massive datasets. DataFrames can handle data from various sources, such as structured files (e.g., CSV, JSON), databases, or distributed storage systems like Hadoop.

1.2.1 Features

Key features of PySpark DataFrames include:

  • Schema: DataFrames have a defined schema that specifies the structure of data, including column names and data types.
  • Lazy Evaluation: Operations on DataFrames are executed only when an action (e.g., show(), collect()) is invoked, optimizing the processing pipeline (see the sketch after this list).
  • Interoperability: DataFrames can interoperate with other Spark components, such as SQL and machine learning libraries.
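
To make lazy evaluation concrete, here is a minimal sketch (the session name and sample data are purely illustrative): the filter and aggregation below only build a logical plan, and no Spark job runs until show() is invoked.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("LazyEvaluationDemo").getOrCreate()

# A tiny in-memory DataFrame used purely for illustration
people = spark.createDataFrame(
    [("Alice", 28), ("Bob", 35), ("Cathy", 29)],
    schema="name string, age int"
)

# Transformations: these only build a logical plan; nothing executes yet
adults = people.filter(people.age > 28)
avg_age = adults.agg(F.avg("age").alias("avg_age"))

# Action: show() triggers the actual Spark job and prints the result
avg_age.show()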

2. Creating Your First PySpark DataFrame

We assume that you have Apache Spark and PySpark installed on your system. If not, you can install PySpark using pip:

pip install pyspark

Alternatively, you can leverage the Databricks Community Edition to practice and enhance your PySpark skills.
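
If the installation succeeded, a quick sanity check (a minimal sketch, run from any Python shell) is to print the installed version and start a local session:

import pyspark
from pyspark.sql import SparkSession

# Confirm the installed PySpark version
print(pyspark.__version__)

# Start a throwaway local session to confirm Spark initializes correctly
spark = SparkSession.builder.master("local[*]").appName("InstallCheck").getOrCreate()
print(spark.version)
spark.stop()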

2.1 Code Example

Let’s walk through the process of creating a DataFrame in PySpark.

# Databricks notebook source
# MAGIC %md
# MAGIC # **Initializing PySpark**

# COMMAND ----------

from pyspark.sql import SparkSession
# Create a spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

# COMMAND ----------

spark

# COMMAND ----------

# Inspect the built-in documentation for spark.createDataFrame
help(spark.createDataFrame)

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create DataFrame from a list of rows**

# COMMAND ----------

from pyspark.sql import Row
from datetime import datetime, date

# create a PySpark DataFrame from a list of Row objects
# Spark infers the schema from the input data
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025,1,1), e=datetime(2025,1,1,12,0)),
    Row(a=2, b=3., c='string2', d=date(2025,2,1), e=datetime(2025,2,1,12,0)),
    Row(a=3, b=4., c='string3', d=date(2025,3,1), e=datetime(2025,3,1,12,0))
])

# COMMAND ----------

from pyspark.sql import Row
from datetime import datetime, date

# create a PySpark DataFrame from a list of rows, manually specifying the schema
df1 = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2025,1,1), e=datetime(2025,1,1,12,0)),
    Row(a=2, b=3., c='string2', d=date(2025,2,1), e=datetime(2025,2,1,12,0)),
    Row(a=3, b=4., c='string3', d=date(2025,3,1), e=datetime(2025,3,1,12,0))
], schema='a long, b double, c string, d date, e timestamp')


# COMMAND ----------

# show() triggers the Spark job(s) and displays the contents of the DataFrame
df1.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create DataFrame from an RDD**

# COMMAND ----------

columns = ["website", "language", "score"]
data = [("in28minutes", "java", "200"), ("trendytech", "bigdata", "200"), ("youtube", "sql", "300")]
rdd = spark.sparkContext.parallelize(data)

# COMMAND ----------

dfFromRdd = rdd.toDF()
dfFromRdd.printSchema()
dfFromRdd.show()

# note: no column names are passed to the toDF() method, so generic column names (_1, _2, ...) appear in the output.

# COMMAND ----------

dfFromRdd1 = rdd.toDF(columns)
dfFromRdd1.printSchema()
dfFromRdd1.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create DataFrame from a file**

# COMMAND ----------

help(spark.read)

# COMMAND ----------

df2 = (spark.read.format("csv")
       .option("header", "true")
       .load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog.csv"))
df2.printSchema()
df2.show()

# COMMAND ----------

df3 = (spark.read.format("csv")
       .option("header", "false")
       .option("inferSchema", "true")
       .load("dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog_NoHeader.csv"))
df3.printSchema()
df3.show()

# COMMAND ----------

# MAGIC %md
# MAGIC # **Create DataFrame from a list of dictionaries**

# COMMAND ----------

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [{"ID": 1, "Name": "Alice", "Age": 28},
        {"ID": 2, "Name": "Bob", "Age": 35},
        {"ID": 3, "Name": "Cathy", "Age": 29}]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema=schema)
df.show()

2.1.1 Code Explanation

The code above walks through several methods for creating DataFrames in PySpark. First, it initializes a Spark session using the SparkSession class, which is the entry point for working with PySpark. Once the session is established, DataFrames can be created from a variety of data sources. One common method is to use a list of Row objects, where PySpark automatically infers the schema from the provided data; alternatively, you can define the schema explicitly by specifying the name and data type of each column.

Another approach is to create DataFrames from Resilient Distributed Datasets (RDDs), the fundamental data structure in Spark. By converting an RDD to a DataFrame, you can specify column names or rely on Spark’s default naming convention if none are provided.
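
For completeness, the same DataFrame can also be produced by passing the RDD and the column names straight to createDataFrame; this brief sketch reuses the rdd and columns variables defined in the notebook above.

# Equivalent to rdd.toDF(columns): pass the RDD and the column names to createDataFrame
dfFromRdd2 = spark.createDataFrame(rdd, schema=columns)
dfFromRdd2.printSchema()
dfFromRdd2.show()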

Furthermore, PySpark enables you to load DataFrames directly from external files, such as CSV files, by using various options like setting headers or inferring schemas automatically from the file content.
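
As a side note, the same reads can be written with the csv() shorthand on the reader, which accepts these options as keyword arguments; the sketch below assumes the same DBFS paths used earlier.

# Shorthand for format("csv"): options are passed as keyword arguments
df2_alt = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog.csv",
    header=True
)
df3_alt = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/pysparklearning/ProductCatalog_NoHeader.csv",
    header=False,
    inferSchema=True
)
df3_alt.printSchema()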

Lastly, DataFrames can be constructed from a list of dictionaries, where you can also define the schema using StructType and StructField to precisely control the data types and structure. These methods offer flexibility when working with data in Spark, whether the source is in memory, from an external file, or a distributed RDD.

Since the code runs in a Jupyter notebook, the output is not captured inline here. You can, however, download the source code from the downloads section, open the file in Visual Studio Code, and view the output there.

3. Conclusion

PySpark is an essential tool for big data processing, offering both scalability and simplicity. Creating your first DataFrame is a great starting point for exploring its capabilities. As you become more familiar with PySpark, you can experiment with transformations, actions, and integrations with other big data tools.

4. Download the Code

You can download the Jupyter Notebook using the link provided: Create Your First DataFrame in PySpark

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).