PySpark - Create dictionary from data in two columns

Last Updated : 03 Jan, 2022

This article shows how to create a Python dictionary from the data in two columns of a PySpark DataFrame.

Method 1: Using Dictionary comprehension

Here we create a DataFrame with two columns and then convert it into a dictionary using a dictionary comprehension.

Python

# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName(
  'Practice_Session').getOrCreate()

# Creating a DataFrame using createDataFrame()
# method, with hard coded data.
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]

columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# A dictionary comprehension builds the result:
# the Name column supplies the keys and the
# Age column supplies the values.
# To reverse the key/value pairs, use
# {row['Age']: row['Name']
#  for row in df_pyspark.collect()} instead.


# collect() gives a list of
# rows in the DataFrame
result_dict = {row['Name']: row['Age'] 
               for row in df_pyspark.collect()}

# Printing a few key:value pairs of
# our final resultant dictionary
print(result_dict['John'])
print(result_dict['Michael'])
print(result_dict['Adam'])

Output (from the three print statements; the table printed by show() is omitted):

54
56
65
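
The code comments above mention reversing the key/value pairs. In this data set several people share an age, so a reversed dictionary silently keeps only the last name seen for each age. The sketch below illustrates this and one way around it. To run without Spark it uses plain (name, age) tuples as a stand-in for the Row objects returned by collect(); that stand-in and the variable names age_to_name and age_to_names are assumptions made for illustration.

```python
from collections import defaultdict

# Plain tuples standing in for the rows returned by df_pyspark.collect()
# (an assumption so that this sketch runs without a Spark session).
rows = [('John', 54), ('Adam', 65), ('Michael', 56), ('Kelsey', 37),
        ('Chris', 49), ('Jonathan', 28), ('Anthony', 26), ('Esther', 48),
        ('Rachel', 52), ('Joseph', 56), ('Richard', 49)]

# Naive reversal: a later row overwrites an earlier row with the same age.
age_to_name = {age: name for name, age in rows}

# Grouped reversal: every age maps to the list of all names with that age.
age_to_names = defaultdict(list)
for name, age in rows:
    age_to_names[age].append(name)

print(len(rows), len(age_to_name))  # 11 9
print(age_to_name[56])              # Joseph
print(age_to_names[56])             # ['Michael', 'Joseph']
```

The same caution applies to the Name-to-Age direction whenever the key column can contain repeated values.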

Method 2: Converting the PySpark DataFrame to pandas and using the to_dict() method

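Here we first convert the PySpark DataFrame to a pandas DataFrame and then call the pandas to_dict() method on it. Here are the details of to_dict():

Syntax: pandas.DataFrame.to_dict(orient='dict')

Parameters:

orient : str {‘dict’, ‘list’, ‘series’, ‘split’, ‘records’, ‘index’}
Determines the type of the values of the dictionary, that is, how rows and columns are arranged in the result.

Return: A Python dictionary corresponding to the DataFrame (with orient='records', a list of dictionaries, one per row).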
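
To make the effect of orient concrete, here is a minimal sketch that calls to_dict() with three of the listed values on a small pandas DataFrame. The two-row DataFrame and the variable names are assumptions chosen for brevity; only pandas is needed, not Spark.

```python
import pandas as pd

# A tiny stand-in for the DataFrame returned by toPandas()
df = pd.DataFrame({'Name': ['John', 'Adam'], 'Age': [54, 65]})

# Default orient='dict': column -> {row index -> value}
as_dict = df.to_dict()
# orient='list': column -> list of values
as_list = df.to_dict(orient='list')
# orient='records': a list with one dictionary per row
as_records = df.to_dict(orient='records')

print(as_dict)     # {'Name': {0: 'John', 1: 'Adam'}, 'Age': {0: 54, 1: 65}}
print(as_list)     # {'Name': ['John', 'Adam'], 'Age': [54, 65]}
print(as_records)  # [{'Name': 'John', 'Age': 54}, {'Name': 'Adam', 'Age': 65}]
```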

Python

# importing pyspark
# make sure you have installed
# the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName(
  'Practice_Session').getOrCreate()

# Creating a DataFrame using createDataFrame()
# method, with hard coded data.
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]

columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Convert the dataframe into
# dictionary
result = df_pandas.to_dict(orient='list')

# Print the dictionary
print(result)

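Output (the dictionary printed by the last statement; the table printed by show() is omitted):

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}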
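
With orient='list', the result maps each column name to a list of values rather than each name to an age. If a Name-to-Age dictionary is wanted, the two lists can be paired up with zip(). The sketch below assumes a result dictionary of the shape produced above, shortened to three rows; name_to_age is a hypothetical variable name.

```python
# Same shape as the output of to_dict(orient='list'), shortened to three rows
result = {'Name': ['John', 'Adam', 'Michael'], 'Age': [54, 65, 56]}

# Pair each name with the age at the same position
name_to_age = dict(zip(result['Name'], result['Age']))

print(name_to_age)  # {'John': 54, 'Adam': 65, 'Michael': 56}
```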

Method 3: By iterating over the columns of the DataFrame

Here we iterate over the columns and build a dictionary in which each key is a column name and each value is the list of values in that column.

For this, we first need to convert the PySpark DataFrame to a pandas DataFrame.

Python

# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName(
  'Practice_Session').getOrCreate()

# Creating a DataFrame using createDataFrame()
# method, with hard coded data.
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]

columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

result = {}

# Convert PySpark DataFrame to Pandas
# DataFrame
df_pandas = df_pyspark.toPandas()

# Traverse through each column
for column in df_pandas.columns:

    # Add key as column_name and
    # value as list of column values
    result[column] = df_pandas[column].values.tolist()

# Print the dictionary
print(result)

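Output (the dictionary printed by the last statement; the table printed by show() is omitted):

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}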
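
The loop in Method 3 can also be written as a single dictionary comprehension, and for this kind of data it gives the same dictionary as to_dict(orient='list') from Method 2. The following is a minimal sketch on a three-row pandas DataFrame built directly (an assumption, so that Spark is not required); the variable same is introduced only for the comparison.

```python
import pandas as pd

# A small stand-in for the DataFrame returned by df_pyspark.toPandas()
df_pandas = pd.DataFrame({'Name': ['John', 'Adam', 'Michael'],
                          'Age': [54, 65, 56]})

# Key: column name, value: list of that column's values
result = {column: df_pandas[column].tolist()
          for column in df_pandas.columns}

# Compare with the Method 2 approach
same = result == df_pandas.to_dict(orient='list')

print(result)  # {'Name': ['John', 'Adam', 'Michael'], 'Age': [54, 65, 56]}
print(same)    # True
```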

pranavhfs1
