Remove duplicates from a dataframe in PySpark

Last Updated : 16 Dec, 2021

In this article, we are going to drop duplicate data from a dataframe using PySpark in Python.

Before starting, we are going to create a dataframe for demonstration:

Python3

# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output:

Method 1: Using the distinct() method

The distinct() method removes the duplicate rows from the dataframe.

Syntax: dataframe.distinct()

where dataframe is the dataframe created from the nested lists using PySpark.

Example 1: Python program to drop duplicate data using the distinct() function.

Python3

print('distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()

Output:

Example 2: Python program to select distinct data in only two columns. We can use the select() function along with distinct() to get the distinct values from particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

Python3

# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()

Output:

Method 2: Using the dropDuplicates() method

Syntax: dataframe.dropDuplicates()

where dataframe is the dataframe created from the nested lists using PySpark.

Example 1: Python program to remove duplicate data from the employee table.
Python3

# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()

Output:

Example 2: Python program to remove duplicate values in specific columns.

Python3

# remove duplicate data
# using the dropDuplicates() function
# on two columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

Output: