# Selecting only numeric or string column names from a PySpark DataFrame

Last Updated: 22 Mar, 2023

In this article, we will discuss how to select only the numeric or string column names from a Spark DataFrame.

Methods Used:

- createDataFrame: This method is used to create a Spark DataFrame.
- isinstance: A built-in Python function used to check whether an object is an instance of a specified type.
- dtypes: An attribute that returns a list of (columnName, type) tuples, one for every column present in the DataFrame.
- schema.fields: Used to access the DataFrame's field metadata.

Method #1: In this method, the dtypes attribute is used to get the list of (columnName, type) tuples, which is then filtered on the type names.

```python
from datetime import date
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Creating a DataFrame from a list of Rows
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])

# Printing the DataFrame structure
print("DataFrame structure:", df)

# Getting the list of (columnName, type) tuples and printing it
dt = df.dtypes
print("dtypes result:", dt)

# Keeping only the columns whose type is string or bigint.
# The comprehension loops over every tuple in the dt list;
# item[0] is the column name and item[1] is the column type.
columnList = [item[0] for item in dt
              if item[1].startswith('string') or item[1].startswith('bigint')]
print("Result: ", columnList)
```

Output:

```
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
dtypes result: [('a', 'bigint'), ('b', 'string'), ('c', 'date')]
Result:  ['a', 'b']
```

Method #2: In this method, schema.fields is used to get the field metadata; each field's data type is then extracted from the metadata and compared against the desired types with isinstance.

```python
from datetime import date
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, LongType

# Initializing the Spark session
spark = SparkSession.builder.getOrCreate()

# Creating a DataFrame from a list of Rows
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])

# Printing the DataFrame structure
print("DataFrame structure:", df)

# Getting and printing the field metadata
meta = df.schema.fields
print("Metadata: ", meta)

# Keeping only the columns whose data type is StringType or LongType.
# The comprehension loops over every field;
# field.name is the column name and field.dataType is the column type.
columnList = [field.name for field in df.schema.fields
              if isinstance(field.dataType, (StringType, LongType))]
print("Result: ", columnList)
```

Output:

```
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
Metadata:  [StructField(a,LongType,true), StructField(b,StringType,true), StructField(c,DateType,true)]
Result:  ['a', 'b']
```
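Note that Method #2 as written matches only LongType, so integer, double, or decimal columns would be missed. A more general variant (a minimal sketch, not from the original article) is to test against NumericType, the common base class of all numeric Spark SQL types, so that a single isinstance check covers every numeric column alongside the string ones:

```python
from datetime import date
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import NumericType, StringType

spark = SparkSession.builder.getOrCreate()

# Same sample DataFrame as above
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])

# NumericType is the base class of ByteType, ShortType, IntegerType,
# LongType, FloatType, DoubleType and DecimalType, so this picks up
# every numeric column regardless of its width.
columnList = [field.name for field in df.schema.fields
              if isinstance(field.dataType, (StringType, NumericType))]
print("Result: ", columnList)  # ['a', 'b']
```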
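Either approach yields a plain Python list of column names. If the goal is to actually project those columns rather than just list their names, the list can be passed directly to select(), which accepts a list of column name strings. Continuing from any of the snippets above:

```python
# Project only the numeric and string columns found above.
df.select(columnList).show()
```

For the sample data used here, this prints something like:

```
+---+-------+
|  a|      b|
+---+-------+
|  1|string1|
|  2|string2|
|  4|string3|
+---+-------+
```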