Pyspark Dataframe - Map Strings to Numeric

Last Updated : 29 Aug, 2022

In this article, we are going to see how to map strings to numeric values in a PySpark dataframe.

Creating a dataframe for demonstration: Here we create rows of college names, pass them to the createDataFrame() method, and then display the dataframe.

Python3

# importing module
import pyspark

# importing SparkSession and Row from the pyspark.sql module
from pyspark.sql import SparkSession, Row

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of college data
dataframe = spark.createDataFrame([Row("vignan"), Row("rvrjc"),
                                   Row("klu"), Row("rvrjc"),
                                   Row("klu"), Row("vignan"),
                                   Row("iit")], ["college"])

# display dataframe
dataframe.show()

Output:

Method 1: Using map()

Here we define a function that converts a string to a numeric value and call it through a lambda expression.

Syntax:

dataframe.select("string_column_name").rdd.map(lambda x: string_to_numeric(x[0])).map(lambda x: Row(x)).toDF(["numeric_column_name"]).show()

where,

- dataframe is the pyspark dataframe
- string_column_name is the actual column to be mapped to numeric_column_name
- string_to_numeric is the function used to return the numeric value
- the lambda expression calls the function so that a numeric value is returned

Here we are going to map each college name to a numeric value using a lambda function and rename the resulting column to college_number.
For that, we create a function that checks the college name and returns numeric value 1 if the college is iit, 2 if it is vignan, 3 if it is rvrjc, and 4 if it is anything other than those three.

Python3

# function that converts a string to a numeric value
def string_to_numeric(x):
    # return numeric value 1 if college is iit
    if x == 'iit':
        return 1
    elif x == "vignan":
        # return numeric value 2 if college is vignan
        return 2
    elif x == "rvrjc":
        # return numeric value 3 if college is rvrjc
        return 3
    else:
        # return numeric value 4 if college
        # is other than the above three
        return 4

# map the numeric value by using a lambda
# function and rename the column to college_number
dataframe.select("college").rdd \
    .map(lambda x: string_to_numeric(x[0])) \
    .map(lambda x: Row(x)) \
    .toDF(["college_number"]).show()

Output:

Method 2: Using withColumn()

Here we use the withColumn() method together with when() to add the mapped column.

Syntax:

dataframe.withColumn("numeric_column", when(col("string_column") == 'value', 1).otherwise(value))

Where,

- dataframe is the pyspark dataframe
- string_column is the column to be mapped to numeric
- value is the numeric value

Example: Here we map each college name to a college number using the withColumn() method along with when().
Python3

# import the col and when functions
from pyspark.sql.functions import col, when

# map college name to college number
# using withColumn() along with when()
dataframe.withColumn("college_number",
                     when(col("college") == 'iit', 1)
                     .when(col("college") == 'vignan', 2)
                     .when(col("college") == 'rvrjc', 3)
                     .otherwise(4)).show()

Output:
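Both methods implement the same mapping. As a quick plain-Python sketch (no Spark required), the when()/otherwise() chain behaves like a dictionary lookup with a default value; the dictionary name college_to_number below is our own illustration, not part of the PySpark API:

Python3

# the same mapping as the when()/otherwise() chain,
# expressed as a dictionary lookup with a default
college_to_number = {"iit": 1, "vignan": 2, "rvrjc": 3}

def string_to_numeric(x):
    # any college not in the dictionary falls through to 4,
    # mirroring the otherwise(4) branch
    return college_to_number.get(x, 4)

colleges = ["vignan", "rvrjc", "klu", "rvrjc", "klu", "vignan", "iit"]
print([string_to_numeric(c) for c in colleges])  # [2, 3, 4, 3, 4, 2, 1]

A dictionary like this is convenient when the mapping has many entries, since each new college only needs a new key-value pair rather than another when() clause.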