In Apache Spark, transformations and actions are the two fundamental kinds of operations that define and execute Spark jobs. Understanding the difference between them is essential for designing and optimizing Spark applications effectively.
⏩What are Transformations in Spark?
• Transformations in Spark are operations applied to RDDs (Resilient Distributed Datasets) that produce a new RDD.
• When a transformation is applied to an RDD, it does not compute the result immediately. Instead, it creates a new RDD representing the transformed data and records the lineage (the chain of dependencies) between the original RDD and the transformed one.
• Transformations are lazily evaluated: Spark defers the actual computation until an action is triggered.
• Examples of transformations include map(), filter(), flatMap(), groupByKey(), reduceByKey(), sortByKey(), etc.
⏩What are Actions in Spark?
• Actions in Spark are operations that trigger the computation of a result from an RDD and return a non-RDD value (or write output to storage).
• When an action is invoked on an RDD, Spark walks that RDD's lineage, executes every transformation leading to it, and computes the result.
• Actions are eagerly evaluated: invoking one kicks off the actual computation in Spark.
• Examples of actions include collect(), count(), reduce(), saveAsTextFile(), foreach(), take(), first(), etc.