
Thursday, March 21, 2024

Spark Interview Question 2 - Difference between Transformation and Action in Spark?

In Apache Spark, transformations and actions are two fundamental concepts that play crucial roles in defining and executing Spark jobs. Understanding the difference between transformations and actions is essential for effectively designing and optimizing Spark applications.

What are Transformations in Spark?

πŸ‘‰Transformations in Spark are operations that are applied to RDDs (Resilient Distributed Datasets) to create a new RDD.
πŸ‘‰When a transformation is applied to an RDD, it does not compute the result immediately. Instead, it creates a new RDD representing the transformed data but keeps track of the lineage (dependencies) between the original RDD and the transformed RDD.
πŸ‘‰Transformations are lazily evaluated, meaning Spark delays the actual computation until an action is triggered.
πŸ‘‰Examples of transformations include map(), filter(), flatMap(), groupByKey(), reduceByKey(), sortByKey(), etc. (see the sketch below).
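
To make the laziness concrete, here is a minimal sketch using the Scala RDD API. The data and variable names are illustrative; in a Databricks notebook the SparkContext is already available as sc, so the SparkSession setup can be skipped.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("LazyTransformations").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 10)    // source RDD
val doubled = numbers.map(_ * 2)         // transformation: nothing is computed yet
val evens = doubled.filter(_ % 4 == 0)   // transformation: still nothing is computed
// At this point Spark has only recorded the lineage numbers -> doubled -> evens.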



What are Actions in Spark?

πŸ‘‰Actions in Spark are operations that trigger the computation of a result from an RDD and return a non-RDD value.
πŸ‘‰When an action is invoked on an RDD, Spark calculates the result of all transformations leading to that RDD based on its lineage and executes the computation.
πŸ‘‰Actions are eagerly evaluated, meaning they kick off the actual computation in Spark.
πŸ‘‰Examples of actions include collect(), count(), reduce(), saveAsTextFile(), foreach(), take(), first(), etc. (see the sketch below).
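
Continuing the sketch above, invoking an action forces Spark to execute the recorded lineage and return a result to the driver:

val result = evens.collect()     // action: the map and filter are executed now
println(result.mkString(", "))   // prints: 4, 8, 12, 16, 20
val total = evens.count()        // action: triggers the computation again (unless the RDD is cached)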


Friday, February 23, 2024

Databricks - How to find duplicate records in a DataFrame using Scala

In this tutorial, you will learn "How to find duplicate records in a DataFrame using Scala?" in Databricks.





In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’ŽInteractive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.
πŸ’ŽCluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.
πŸ’ŽImport Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.
πŸ’ŽData Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.
πŸ’ŽVisualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.
πŸ’ŽIntegration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.

Read a CSV file into a DataFrame and find duplicate records
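
Here is a minimal sketch of the approach, assuming a CSV file with a header row at the hypothetical path /FileStore/tables/employees.csv (in a Databricks notebook the SparkSession is already available as spark). It groups on all columns and keeps the rows that occur more than once:

import org.apache.spark.sql.functions.{col, count}

val df = spark.read
  .option("header", "true")       // first row holds the column names
  .option("inferSchema", "true")  // let Spark infer the column types
  .csv("/FileStore/tables/employees.csv")  // hypothetical path

// Group by every column and keep the groups that appear more than once.
val duplicates = df
  .groupBy(df.columns.map(col): _*)
  .agg(count("*").as("cnt"))
  .filter(col("cnt") > 1)

duplicates.show()

As a shortcut, df.exceptAll(df.dropDuplicates()) returns the extra copies of each duplicated row, which can be handy when you just need to inspect the offending records.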