
Wednesday, November 15, 2023

PySpark — Retrieve matching rows from two Dataframes

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

Working as a PySpark developer, data engineer, data analyst, or data scientist in any organisation requires you to be familiar with dataframes, because data manipulation is the act of transforming, cleansing, and organising raw data into a format that can be used for analysis and decision making.


For example, suppose you have some users’ data in dataframe-1 and new users’ data in dataframe-2, and you must find all the matching records between dataframe-1 and dataframe-2. In PySpark, you can retrieve matching rows from two Dataframes using the join operation. The join operation combines rows from two Dataframes based on a common column.

# Import SparkSession and column functions from the pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a Spark session and give the app a name
spark = SparkSession.builder.appName("RetrieveMatchingRows").getOrCreate()

Dataset 1: In this dataset, we have three columns (Name, Age, and Occupation) and a pre-defined schema for our PySpark dataframe, as given below — 

# Sample data for DataFrame1
dataset1 = [("Ryan Arjun", 25, "Engineer"),
            ("Kimmy Wang", 30, "Data Scientist"),
            ("Saurabh Yadav", 22, "Analyst")]

# Define the schema for DataFrame1
ds_schema1 = ["Name", "Age", "Occupation"]

PySpark Dataframe 1 from dataset 1 — In PySpark, we call the built-in createDataFrame function, which takes two parameters, data and schema, and pass in the above dataset1 and ds_schema1, as given below —


# Create DataFrames
df1 = spark.createDataFrame(dataset1, schema=ds_schema1)

### show the schema of the dataframe
df1.printSchema()
# Show the original DataFrames
print("DataFrame 1:")
df1.show()



Dataset 2: In this dataset, we have three columns (Name, Gender, and Country) and a pre-defined schema for our PySpark dataframe, as given below —

# Sample data for DataFrame2
dataset2 = [("Ryan Arjun", "Male", "Indian"),
            ("Kimmy Wang", "Female", "Japan"),
            ("Lovish Singh", "Male", "China")]

# Define the schema for DataFrame2
ds_schema2 = ["Name", "Gender", "Country"]

 

PySpark Dataframe 2 from dataset 2 — In PySpark, we call the built-in createDataFrame function again, passing the above dataset2 and ds_schema2 as the data and schema parameters, as given below —

# Create DataFrames for second dataset
df2 = spark.createDataFrame(dataset2, schema=ds_schema2)
### show the schema of the dataframe
df2.printSchema()
# Show the original DataFrames
print("DataFrame 2:") df2.show()


Get matching records from both dataframes — In this example, df1.join(df2, “Name”, “inner”) performs an inner join based on the “Name” column. The resulting DataFrame, joined_df, contains only the rows where the “Name” column is common in both Dataframes as given below —

# Join DataFrames based on the "Name" column
joined_df = df1.join(df2, "Name", "inner")
### show the schema of the dataframe
joined_df.printSchema()
# Show the original DataFrames
print("DataFrame with Matching rows:")
joined_df.show()



Note: You can adjust the join type (inner, left, right, full) based on your specific requirements. Additionally, if the column names are different in the two Dataframes, you can specify the join condition explicitly using the on parameter. You can adjust the join condition based on your specific use case and column names.
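As a hedged illustration using df1 and df2 from above, the sketch below shows a left join; the commented line shows how an explicit condition might look if df2 used a differently named key column (the "FullName" column is hypothetical and not part of the datasets above).

# Left join: keep every row from df1, matched against df2 where possible.
left_joined_df = df1.join(df2, on="Name", how="left")
left_joined_df.show()

# If df2 used a hypothetical "FullName" column instead of "Name", the join
# condition could be specified explicitly:
# matched_df = df1.join(df2, df1["Name"] == df2["FullName"], "inner")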

 

Now, you can see that it is just a piece of cake to get the matching records from both dataframes based on your matching keys.

Let’s learn more about the data validation side, which is one of the most important parts of data engineering.

Data validation — Data validation is the process of checking data against predefined rules and standards, such as data types, formats, ranges, and constraints; a minimal PySpark sketch follows the checklist below.

  1. 💫Schema Validation: Verify data adherence to predefined schemas, checking types, formats, and structures.
  2. 💫Integrity Constraints: Enforce rules and constraints to maintain data integrity, preventing inconsistencies.
  3. 💫Cross-Field Validation: Validate relationships and dependencies between different fields to ensure logical coherence.
  4. 💫Data Quality Metrics: Define and track quality metrics, such as completeness, accuracy, and consistency.
  5. 💫Automated Validation Scripts: Develop and run automated scripts to check data against predefined rules and criteria.
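As a rough illustrative sketch (not from the original post), the snippet below applies two of these checks to df1 from the earlier example: a basic schema validation and a per-column completeness metric; the expected_columns set is an assumption for this example.

# Rough sketch: two simple validation checks on df1 from the example above.
from pyspark.sql import functions as F

# 1. Schema validation: confirm the expected columns are present (assumed set).
expected_columns = {"Name", "Age", "Occupation"}
missing = expected_columns - set(df1.columns)
if missing:
    raise ValueError(f"Schema validation failed; missing columns: {missing}")

# 2. Data quality metric: completeness (share of non-null values) per column.
total_rows = df1.count()
completeness = df1.select(
    *[(F.count(F.col(c)) / total_rows).alias(c) for c in df1.columns]
)
completeness.show()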

 

To learn more, please follow us -
🔊 http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
🔊 http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
🔊 https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
🔊 https://twitter.com/macxima

Wednesday, November 8, 2023

Data Governance - Intelligent Data Quality

Data is a business asset for every organisation, one that must be audited and protected. To gain a business advantage, it becomes very urgent for every organisation to make sure that good-quality data is available to everyone who needs it.


Note: Poor data governance leads to multiple sources of truth, poorly defined KPIs, poor data quality, and an abundance of other issues.

 

Data Governance maintains data integrity and compliance, while Data Analysts provide insights for decision-making; both roles interact to offer reliable and applicable insights. Data quality is measured on five critical factors, each of which is equally important: 

  1. Data Quality is ensuring that data is correct, consistent, and devoid of "noise" that might obstruct usage and analysis. 
  2. Data Availability entails making data available and easily consumable by the business operations that require it. 
  3. Data Usability entails ensuring that data is properly formatted, documented, and labelled, that it can be easily searched and retrieved, and that it is compatible with technologies used by business users.
  4. Data Integrity is the process of ensuring that data keeps its important properties even when it is stored, processed, transmitted, and accessed across several systems.
  5. Data Security entails categorizing data based on its sensitivity and developing protocols for protecting information and avoiding data loss and leakage.

As we know, bad data can come from every area of our organization, in diverse forms, whether from business departments, sales, marketing, or engineering. Data quality initiatives are generally centered on improving these metrics so that data will promote optimal performance of business systems and support user faith in the systems’ reliability.  

 

Data governance is the act of forming internal standards—data policies—that govern how data is gathered, preserved, processed, and disposed of. It involves the duties that people must carry out, the standards that they must adhere to, and the technology that supports them throughout the data life cycle. 


Data Governance Strategy

A data governance strategy is a framework used by organizations to manage, safeguard, and assure the quality and compliance of their data assets. It is critical for ensuring data integrity, privacy, and security, as well as assisting organizations in making educated decisions based on accurate and trustworthy data. The following are critical elements of a data governance strategy:

  1. Executive Sponsorship: Data governance requires support from top management to allocate resources and enforce data policies. Establish an executive sponsor or steering committee responsible for overseeing the strategy.
  2. Data Governance Framework: Develop a framework that defines the roles, responsibilities, and processes for data governance. This should include data owners, stewards, and users.
  3. Data Governance Policies: Create clear and comprehensive data governance policies that outline data classification, access controls, data retention, and data quality standards.
  4. Data Inventory: Identify and catalog all data assets within your organization. This includes structured and unstructured data, databases, spreadsheets, files, and data stored in various systems.
  5. Data Stewardship: Assign data stewards responsible for data quality, metadata management, and ensuring that data is used appropriately.
  6. Data Quality Management: Implement data quality tools and processes to monitor and improve data quality. Define data quality metrics and thresholds.
  7. Data Security: Establish security measures to protect sensitive and confidential data. This includes encryption, access controls, and compliance with relevant data protection regulations (e.g., GDPR, HIPAA).
  8. Data Privacy: Ensure compliance with data privacy regulations, such as GDPR or CCPA. Develop policies and procedures for handling personal data and obtaining consent where necessary.
  9. Data Classification: Categorize data based on its sensitivity, value, and regulatory requirements. This helps determine appropriate access controls and security measures.
  10. Data Lifecycle Management: Define the stages of data from creation to archiving and deletion. Implement data retention and disposal policies.
  11. Data Catalog and Metadata Management: Create a data catalog that provides a centralized repository of data assets and their metadata. Metadata should include data lineage, definitions, and business context.
  12. Data Governance Tools: Choose and implement data governance tools and platforms that facilitate data management, data quality, and metadata management.
  13. Data Governance Training and Awareness: Train employees on data governance principles and policies to ensure understanding and compliance.
  14. Data Governance Metrics and KPIs: Define key performance indicators (KPIs) to measure the effectiveness of your data governance program. Track metrics related to data quality, compliance, and security.
  15. Data Governance Communication: Establish a communication plan to ensure that all stakeholders are informed about data governance initiatives and changes.
  16. Data Governance Continuous Improvement: Regularly review and update your data governance strategy to adapt to changing business needs and evolving data regulations.
  17. Audit and Compliance: Conduct regular audits to ensure that data governance policies and practices are being followed and maintain compliance with relevant regulations.
  18. Data Governance Culture: Foster a culture of data stewardship and responsibility throughout the organization to ensure that everyone understands the importance of data governance.
  19. Data Governance Steering Committee: Create a committee responsible for making decisions, resolving issues, and overseeing the data governance program.
  20. Data Governance Roadmap: Develop a roadmap that outlines the phased implementation of your data governance strategy.

 

Remember that a data governance strategy is an ongoing effort that requires continuous attention and improvement. It is crucial for organizations to adapt and evolve their data governance practices to meet changing data management challenges and compliance requirements.

 

Data Governance Roles

 

Effective data governance requires the assignment of specific roles and responsibilities to individuals or teams within an organization. These roles ensure that data is managed, protected, and used in a way that aligns with organizational objectives and complies with relevant regulations. Here are some common data governance roles and their responsibilities:

  1. Data Steward - Responsibilities:
    1. Manages and maintains data assets.
    2. Ensures data quality and accuracy.
    3. Defines and enforces data standards and policies.
    4. Monitors data usage and compliance.
    5. Collaborates with data owners and users.
  2. Data Owner - Responsibilities:
    1. Accountable for specific data sets or domains.
    2. Determines who has access to the data.
    3. Ensures data is used in alignment with business goals.
    4. Collaborates with data stewards on data quality.
  3. Data Custodian - Responsibilities:
    1. Manages the technical aspects of data storage and access.
    2. Implements data security and access controls.
    3. Maintains data infrastructure and databases.
    4. Collaborates with data stewards and data owners.
  4. Chief Data Officer (CDO) - Responsibilities:
    1. Sets the overall data strategy for the organization.
    2. Ensures data governance policies and practices are in place.
    3. Manages data-related risks and compliance.
    4. Drives data-driven decision-making.
  5. Data Governance Manager/Director - Responsibilities:
    1. Oversees the data governance program.
    2. Develops and enforces data governance policies.
    3. Coordinates with data stewards and data owners.
    4. Reports to senior management on data governance progress.
  6. Data Quality Analyst - Responsibilities:
    1. Monitors and assesses data quality.
    2. Defines data quality metrics and thresholds.
    3. Investigates and resolves data quality issues.
    4. Collaborates with data stewards and data owners.
  7. Data Compliance Officer - Responsibilities:
    1. Ensures compliance with data protection and privacy regulations.
    2. Develops and enforces data privacy policies.
    3. Manages data access requests and consent.
    4. Collaborates with legal and compliance teams.
  8. Data Architect - Responsibilities:
    1. Designs data models and data structures.
    2. Ensures data is organized and accessible.
    3. Collaborates with data stewards and data owners on data design.
    4. Implements data integration strategies
  9. IT Security Officer - Responsibilities:
    1. Ensures data security and access controls.
    2. Manages encryption and authentication mechanisms.
    3. Protects data against cybersecurity threats.
    4. Collaborates with data custodians and data owners.
  10.  Business Analyst - Responsibilities:
    1. Defines business requirements for data.
    2. Ensures data supports business processes.
    3. Collaborates with data owners and data stewards.
    4. Validates data for reporting and analytics.
  11. Data Governance Steering Committee - Responsibilities:
    1. Provides leadership and decision-making authority for data governance.
    2. Sets data governance priorities and policies.
    3. Resolves conflicts and issues related to data management.

 

The specific roles and their responsibilities may vary depending on the organization's size, structure, and industry. In some cases, a single individual may take on multiple roles, especially in smaller organizations. The key is to establish clear lines of responsibility and accountability to ensure that data governance is effectively implemented and maintained. 


To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

Data Engineering — Azure Databricks or Azure Synapse Analytics

 The cloud is the fuel that powers today’s digital companies, with businesses paying solely for the specific services or resources that they consume over time.

Azure Synapse Analytics bridges the gap between the data warehousing and big data analytics worlds by providing a uniform experience for ingesting, preparing, managing, and serving data for instant BI and machine learning needs.

 

Databricks is ideal for the “processing” layer, whereas Azure Synapse Analytics is ideal for the serving layer due to access control, active directory integration, and interaction with other Microsoft products.

 

Both Azure Databricks and Azure Synapse Analytics are used for machine learning; Synapse is also a data warehouse, and is therefore optimised for OLAP.

Azure Synapse Analytics vs Azure Databricks

Apache Spark powers both Databricks and Synapse Analytics. With optimized Apache Spark support, Databricks allows users to select GPU-enabled clusters that deliver faster data processing and higher data concurrency.


Azure Synapse Analytics is an umbrella term for a variety of analytics solutions. It is a combination of Azure Data Factory, Azure Synapse SQL Pools (essentially what was formerly known as Azure SQL Data Warehouse), and some added capabilities such as serverless Spark clusters and Jupyter notebooks, all within a browser IDE interface.

Azure Synapse architecture comprises the Storage, Processing, and Visualization layers. The Storage layer uses Azure Data Lake Storage, while the Visualization layer uses Power BI.

 

Azure Synapse Pipelines is a lightweight orchestrator and is ideal for basic extract-load procedures that require highly parameterized copy activities with ADLS Gen2 or dedicated SQL pool integration.

Some ideal features of Azure Synapse Analytics

  1. Azure Synapse offers cloud data warehousing, dashboarding, and machine learning analytics in a single workspace.
  2. It ingests all types of data, including relational and non-relational data, and it lets you explore this data with SQL.
  3. Azure Synapse uses massively parallel processing (MPP) database technology, which allows it to manage analytical workloads and to aggregate and process large volumes of data efficiently.
  4. It is compatible with a wide range of scripting languages like Scala, Python, .Net, Java, R, SQL, T-SQL, and Spark SQL.
  5. It facilitates easy integration with Microsoft and Azure solutions like Azure Data Lake, Azure Blob Storage, and more.
  6. It includes the latest security and privacy technologies such as real-time data masking, dynamic data masking, always-on encryption, Azure Active Directory authentication, and more.

Azure Synapse is a limitless analytics service that combines enterprise data warehousing and Big Data analytics. It allows you to query data on your own terms, using either serverless on-demand or provisioned resources — at scale.

Azure Databricks is built mostly on open-source software and runs on the cloud provider’s compute and storage. An advantage of Databricks is that it integrates easily into the Azure ecosystem, and it is substantially more streamlined and works right away. It is not fully knob-free, yet given what it can accomplish, it takes very little configuration.

 

Databricks, on the other hand, is a complete ecosystem built cloud-native. It supports SQL, Python, R, and Scala. It was built by the founders of Spark and comes with tools that improve Spark’s capabilities, e.g., in query performance and speed, and Delta Lake for a lake format with version control and the possibility to clone data between environments.

 

The Databricks architecture is not entirely a Data Warehouse. It follows a LakeHouse architecture that combines the best elements of Data Lakes and Data Warehouses for metadata management and data governance.

 

Azure Databricks offers streamlined workflows and an interactive workspace for collaboration between data scientists, data engineers, and business analysts.

 

Some ideal features of Azure Databricks — Databricks offers a lot more customizability, and it has some internal libraries that are useful for data engineers. For instance, Mosaic is a useful geospatial library that was created inside Databricks. Some of the ideal features of Databricks are given below —

  1. Databricks is more open, and the features it is releasing cover most needs, such as data governance, security, and change data capture.
  2. Databricks uses Python, Spark, R, Java, or SQL for performing Data Engineering and Data Science activities using notebooks.
  3. Databricks has AutoML, and instead of a black box at the end for inference, you receive a notebook with the code that built the model you want.
  4.  AutoML is a wonderful starting point for most ML use cases that can subsequently be customised; it’s also ideal for learning and transforming “citizen” Data Scientists into coding practitioners.
  5. Databricks compute will not auto-start, which means you have to leave clusters running to allow users to query Databricks data.
  6. Databricks has a CDC option on its tables that allows you to track changes. You can use this feature to get a list of rows that changed on a trillion-row Delta table (~billion a day); a sketch follows this list.
  7. Databricks is generally cheaper (cost for X performance), so it’s easier to keep a shared auto scaling cluster running in Databricks.
  8. Databricks provides a platform for integrated data science and advanced analysis, as well as secure connectivity for these domains.
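As a hedged sketch of the change-tracking feature mentioned in point 6 (Delta Lake Change Data Feed), the snippet below enables it on a hypothetical Delta table named events and reads the changed rows from an assumed starting version.

# Hedged sketch: enable Change Data Feed on a hypothetical Delta table "events".
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since an assumed table version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("events")
)
changes.show()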

 

Databricks also has Delta Sharing, which is an interesting idea for making it easier to integrate with your lakehouse. The biggest selling point for us is that Databricks has understood that data platforms today are about machine learning and advanced analytics.

 

In an ideal business scenario, you (data engineers) can use Databricks to build data pipelines that process data and save everything as Delta tables in ADLS Gen2, then use a Synapse Analytics serverless pool to consume those Delta tables for further data analysis and reports.
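As a minimal sketch of that pattern (the storage account, container, path, and the df_orders DataFrame are all hypothetical), a Databricks pipeline might persist its output as a Delta table in ADLS Gen2 like this:

# Hedged sketch: write an already-prepared DataFrame (df_orders, hypothetical)
# as a Delta table to a hypothetical ADLS Gen2 location.
output_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales/orders"

(
    df_orders.write
    .format("delta")
    .mode("overwrite")
    .save(output_path)
)

# A Synapse Analytics serverless SQL pool can then query the same Delta folder,
# e.g. via OPENROWSET(BULK '<path>', FORMAT = 'DELTA'), for analysis and reports.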

 

Azure Synapse and Databricks both support notebooks that help developers perform quick experiments. Synapse provides co-authoring of a notebook, with the condition that one person must save the notebook before the other person sees the changes.

 

However, Databricks Notebooks support real-time co-authoring along with automated version control.

 

Azure Synapse Analytics over Azure Databricks

  1. Azure Synapse Analytics is less costly to run than Azure Databricks when Spark pools are used appropriately. Databricks is expensive, yet it offers benefits that most users will never need.
  2. You pay per query/GB with Azure Synapse serverless. Because serverless compute is charged per query, Synapse appears to be a suitable fit for this model.
  3. If you are planning to do machine learning and data science later, then go with Synapse. Synapse is like Databricks, Data Factory, and SQL Server in one place.

 Azure Databricks over Azure Synapse Analytics

  1. Databricks comes with what can be seen as Spark improved with multiple optimizations, which can perform up to 50 times better. Databricks has built Photon, which will always outperform plain Spark.
  2. Azure Synapse has no version control in notebooks, whereas Databricks does.

 

Microsoft Azure Synapse Analytics is a scalable, cloud-based data warehousing solution and includes business intelligence, data analytics, and machine learning solutions for both relational and non-relational data.

 

Azure Databricks is a big data analytics service built on Apache Spark that is quick, simple, and collaborative. It is intended for data science and data engineering. It is designed to store all of your data on a single, open LakeHouse while also unifying all of your analytics and AI workloads.

 

We propose that you assess your requirements and select:

  1. If you want a lot of product knobs at the sacrifice of productivity, use Synapse. To be clear, Azure Synapse Analytics is a collection of products under the same umbrella. It’s like IBM Watson in terms of data processing.
  2. If you want a more refined experience at the sacrifice of certain capabilities, choose Azure Databricks. Databricks, for example, does not provide no-code ML, although AzureML does.
  3. If you want to construct pipelines without writing code, Azure Synapse Analytics is a no-brainer.
  4. Use Azure Databricks for sophisticated analytics, large amounts of data processing, machine learning, and notebooks.

To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima