DataWorks Data Quality helps you maintain high data quality by detecting changes in source data and identifying dirty data generated during the extract, transform, and load (ETL) process. It can automatically block problematic tasks to prevent dirty data from spreading to downstream nodes. This prevents unexpected data issues that could affect your operations and business decisions, reducing the time and resource costs associated with rerunning tasks and correcting data.
Billing
Data Quality checks data quality by using monitoring rules. The fees generated for Data Quality checks consist of the following two parts:
Fees included in your DataWorks bills
You are charged by DataWorks based on the number of Data Quality checks. For more information, see Billing of Data Quality.
Fees not included in your DataWorks bills
You are charged by the compute engines that are associated with your DataWorks workspace. When monitoring rules are triggered, SQL statements are generated and executed by specific compute engines.
In this case, you are charged for the computing resources consumed by the compute engines. For more information, see the topic about billing for each type of compute engine. For example, if you associate a pay-as-you-go MaxCompute project with your DataWorks workspace, you are charged when the generated SQL statements are executed, and the fees are included in your MaxCompute bills instead of your DataWorks bills.
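To make the second fee type concrete: when a monitoring rule is triggered, DataWorks generates a SQL statement and submits it to the associated compute engine, which bills the query like any other. The sketch below is illustrative only (hypothetical table and rule; sqlite3 stands in for an engine such as MaxCompute):

```python
import sqlite3

# Hypothetical monitored table; on a real workspace this query would run
# on the associated compute engine and be billed by that engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwd_orders (order_id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO dwd_orders VALUES (?, ?)",
    [(1, "20240101"), (2, "20240101"), (3, "20240102")],
)

# The kind of SQL a simple "table row count" rule might generate,
# scoped to a single business date.
check_sql = "SELECT COUNT(*) FROM dwd_orders WHERE dt = '20240101'"
row_count = conn.execute(check_sql).fetchone()[0]
print(row_count)  # 2
```

The cost of a check therefore scales with the data the generated query scans, not with DataWorks itself.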
Features
You can configure quality monitoring rules across multiple dimensions, including completeness, accuracy, validity, consistency, uniqueness, and timeliness. These rules can be associated with scheduling nodes so that, after a task finishes running, the quality checks are automatically triggered, which allows you to detect problematic data at the earliest opportunity. You can also set rule severity levels to control whether a task fails and stops. This helps prevent the spread of dirty data and significantly reduces the time and cost required for data recovery.
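As an illustration of two of these dimensions, the sketch below (hypothetical table and column names; sqlite3 in place of a real compute engine) computes a completeness metric (non-null rate) and a uniqueness metric (duplicate key count) with plain SQL:

```python
import sqlite3

# Hypothetical table used to illustrate two monitoring dimensions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com")],
)

# Completeness: share of rows with a non-null email.
total, non_null = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM users"
).fetchone()
completeness = non_null / total

# Uniqueness: number of id values that appear more than once.
dup_ids = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT id FROM users GROUP BY id HAVING COUNT(*) > 1)"
).fetchone()[0]

print(round(completeness, 2), dup_ids)  # 0.67 1
```

A monitoring rule compares metrics like these against a configured threshold to decide whether the check passes.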
The features of each Data Quality module are described as follows:
| Feature | Description |
| --- | --- |
| Dashboard | The Dashboard page displays an overview of data quality in your workspace. It includes key data quality metrics, trends and distribution of rule check instances, tables with the most data quality issues, issue owners, and the coverage status of monitoring rules. This helps data quality owners understand the overall data quality status of the workspace and promptly handle issues to improve data quality. |
| Quality Assets | View all configured monitoring rules. Manage user-defined rule templates to improve the efficiency of rule configuration. |
| Configure Rules | Configure a monitoring rule for a single table or for multiple tables based on a rule template. |
| Quality O&M | View all monitors created in the current workspace. View the results of monitors. After a monitor runs, you can view its details on this page. |
| Quality Analysis | Create report templates and add metrics related to rule configuration and execution. Reports are generated and sent regularly based on the defined reporting period, dispatch time, and subscription details. |
Usage notes
The following table describes the supported data source types and the regions in which each type is supported.

| Data source type | Supported regions |
| --- | --- |
| MaxCompute, StarRocks, and MySQL | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia) |
| E-MapReduce | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), and US (Silicon Valley) |
| Hologres | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia) |
| AnalyticDB for PostgreSQL | China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), and Japan (Tokyo) |
| AnalyticDB for MySQL | China (Shenzhen), Singapore, and US (Silicon Valley) |
| CDH | China (Shanghai), China (Beijing), China (Zhangjiakou), China (Hong Kong), and Germany (Frankfurt) |
Before you configure monitoring rules for E-MapReduce, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, CDH, StarRocks, or MySQL data sources, you must first collect their metadata. For more information, see Collect metadata from an EMR data source.
For a monitoring rule on a table from an E-MapReduce, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, CDH, StarRocks, or MySQL data source to be triggered, the scheduling node that generates the data must run on a resource group that is connected to that data source.
You can configure multiple monitoring rules for a table.
Scenarios
In offline data validation scenarios, you configure a monitoring rule for a table by specifying a partition filter expression and associating the rule with the scheduling node that generates the table's data. After the node runs, the monitoring rule is triggered to check the data in the partition that matches the filter expression. Note that dry-run tasks do not trigger monitoring rules. You can configure the rule as a strong or weak rule to determine whether to cause the task to fail if an anomaly is detected, which prevents dirty data from spreading downstream. On the rule configuration page, you can also specify notification methods to receive prompt alert notifications.
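To make the partition filter expression concrete, the sketch below resolves a `dt=$[yyyymmdd-1]`-style expression against the date on which the node runs. The expression syntax follows DataWorks-style scheduling parameters, but the parsing helper itself is illustrative, not the product's implementation:

```python
import re
from datetime import date, timedelta

def resolve_partition(expr: str, run_date: date) -> str:
    """Resolve a 'dt=$[yyyymmdd-N]'-style partition filter expression.

    Illustrative only: handles a single key and an optional day offset.
    """
    m = re.fullmatch(r"(\w+)=\$\[yyyymmdd(?:-(\d+))?\]", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr}")
    key, offset = m.group(1), int(m.group(2) or 0)
    day = run_date - timedelta(days=offset)
    return f"{key}={day.strftime('%Y%m%d')}"

# The check then runs only on the partition the expression matches.
print(resolve_partition("dt=$[yyyymmdd-1]", date(2024, 1, 2)))  # dt=20240101
```

Because the rule checks only the resolved partition, each run validates the data the node just produced rather than rescanning historical partitions.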
Configure a monitoring rule
Create a monitoring rule: You can create a rule for a single table, or create rules for multiple tables in bulk by using a template. For more information, see Configure a monitoring rule for a single table and Configure a monitoring rule for multiple tables based on a template.
Subscribe to a monitoring rule: After a rule is created, you can subscribe to it to receive alert notifications for data quality checks. The notification methods include Email, Email and SMS, DingTalk Chatbot, DingTalk Chatbot @ALL, Lark Group Chatbot, Enterprise WeChat Chatbot, Custom Webhook, and Telephone.
Note: The Custom Webhook notification method is supported only in DataWorks Enterprise Edition.
Trigger the monitoring rule
After the scheduling node runs in Operation Center, the associated monitoring rule is triggered to check the quality of the data that the node generates. An SQL statement is generated and executed on the relevant compute engine. Based on the rule's strength (strong or weak) and its check result, DataWorks determines whether to cause the task to fail. This blocks downstream nodes from running and prevents dirty data from spreading.
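The pass/block decision can be sketched as follows (hypothetical table, threshold, and outcome labels; sqlite3 stands in for the compute engine): a null-rate check runs on one partition, and the rule's strength decides whether a failed check fails the task:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwd_users (user_id INTEGER, name TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO dwd_users VALUES (?, ?, ?)",
    [(1, "a", "20240101"), (2, None, "20240101"), (3, "c", "20240102")],
)

def run_check(partition: str, threshold: float, strong: bool) -> str:
    """Null-rate check on one partition; return the task outcome."""
    total, nulls = conn.execute(
        "SELECT COUNT(*), SUM(name IS NULL) FROM dwd_users WHERE dt = ?",
        (partition,),
    ).fetchone()
    if nulls / total <= threshold:
        return "passed"
    # A strong rule fails the task so downstream nodes are blocked;
    # a weak rule only raises an alert and lets scheduling continue.
    return "task failed, downstream blocked" if strong else "alert only"

print(run_check("20240101", threshold=0.1, strong=True))
```

The same failed check thus has different consequences depending on the configured strength, which is why strong rules are reserved for anomalies that must not propagate.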
View validation results
You can view validation results on the Monitor page. On the Running Records page, search by table or node to view the validation details of data quality monitoring. For more information, see View the details of a monitor.