Webinar
Mike Calabrese
Team Lead/Senior Engineer
Bill Hayduk
Founder/CEO
Creating a Data Validation
& Testing Strategy
Copyright Real-Time Technology Solutions, Inc. 2019 CONFIDENTIAL – DO NOT distribute
Facts
Founded:
1996 (24th anniversary)
Location:
New York City (HQ)
Customer profile:
• Fortune 500 & mid-size
• 700+ customers
Strategic Partners:
IBM, Microsoft, Oracle,
Teradata, Cloudera,
HortonWorks, MongoDB,
SAP, Micro Focus
Other Software
Supported
QuerySurge, Selenium,
Appium, CitraTest,
Postman, Smart Bear,
JMeter, others
RTTS is the premier pure-play QA & Testing firm
that specializes in Test Automation
Data Validation
Data Testing Strategies
Intro
Assessment
Case Study
Data Validation Assessment by RTTS
Big Impacts of Big Data
Handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Facebook handles 40 billion photos from its user base.
Google processes 1 Terabyte per hour
Twitter processes 85 million tweets per day
eBay processes 80 Terabytes per day
others
DWH, BI, Big Data Marketplaces
Data Warehouse Marketplace
“the worldwide data warehouse management software market is forecast
to generate nearly $17 billion in revenue by 2020” - Forrester
Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
Business Intelligence Marketplace
“The business intelligence (BI) and analytics software market is forecast to grow to
$22.8 billion by the end of 2020” - Gartner
Top vendors: SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
Big Data Marketplace
“By the end of 2020, companies will spend > USD $72 billion on Big Data
hardware, software, & professional services” - IDC
Top vendors: Oracle, IBM, Microsoft, Amazon, Micro Focus, HortonWorks, Cloudera, Teradata,
SAP, MongoDB, MapR, DataStax, Snowflake.
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart → Business Intelligence (BI) & Analytics]
Impacts of Bad Data
“On average, poor data quality costs organizations $14.2 million
annually.”
“Dirty data costs the average business 15% to 25% of revenue.”
“Cleaning up data will lead to average cost savings of 33%, while
boosting revenue by an average of 31%.”
What is Data Validation?
Data Validation Testing
The process of verifying your data is completely and accurately moved
through your systems according to the business requirements.
[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH]
DATA VALIDATION TEST TYPES
• Data Completeness
Verifying that all data has been loaded from the sources to the target Data Warehouse.
• Data Transformation
Ensuring that all data has been transformed correctly during the extract-transform-load (ETL) process.
• Data Quality
Ensuring that the ETL process correctly rejects, substitutes default values for, corrects, or ignores invalid data, and reports it.
• BI Report Testing
Validating that the correct data displays in BI reports: reports are formatted correctly, calculated fields are validated, and data is verified against the underlying data.
• BI Performance Testing
Ensuring that BI reports can be generated in a reasonable amount of time.
Finding Bad Data
• Missing Data
Description: Data that does not make it into the target database
Possible causes: Invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins
• Truncation of Data
Description: Data being lost by truncation of the data field
Possible causes: Invalid field lengths on target database; transformation logic not considering field lengths from source
• Data Type Mismatch
Description: Data types not set up correctly on target database
Possible causes: Source data field not configured correctly
• Null Translation
Description: Null source values not being transformed to correct target values
Possible causes: Development team did not include the null translation in the transformation logic
• Wrong Translation
Description: Opposite of the Null Translation error. Field should be null but is populated with a non-null value, or field should be populated but holds the wrong value
Possible causes: Development team incorrectly translated the source field for certain values
• Misplaced Data
Description: Source data fields not being transformed to the correct target data field
Possible causes: Development team inadvertently mapped the source data field to the wrong target data field
• Extra Records
Description: Records which should not be in the ETL are included in the ETL
Possible causes: Development team did not include a filter in their code
• Not Enough Records
Description: Records which should be in the ETL are not included in the ETL
Possible causes: Development team had a filter in their code which should not have been there
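Several of the issues above can be detected by comparing source and target rows on a shared key. The sketch below is illustrative only: the row layout, the `find_issues` helper, and the rule that a null status must load as "UNKNOWN" are assumptions, not part of the original deck. It also simplifies "Extra Records" to mean any target row whose key has no source row.

```python
# Hypothetical source and target rows keyed by "id"; in practice these would
# come from queries against the source system and the target warehouse.
source = [
    {"id": 1, "city": "New York City", "status": None},
    {"id": 2, "city": "San Francisco", "status": "A"},
    {"id": 3, "city": "Boston", "status": "I"},
]
target = [
    {"id": 1, "city": "New York C", "status": None},  # truncated, null not translated
    {"id": 2, "city": "San Francisco", "status": "A"},
    {"id": 4, "city": "Chicago", "status": "A"},       # no matching source row
]

NULL_DEFAULT = "UNKNOWN"  # assumed mapping rule: a null status loads as this value


def find_issues(source_rows, target_rows):
    """Return a sorted list of (issue, id) pairs found by comparing the row sets."""
    src = {r["id"]: r for r in source_rows}
    tgt = {r["id"]: r for r in target_rows}
    issues = []
    for key in src.keys() - tgt.keys():
        issues.append(("Missing Data", key))
    for key in tgt.keys() - src.keys():
        issues.append(("Extra Records", key))
    for key in src.keys() & tgt.keys():
        s, t = src[key], tgt[key]
        # Truncation: the target value is a strict prefix of the source value.
        if s["city"] != t["city"] and s["city"].startswith(t["city"]):
            issues.append(("Truncation of Data", key))
        # Null Translation: a source null was not replaced by the agreed default.
        if s["status"] is None and t["status"] != NULL_DEFAULT:
            issues.append(("Null Translation", key))
    return sorted(issues)


issues = find_issues(source, target)
for issue, key in issues:
    print(f"{issue}: id={key}")
```

Keying both sides by the same identifier keeps each check independent, so further checks (data type, wrong translation) can be added without touching the existing ones.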
Finding Bad Data (cont.)
• Transformation Logic Errors/Holes
Description: Testing can sometimes uncover “holes” in the transformation logic or reveal that the logic is unclear
Possible causes: Development team did not take special cases into account. For example, international cities that contain language-specific special characters might need to be handled in the ETL code
• Simple/Small Errors
Description: Capitalization, spacing and other small errors
Possible causes: Development team did not add a space after a comma when populating the target field
• Sequence Generator
Description: Ensuring that the sequence numbers of reports are in the correct order is very important when processing follow-up reports or responding to an audit
Possible causes: Development team did not configure the sequence generator correctly, resulting in records with a duplicate sequence number
• Undocumented Requirements
Description: Requirements that are “understood” but are not actually documented anywhere
Possible causes: Several members of the development team did not understand the “understood” undocumented requirements
• Duplicate Records
Description: Duplicate records are two or more records that contain the same data
Possible causes: Development team did not add the appropriate code to filter out duplicate records
• Numeric Field Precision
Description: Numbers that are not formatted to the correct decimal place or not rounded per specifications
Possible causes: Development team rounded the numbers to the wrong decimal place
• Rejected Rows
Description: Data rows that get rejected due to data issues
Possible causes: Development team did not take into account data conditions that could break the ETL for a particular row
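Three of these issues (duplicate sequence numbers, duplicate records, numeric precision) lend themselves to simple checks on the target rows alone. The following is a minimal sketch; the row layout, the helper names, and the two-decimal-place specification are assumptions made for illustration.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical target rows; amounts are kept as strings so the stored
# number of decimal places can be inspected exactly.
rows = [
    {"seq": 1, "customer": "Acme", "amount": "10.50"},
    {"seq": 2, "customer": "Globex", "amount": "7.125"},  # three decimal places
    {"seq": 2, "customer": "Initech", "amount": "3.00"},  # duplicate sequence number
    {"seq": 4, "customer": "Acme", "amount": "10.50"},    # same data as the first row
]

EXPECTED_SCALE = 2  # assumed specification: amounts carry two decimal places


def duplicate_sequences(rows):
    """Sequence numbers that appear on more than one row."""
    counts = Counter(r["seq"] for r in rows)
    return sorted(seq for seq, n in counts.items() if n > 1)


def duplicate_records(rows, fields):
    """Value combinations of `fields` that appear on more than one row."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return sorted(values for values, n in counts.items() if n > 1)


def wrong_precision(rows, scale):
    """Sequence numbers of rows whose amount does not have `scale` decimal places."""
    return [
        r["seq"]
        for r in rows
        if -Decimal(r["amount"]).as_tuple().exponent != scale
    ]


dup_seqs = duplicate_sequences(rows)
dup_recs = duplicate_records(rows, ("customer", "amount"))
bad_precision = wrong_precision(rows, EXPECTED_SCALE)
print(dup_seqs, dup_recs, bad_precision)
```

`Decimal` is used instead of `float` because a float cannot distinguish "3.00" from "3.0", which is exactly the formatting difference under test.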
Challenges
• How much data needs to be validated/tested?
• How do I ensure I am testing the proper data
permutations?
• What are the critical data endpoints that need
to be tested?
• How do I verify that the data from my various
source systems is propagating through the
architecture?
• How do I validate data in cloud environments?
• Is bad data making it into the architecture?
• How much of the data testing can be automated?
[Chart: COST plotted across project phases: Data Mapping, Development, Unit Testing, QA Test Cycle, UAT Testing, End User]
Solutions
Finding Bad Data
• Identify testing points
• Review data mappings
• Data Testing Strategies
• comparisons (source vs. target)
• row counts
• minus queries
• automation tools
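Row counts and minus queries from the list above can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `src_customer` and `tgt_customer` tables and their contents are invented for the example, and SQLite spells the minus operator `EXCEPT` (Oracle uses `MINUS`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, name TEXT);
    CREATE TABLE tgt_customer (id INTEGER, name TEXT);
    INSERT INTO src_customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO tgt_customer VALUES (1, 'Ann'), (2, 'Rob'), (3, 'Cy');
""")


def row_count(table):
    # Table names are hard-coded in this sketch, so formatting them in is safe.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


source_count = row_count("src_customer")
target_count = row_count("tgt_customer")

# Minus query in both directions: rows in source but not in target, and
# rows in target but not in source.
source_minus_target = conn.execute(
    "SELECT id, name FROM src_customer EXCEPT SELECT id, name FROM tgt_customer"
).fetchall()
target_minus_source = conn.execute(
    "SELECT id, name FROM tgt_customer EXCEPT SELECT id, name FROM src_customer"
).fetchall()

print("row counts match:", source_count == target_count)
print("source minus target:", source_minus_target)
print("target minus source:", target_minus_source)
```

In this data the row counts match even though one name was loaded incorrectly, which is why row counts alone are a weak check and the minus query is run in both directions.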
Solutions
Data Testing Permutations
• Analyze the data mappings
• Develop a test Data Set
o Review Transformation Logic
▪ Case Statements
▪ Field Merges/ Field Splitting
▪ Translations (Lookups)
▪ Derived
Test Data Generation
• Replication of production data
• Homegrown or Freeware
• Enterprise solutions
o IBM InfoSphere Optim, GenRocket, SAP, Computer Associates
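One homegrown way to build a test data set that covers the permutations in a transformation is to enumerate every combination of the input values each branch depends on. The sketch below assumes a hypothetical rule (`transform`) with one lookup translation and one case statement; the field names and values are illustrative only.

```python
from itertools import product


def transform(status_code, country):
    """Hypothetical transformation under test: a lookup plus a case statement."""
    status = {"A": "Active", "I": "Inactive"}.get(status_code, "Unknown")
    region = "Domestic" if country == "US" else "International"
    return f"{status}/{region}"


# Representative values per input, including an unmapped code and a null.
status_codes = ["A", "I", "X", None]
countries = ["US", "FR"]

# One test row per permutation of the input values.
test_data = [
    {"status_code": s, "country": c} for s, c in product(status_codes, countries)
]

# Expected target values, derived from the rule, to compare against ETL output.
expected = [transform(row["status_code"], row["country"]) for row in test_data]
distinct_outcomes = sorted(set(expected))

print(len(test_data), "test rows covering", len(distinct_outcomes), "outcomes")
```

Enumerating permutations grows quickly as inputs are added, so in practice the value lists are usually limited to one representative per branch of the mapping logic.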
Solutions
How much data to validate?
• Requirements
• Regulatory authorities may require 100% of your data be tested.
• In other cases, 90% or 80% may be the goal.
• Time, resource and scope driven
• Release timeline
• Available resources
• Scope of authoring and executing tests
• Risk Assessment
• Business Acceptance Criteria – End users define their primary data use cases.
• Critical Path – Validate the data that flows through the high priority data
endpoints within your system.
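\[
\frac{\text{Test authoring time total}}{\#\text{ of resources} \times (\#\text{ of hours per day authoring per resource})} = \#\text{ of days}
\]
\[
\frac{\text{Test execution time total}}{\#\text{ of resources} \times (\#\text{ of hours per day executing per resource})} = \#\text{ of days}
\]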
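Both estimates divide the total effort by the team's daily capacity. A short worked example follows; the hour and headcount figures are made up for illustration, and `days_needed` is a hypothetical helper name.

```python
import math


def days_needed(total_hours, resources, hours_per_day_per_resource):
    """Total effort divided by the team's daily capacity, in days."""
    if resources <= 0 or hours_per_day_per_resource <= 0:
        raise ValueError("resources and hours per day must be positive")
    return total_hours / (resources * hours_per_day_per_resource)


# Illustrative numbers only: 240 hours of test authoring and 90 hours of
# test execution, shared by 3 testers.
authoring_days = days_needed(240, resources=3, hours_per_day_per_resource=4)
execution_days = days_needed(90, resources=3, hours_per_day_per_resource=2)

# Round the combined estimate up to whole working days.
total_days = math.ceil(authoring_days + execution_days)
print(authoring_days, execution_days, total_days)
```

Comparing `total_days` with the release timeline shows whether scope, headcount, or daily hours have to change.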
Solutions
Automation vs Manual
• Recurrence
• Avoid complicated single use test cases
• Focus on repeatable testing paths
• Ensure modularization of test data sets
• Test Data Sets
• Confirm that the hardware resources assigned to the automation tool, and its
performance, can handle the load of the data set under test
• Include time needed to prepare environments into your testing estimates
• Database Performance
• Set expectations on database hardware & responsiveness.
• SQL query response time will factor into overall test run times
Solutions
How do I test data in my cloud environment?
• On-Prem vs Cloud
o Follow the same testing methodologies but with considerations for cloud
connections and scalability
o If an automated solution is being pursued, confirm the tools involved
allow for connectivity to your cloud environment
• Hybrid-Cloud Mapping
o Interface documentation
o Define entry & exit points if applicable
• Digital Transformation
o Clearly defined conversion
requirements and mappings
• Environment Scalability
• Define limitations on testing environment resources
Data Validation Assessment
What are the goals of a
Data Validation assessment?
• Receive an expert evaluation of your
current data validation process
• Provide recommendations on how to
improve your process
• Proposal for successful implementation
of your goals
Data Validation Assessment
Components of the Assessment
• Business analysis
• Data architecture analysis
• ETL testing process evaluation
• DataOps & DevOps evaluation
• Resource evaluation (optional)
• Metrics evaluation
• Risk assessment
Data Validation Assessment
Interview with Key Players
• Business/Data Analysts create requirements
• QA Testers develop and execute test plans and
test cases
• Architects set up environments
• Developers create ETL code, perform unit tests
• DBAs test for performance and stress
• Business Users perform functional User
Acceptance Tests
Data Validation Assessment
Process Review
• Review Requirements & Mapping documentation
• Testing Process Design
• Analysis of tools and DevOps/DataOps
• Reporting metrics evaluations
Data Validation Assessment
Deliverables
• Detailed analysis report with recommendations
for improvement
• Presentation to your team on our findings
• Proposal for successful implementation of your
goals
Data Testing - Developer & Tester
• ETL Developer: Codes data movement based on Mapping Requirements
• Data Tester: Tests data movement based on Mapping Requirements
• BI Analyst extracts data for reports
• Tester tests BI Reports
[Diagram: Source Data, Big Data lake, Data Warehouse, Data Mart, and BI & Analytics connected by ETL steps. Testing Points #1, #2, and #3 are marked along the ETL flow and Testing Point #4 at BI & Analytics.]
Data Requirements = Mapping Document
Source-to-Target Map
The source-to-target map is the critical element required to efficiently plan the target Data Stores. It also defines the Extract, Transform, Load (ETL) process.
Intention:
✓ capture business rules
✓ data flow mapping
✓ data movement requirements
Mapping Doc specifies:
▪ Source input definition
▪ Target/output details
▪ Business & data transformation rules
▪ Absolute data quality requirements
▪ Optional data quality requirements
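A mapping document can also be captured as data, so the same source-to-target rules drive the tests. The sketch below is one possible encoding, not a prescribed format: the `FieldMapping` structure, the field names, and the rules are hypothetical, with `required` standing in for the absolute versus optional data quality requirements.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def clean_name(value: Optional[str]) -> Optional[str]:
    """Business rule: trim whitespace and title-case the customer name."""
    return value.strip().title() if value is not None else None


def translate_status(value: Optional[str]) -> Optional[str]:
    """Translation (lookup) rule for status codes."""
    return {"A": "Active", "I": "Inactive"}.get(value)


def passthrough(value: Optional[str]) -> Optional[str]:
    return value


@dataclass(frozen=True)
class FieldMapping:
    source_field: str                                  # source input definition
    target_field: str                                  # target/output detail
    rule: Callable[[Optional[str]], Optional[str]]     # transformation rule
    required: bool                                     # absolute vs optional requirement


MAPPINGS = [
    FieldMapping("cust_nm", "customer_name", clean_name, required=True),
    FieldMapping("st_cd", "status", translate_status, required=True),
    FieldMapping("phone", "phone_number", passthrough, required=False),
]


def check_row(source_row, target_row):
    """Compare one target row with what the mapping says it should contain."""
    failures = []
    for m in MAPPINGS:
        expected = m.rule(source_row.get(m.source_field))
        actual = target_row.get(m.target_field)
        if m.required and actual is None:
            failures.append(f"{m.target_field}: required value is missing")
        elif actual != expected:
            failures.append(f"{m.target_field}: expected {expected!r}, got {actual!r}")
    return failures


source_row = {"cust_nm": "  acme corp ", "st_cd": "A", "phone": None}
good_target = {"customer_name": "Acme Corp", "status": "Active", "phone_number": None}
bad_target = {"customer_name": "acme corp", "status": None, "phone_number": None}

good_failures = check_row(source_row, good_target)
bad_failures = check_row(source_row, bad_target)
print(good_failures)
print(bad_failures)
```

Keeping the rules next to the field definitions means a change to the mapping document updates developer and tester expectations in one place.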
Data Testing Strategies
Testing Methods
Minus Queries – Create a SQL source query and a SQL Target query. Utilizing SQL, subtract
source query results from target query results and subtract target query results from
source query results
Visual Compare – View source data and target
data and manually compare
Record Counts – Create a SQL source query and a SQL target query that each
return a record count, then compare the two values
Automation – Utilizing an automation tool to compare SQL source and target query results
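The automation method can be approximated with a small harness that runs source and target query pairs and reports the differences. This is a sketch of the idea only, not how any particular tool works; the `run_query_pair` helper, the table names, and the cents-to-dollars rule are assumptions for the example.

```python
import sqlite3
from collections import Counter


def run_query_pair(conn, name, source_sql, target_sql):
    """Run both queries and compare their results as multisets of rows."""
    src = Counter(conn.execute(source_sql).fetchall())
    tgt = Counter(conn.execute(target_sql).fetchall())
    return {
        "test": name,
        "passed": src == tgt,
        "missing_in_target": sorted((src - tgt).elements()),
        "unexpected_in_target": sorted((tgt - src).elements()),
    }


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount_cents INTEGER);
    CREATE TABLE tgt_orders (id INTEGER, amount_dollars REAL);
    INSERT INTO src_orders VALUES (1, 1250), (2, 300), (3, 775);
    INSERT INTO tgt_orders VALUES (1, 12.5), (2, 3.0), (3, 7.0);
""")

# Each test case pairs a source query (with the expected transformation
# applied) with the matching target query.
test_cases = [
    ("order_count",
     "SELECT COUNT(*) FROM src_orders",
     "SELECT COUNT(*) FROM tgt_orders"),
    ("order_amounts",
     "SELECT id, amount_cents / 100.0 FROM src_orders",
     "SELECT id, amount_dollars FROM tgt_orders"),
]

results = [run_query_pair(conn, *case) for case in test_cases]
for result in results:
    status = "PASS" if result["passed"] else "FAIL"
    print(status, result["test"], result["missing_in_target"], result["unexpected_in_target"])
```

Comparing multisets rather than sets means duplicate rows count as differences, which a plain minus query would not report.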
Data Maturity Model - Test Execution
Level 1: Sampling
Sampling a % of data by visually comparing data sets. Not repeatable.
Level 2: Excel, Ad Hoc Reporting
Using Excel or other homegrown method. Ad hoc reporting.
Level 3: Minus Queries
Utilizing SQL editor & minus queries to test data. More detailed reporting.
Level 4: Data Test Automation
Repeatable test automation, agreed-upon process, centralized reporting.
Level 5: Data Quality Optimizing
Full automation, tracking of ROI, predictive data issues, auditable results. Business value is fully understood/supported by management.
On which Level should your process be?
Case Study
CASE STUDY OVERVIEW
A company in the financial industry had a development and QA team assigned to
their ETL process. But there were still issues:
• They were still suffering from incorrect data fields populating their Business Intelligence (BI) reports
• Development cycles were frequently delayed
• Management was losing confidence in the BI reporting data
Case Study
Senior RTTS resources were brought in to assess the process
• Interview key players
• Review process documentation and tools
Problem areas identified:
• Minimal requirements
• Ticketing system was not being implemented for traceability
• Testing process of low-level maturity
o Table row counts
o Sampling
o Excel comparisons
Resource needs:
Case Study
Recommendations for Improvement
• Centralized mapping documentation
o Linking requirements to work item tickets to test cases
• Improve communications between team members; we recommended a new Data Analyst role
• Narrowed focus of the stand-up meetings
• Implemented automated solutions to expand coverage for larger data sets
DEMO:
Automating your data validation & testing
Any questions?
Creating a Data Validation & Testing Strategy
Ad

Recommended

What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
RTTS
 
DATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing Plan
Madhu Nepal
 
ETL Using Informatica Power Center
ETL Using Informatica Power Center
Edureka!
 
Informatica PowerCenter
Informatica PowerCenter
Ramy Mahrous
 
Informatica Tutorial For Beginners | Informatica Powercenter Tutorial | Edureka
Informatica Tutorial For Beginners | Informatica Powercenter Tutorial | Edureka
Edureka!
 
QuerySurge - the automated Data Testing solution
QuerySurge - the automated Data Testing solution
RTTS
 
The data quality challenge
The data quality challenge
Lenia Miltiadous
 
ETL Process
ETL Process
Rohin Rangnekar
 
ETL Testing Training Presentation
ETL Testing Training Presentation
Apurba Biswas
 
ETL
ETL
Mallikarjuna G D
 
How to build a successful Data Lake
How to build a successful Data Lake
DataWorks Summit/Hadoop Summit
 
Data Engineering Basics
Data Engineering Basics
Catherine Kimani
 
What is ETL?
What is ETL?
Ismail El Gayar
 
Getting started with Tableau
Getting started with Tableau
Parth Acharya
 
Data warehouse
Data warehouse
shachibattar
 
Talend Open Studio Data Integration
Talend Open Studio Data Integration
Roberto Marchetto
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Introduction to ETL and Data Integration
Introduction to ETL and Data Integration
CloverDX (formerly known as CloverETL)
 
Data warehouse architecture
Data warehouse architecture
pcherukumalla
 
Data warehouse concepts
Data warehouse concepts
obieefans
 
Data warehouse
Data warehouse
Medma Infomatix (P) Ltd.
 
What is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data Wharehouse
BugRaptors
 
Etl testing
Etl testing
Sandip Patil
 
ETL QA
ETL QA
dillip kar
 
Designing An Enterprise Data Fabric
Designing An Enterprise Data Fabric
Alan McSweeney
 
Data Mesh
Data Mesh
Piethein Strengholt
 
Ppt
Ppt
bullsrockr666
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
HostedbyConfluent
 
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Cognizant
 
Etl testing strategies
Etl testing strategies
sivam_1
 

More Related Content

What's hot (20)

ETL Testing Training Presentation
ETL Testing Training Presentation
Apurba Biswas
 
ETL
ETL
Mallikarjuna G D
 
How to build a successful Data Lake
How to build a successful Data Lake
DataWorks Summit/Hadoop Summit
 
Data Engineering Basics
Data Engineering Basics
Catherine Kimani
 
What is ETL?
What is ETL?
Ismail El Gayar
 
Getting started with Tableau
Getting started with Tableau
Parth Acharya
 
Data warehouse
Data warehouse
shachibattar
 
Talend Open Studio Data Integration
Talend Open Studio Data Integration
Roberto Marchetto
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Introduction to ETL and Data Integration
Introduction to ETL and Data Integration
CloverDX (formerly known as CloverETL)
 
Data warehouse architecture
Data warehouse architecture
pcherukumalla
 
Data warehouse concepts
Data warehouse concepts
obieefans
 
Data warehouse
Data warehouse
Medma Infomatix (P) Ltd.
 
What is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data Wharehouse
BugRaptors
 
Etl testing
Etl testing
Sandip Patil
 
ETL QA
ETL QA
dillip kar
 
Designing An Enterprise Data Fabric
Designing An Enterprise Data Fabric
Alan McSweeney
 
Data Mesh
Data Mesh
Piethein Strengholt
 
Ppt
Ppt
bullsrockr666
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
HostedbyConfluent
 
ETL Testing Training Presentation
ETL Testing Training Presentation
Apurba Biswas
 
Getting started with Tableau
Getting started with Tableau
Parth Acharya
 
Talend Open Studio Data Integration
Talend Open Studio Data Integration
Roberto Marchetto
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Data warehouse architecture
Data warehouse architecture
pcherukumalla
 
Data warehouse concepts
Data warehouse concepts
obieefans
 
What is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data Wharehouse
BugRaptors
 
Designing An Enterprise Data Fabric
Designing An Enterprise Data Fabric
Alan McSweeney
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
HostedbyConfluent
 

Similar to Creating a Data validation and Testing Strategy (20)

Leveraging Automated Data Validation to Reduce Software Development Timeline...
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Cognizant
 
Etl testing strategies
Etl testing strategies
sivam_1
 
Deliver Trusted Data by Leveraging ETL Testing
Deliver Trusted Data by Leveraging ETL Testing
Cognizant
 
Data Verification In QA Department Final
Data Verification In QA Department Final
Wayne Yaddow
 
DWBI Testing and Analytics Testing Services
DWBI Testing and Analytics Testing Services
CODETRU Software Solutions
 
Introduction to ETL process
Introduction to ETL process
Omid Vahdaty
 
ETL Testing Services - Safeguard Your Data
ETL Testing Services - Safeguard Your Data
BugRaptors
 
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
RTTS
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data quality
JaveriaGauhar
 
Top 20 ETL Testing Interview Questions.pdf
Top 20 ETL Testing Interview Questions.pdf
AnanthReddy38
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
Etl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large Applications
Wayne Yaddow
 
Data Quality at the Speed of Work
Data Quality at the Speed of Work
TechWell
 
Automate ETL Testing, Data Warehouse & Migration Testing The Agile Way - iceDQ
Automate ETL Testing, Data Warehouse & Migration Testing The Agile Way - iceDQ
iceDQ
 
What are the characteristics and objectives of ETL testing_.docx
What are the characteristics and objectives of ETL testing_.docx
Technogeeks
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOps
Steven Ensslen
 
Completing the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = Success
RTTS
 
Visionbi Quality Gates
Visionbi Quality Gates
Ram Yonish
 
Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality 
Precisely
 
Data Warehouse Testing—The Next Opportunity for QA Leaders
Data Warehouse Testing—The Next Opportunity for QA Leaders
Tricentis
 
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Leveraging Automated Data Validation to Reduce Software Development Timeline...
Cognizant
 
Etl testing strategies
Etl testing strategies
sivam_1
 
Deliver Trusted Data by Leveraging ETL Testing
Deliver Trusted Data by Leveraging ETL Testing
Cognizant
 
Data Verification In QA Department Final
Data Verification In QA Department Final
Wayne Yaddow
 
Introduction to ETL process
Introduction to ETL process
Omid Vahdaty
 
ETL Testing Services - Safeguard Your Data
ETL Testing Services - Safeguard Your Data
BugRaptors
 
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
Enterprise Business Intelligence & Data Warehousing: The Data Quality Conundrum
RTTS
 
Data quality testing – a quick checklist to measure and improve data quality
Data quality testing – a quick checklist to measure and improve data quality
JaveriaGauhar
 
Top 20 ETL Testing Interview Questions.pdf
Top 20 ETL Testing Interview Questions.pdf
AnanthReddy38
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
Etl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large Applications
Wayne Yaddow
 
Data Quality at the Speed of Work
Data Quality at the Speed of Work
TechWell
 
Automate ETL Testing, Data Warehouse & Migration Testing The Agile Way - iceDQ
Automate ETL Testing, Data Warehouse & Migration Testing The Agile Way - iceDQ
iceDQ
 
What are the characteristics and objectives of ETL testing_.docx
What are the characteristics and objectives of ETL testing_.docx
Technogeeks
 
Measuring Data Quality with DataOps
Measuring Data Quality with DataOps
Steven Ensslen
 
Completing the Data Equation: Test Data + Data Validation = Success
Completing the Data Equation: Test Data + Data Validation = Success
RTTS
 
Visionbi Quality Gates
Visionbi Quality Gates
Ram Yonish
 
Transform Your Downstream Cloud Analytics with Data Quality 
Transform Your Downstream Cloud Analytics with Data Quality 
Precisely
 
Data Warehouse Testing—The Next Opportunity for QA Leaders
Data Warehouse Testing—The Next Opportunity for QA Leaders
Tricentis
 
Ad

More from RTTS (20)

Leveraging AI to Simplify and Speed Up ETL Testing
Leveraging AI to Simplify and Speed Up ETL Testing
RTTS
 
Improving Automated Testing Projects with UFT
Improving Automated Testing Projects with UFT
RTTS
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Automated Testing of Microsoft Power BI Reports
Automated Testing of Microsoft Power BI Reports
RTTS
 
QuerySurge AI webinar
QuerySurge AI webinar
RTTS
 
State of the Market - Data Quality in 2023
State of the Market - Data Quality in 2023
RTTS
 
TestGuild and QuerySurge Presentation -DevOps for Data Testing
TestGuild and QuerySurge Presentation -DevOps for Data Testing
RTTS
 
Creating a Project Plan for a Data Warehouse Testing Assignment
Creating a Project Plan for a Data Warehouse Testing Assignment
RTTS
 
RTTS Postman and API Testing Webinar Slides.pdf
RTTS Postman and API Testing Webinar Slides.pdf
RTTS
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
RTTS
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
RTTS
 
Webinar - QuerySurge and Azure DevOps in the Azure Cloud
Webinar - QuerySurge and Azure DevOps in the Azure Cloud
RTTS
 
Implementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing Project
RTTS
 
An introduction to QuerySurge webinar
An introduction to QuerySurge webinar
RTTS
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
RTTS
 
the Data World Distilled
the Data World Distilled
RTTS
 
QuerySurge for DevOps
QuerySurge for DevOps
RTTS
 
Leveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE Vertica
RTTS
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
RTTS
 
Leveraging AI to Simplify and Speed Up ETL Testing
Leveraging AI to Simplify and Speed Up ETL Testing
RTTS
 
Improving Automated Testing Projects with UFT
Improving Automated Testing Projects with UFT
RTTS
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Automated Testing of Microsoft Power BI Reports
Automated Testing of Microsoft Power BI Reports
RTTS
 
QuerySurge AI webinar
QuerySurge AI webinar
RTTS
 
State of the Market - Data Quality in 2023
State of the Market - Data Quality in 2023
RTTS
 
TestGuild and QuerySurge Presentation -DevOps for Data Testing
TestGuild and QuerySurge Presentation -DevOps for Data Testing
RTTS
 
Creating a Project Plan for a Data Warehouse Testing Assignment
Creating a Project Plan for a Data Warehouse Testing Assignment
RTTS
 
RTTS Postman and API Testing Webinar Slides.pdf
RTTS Postman and API Testing Webinar Slides.pdf
RTTS
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
RTTS
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
RTTS
 
Webinar - QuerySurge and Azure DevOps in the Azure Cloud
Webinar - QuerySurge and Azure DevOps in the Azure Cloud
RTTS
 
Implementing Azure DevOps with your Testing Project
Implementing Azure DevOps with your Testing Project
RTTS
 
An introduction to QuerySurge webinar
An introduction to QuerySurge webinar
RTTS
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
RTTS
 
the Data World Distilled
the Data World Distilled
RTTS
 
QuerySurge for DevOps
QuerySurge for DevOps
RTTS
 
Leveraging HPE ALM & QuerySurge to test HPE Vertica
Leveraging HPE ALM & QuerySurge to test HPE Vertica
RTTS
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
RTTS
 
Ad

Recently uploaded (20)

Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
cnc-processing-centers-centateq-p-110-en.pdf
cnc-processing-centers-centateq-p-110-en.pdf
AmirStern2
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
The Future of Technology: 2025-2125 by Saikat Basu.pdf
The Future of Technology: 2025-2125 by Saikat Basu.pdf
Saikat Basu
 
AI vs Human Writing: Can You Tell the Difference?
AI vs Human Writing: Can You Tell the Difference?
Shashi Sathyanarayana, Ph.D
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
revolcs10
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
Lessons Learned from Developing Secure AI Workflows.pdf
Lessons Learned from Developing Secure AI Workflows.pdf
Priyanka Aash
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
Creating a Data Validation and Testing Strategy

  • 1. Webinar Mike Calabrese Team Lead/Senior Engineer Bill Hayduk Founder/CEO Creating a Data Validation & Testing Strategy
  • 2. Copyright Real-Time Technology Solutions, Inc. 2019 CONFIDENTIAL – DO NOT distribute
  • 3. Facts Founded: 1996 (24th anniversary) Location: New York City (HQ) Customer profile: • Fortune 500 & mid-size • 700+ customers Strategic Partners: IBM, Microsoft, Oracle, Teradata, Cloudera, HortonWorks, MongoDB, SAP, Micro Focus Other Software Supported QuerySurge, Selenium, Appium, CitraTest, Postman, Smart Bear, JMeter, others RTTS is the premier pure-play QA & Testing firm that specializes in Test Automation
  • 6. Handles more than 1 million customer transactions every hour. • data imported into databases that contain > 2.5 petabytes of data • the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 40 billion photos from its user base. Google processes 1 Terabyte per hour Twitter processes 85 million tweets per day eBay processes 80 Terabytes per day others Big Impacts of Big Data
  • 7. DWH, BI, Big Data Marketplaces
      • Data Warehouse Marketplace: “the worldwide data warehouse management software market is forecast to generate nearly $17 billion in revenue by 2020” - Forrester. Top vendors: Oracle, Teradata, IBM, Microsoft, SAP, Micro Focus and Amazon
      • Business Intelligence Marketplace: “The business intelligence (BI) and analytics software market is forecast to grow to $22.8 billion by the end of 2020” - Gartner. SAP, IBM, SAS, Microsoft, Oracle, Tableau, Qlik, MicroStrategy, Information Builders
      • Big Data Marketplace: “By the end of 2020, companies will spend > USD $72 billion on Big Data hardware, software, & professional services” - IDC. Oracle, IBM, Microsoft, Amazon, Micro Focus, HortonWorks, Cloudera, Teradata, SAP, MongoDB, MapR, DataStax, Snowflake.
  • 8. [Diagram] Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process → Target DWH → ETL Process → Data Mart → Business Intelligence (BI) & Analytics
  • 9. Impacts of Bad Data “On average, poor data quality costs organizations $14.2 million annually.” “Dirty data costs the average business 15% to 25% of revenue.” “Cleaning up data will lead to average cost savings of 33%, while boosting revenue by an average of 31%.”
  • 10. Intro, Data Validation, Data Testing Strategies, Assessment, Case Study (Data Validation Assessment by RTTS)
  • 11. What is Data Validation? Data Validation Testing: The process of verifying your data is completely and accurately moved through your systems according to the business requirements. [Diagram] Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH
  • 12. Data Validation Testing: DATA VALIDATION TEST TYPES
      • Data Completeness: Verifying that all data has been loaded from the sources to the target Data Warehouse. Validate the correct data displays in BI reports.
      • Data Transformation: Ensuring that all data has been transformed correctly during the extract-transform-load (ETL) process.
      • Data Quality: Ensuring that the ETL process correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
      • BI Report Testing: Verify that BI Reports are formatted correctly, calculated fields are validated, and data is verified against the underlying data.
      • BI Performance Testing: Ensure your BI Reports can be generated in a reasonable amount of time.
  • 13. Finding Bad Data
      Issue | Description | Possible Causes
      Missing Data | Data that does not make it into the target database | Invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins
      Truncation of Data | Data being lost by truncation of the data field | Invalid field lengths on target database; transformation logic not considering field lengths from source
      Data Type Mismatch | Data types not set up correctly on target database | Source data field not configured correctly
      Null Translation | Null source values not being transformed to correct target values | Development team did not include the null translation in the transformation logic
      Wrong Translation | Opposite of the Null Translation error: field should be null but is populated with a non-null value, or field should be populated but has the wrong value | Development team incorrectly translated the source field for certain values
      Misplaced Data | Source data fields not being transformed to the correct target data field | Development team inadvertently mapped the source data field to the wrong target data field
      Extra Records | Records which should not be in the ETL are included in the ETL | Development team did not include a filter in their code
      Not Enough Records | Records which should be in the ETL are not included in the ETL | Development team had a filter in their code which should not have been there
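Several of the issues in this table can be detected with simple source-vs-target queries. The following is a minimal sketch using Python's built-in sqlite3 module. The tables `src` and `tgt`, their columns, and the rule that a NULL source status should load as 'UNKNOWN' are hypothetical, chosen only for illustration; they are not taken from the webinar.

```python
import sqlite3

# Hypothetical source and target tables, loaded in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    INSERT INTO src VALUES
        (1, 'Alexandria Montgomery', 'ACTIVE'),
        (2, 'Bo', NULL),
        (3, 'Cy', 'ACTIVE'),
        (4, 'Dee', NULL);
    INSERT INTO tgt VALUES
        (1, 'Alexandria', 'ACTIVE'),
        (2, 'Bo', NULL),
        (4, 'Dee', 'UNKNOWN');
""")

# One query per issue type; each returns the ids of offending source rows.
CHECKS = {
    # Missing Data: source rows with no matching target row.
    "missing_data": """
        SELECT s.id FROM src s LEFT JOIN tgt t ON t.id = s.id
        WHERE t.id IS NULL ORDER BY s.id""",
    # Truncation of Data: target value is shorter than the source value.
    "truncation": """
        SELECT s.id FROM src s JOIN tgt t ON t.id = s.id
        WHERE LENGTH(t.name) < LENGTH(s.name) ORDER BY s.id""",
    # Null Translation: assumed rule is NULL source status -> 'UNKNOWN'.
    "null_translation": """
        SELECT s.id FROM src s JOIN tgt t ON t.id = s.id
        WHERE s.status IS NULL
          AND (t.status IS NULL OR t.status <> 'UNKNOWN')
        ORDER BY s.id""",
}

def run_checks(connection):
    """Run every check and return {check name: [offending ids]}."""
    return {name: [row[0] for row in connection.execute(sql)]
            for name, sql in CHECKS.items()}

findings = run_checks(conn)
for name, ids in findings.items():
    print(name, ids)
```

In this sample data, row 3 never reaches the target, row 1 has a shortened name, and row 2 keeps a NULL status that the assumed rule says should have been translated.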
  • 14. Finding Bad Data (cont.)
      Issue | Description | Possible Causes
      Transformation Logic Errors/Holes | Testing sometimes can lead to finding “holes” in the transformation logic or realizing the logic is unclear | Development team did not take into account special cases. For example, international cities that contain language-specific special characters might need to be dealt with in the ETL code
      Simple/Small Errors | Capitalization, spacing and other small errors | Development team did not add an additional space after a comma for populating the target field
      Sequence Generator | Ensuring that the sequence numbers of reports are in the correct order is very important when processing follow-up reports or responding to an audit | Development team did not configure the sequence generator correctly, resulting in records with a duplicate sequence number
      Undocumented Requirements | Find requirements that are “understood” but are not actually documented anywhere | Several of the members of the development team did not understand the “understood” undocumented requirements
      Duplicate Records | Duplicate records are two or more records that contain the same data | Development team did not add the appropriate code to filter out duplicate records
      Numeric Field Precision | Numbers that are not formatted to the correct decimal point or not rounded per specifications | Development team rounded the numbers to the wrong decimal point
      Rejected Rows | Data rows that get rejected due to data issues | Development team did not take into account data conditions that could break the ETL for a particular row
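Three of the issues above (duplicate records, duplicate sequence numbers, and numeric field precision) lend themselves to small row-level checks. The sketch below is one possible way to express them in Python. The row layout (sequence number, name, amount as text) and the two-decimal requirement are assumptions made for the example.

```python
from collections import Counter
from decimal import Decimal

def duplicate_rows(rows):
    """Return each row that appears more than once (Duplicate Records)."""
    counts = Counter(rows)
    return sorted(row for row, n in counts.items() if n > 1)

def duplicate_sequence_numbers(seq_numbers):
    """Return sequence numbers used by more than one record."""
    counts = Counter(seq_numbers)
    return sorted(n for n, c in counts.items() if c > 1)

def wrong_precision(amounts, places=2):
    """Return amounts (given as strings) not written with exactly `places` decimals."""
    return [a for a in amounts
            if -Decimal(a).as_tuple().exponent != places]

# Hypothetical target rows: (sequence number, name, amount as text).
rows = [
    (1, "Ann", "10.50"),
    (2, "Bob", "7.5"),
    (2, "Bob", "7.5"),
    (3, "Cy", "3.141"),
]

dupes = duplicate_rows(rows)
dupe_seqs = duplicate_sequence_numbers([r[0] for r in rows])
bad_amounts = wrong_precision([r[2] for r in rows])
print(dupes, dupe_seqs, bad_amounts)
```

Amounts are kept as text here so that the stored formatting can be inspected; a float would lose trailing zeros before the check runs.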
  • 15. Challenges • How much data needs to be validated/tested? • How do I ensure I am testing the proper data permutations? • What are the critical data endpoints that need to be tested? • How do I verify that the data from my various source systems is propagating through the architecture? • How do I validate data in the cloud environments? • Is bad data making it into the architecture? • How much of the data testing can be automated?
  • 16. Solutions: Finding Bad Data
      • Identify testing points
      • Review data mappings
      • Data Testing Strategies
          o comparisons (source vs. target)
          o row counts
          o minus queries
          o automation tools
      [Chart: COST by phase: Data Mapping, Development, Unit Testing, QA Test Cycle, UAT Testing, End User]
  • 17. Solutions
      Data Testing Permutations
      • Analyze the data mappings
      • Develop a test Data Set
          o Review Transformation Logic
              ▪ Case Statements
              ▪ Field Merges/Field Splitting
              ▪ Translations (Lookups)
              ▪ Derived
      Test Data Generation
      • Replication of production data
      • Homegrown or Freeware
      • Enterprise solutions
          o IBM InfoSphere Optim, GenRocket, SAP, Computer Associates
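One homegrown way to build a test data set that covers the permutations of the transformation logic is to list representative values for each input field and take their cross product. The sketch below assumes three hypothetical source fields (a status code that goes through a lookup, an amount that drives a CASE statement, and a name that gets split); none of these fields or values come from the webinar.

```python
from itertools import product

# Representative values per field; each one targets a branch of the
# (hypothetical) transformation logic.
STATUS_VALUES = ["A", "I", None, "??"]      # lookup hit, lookup hit, null, unmapped code
AMOUNT_VALUES = [-1, 0, 999, 1000]          # boundaries of an assumed CASE statement
NAME_VALUES = ["Ann Lee", "São Paulo", ""]  # field split, special characters, empty

def build_test_rows():
    """Return one test row per combination of the representative values."""
    combos = product(STATUS_VALUES, AMOUNT_VALUES, NAME_VALUES)
    return [
        {"id": i, "status": status, "amount": amount, "name": name}
        for i, (status, amount, name) in enumerate(combos, start=1)
    ]

test_rows = build_test_rows()
print(len(test_rows), "rows; first:", test_rows[0])
```

A full cross product grows quickly as fields are added, so in practice the value lists would likely be trimmed to the combinations the mapping document actually distinguishes.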
  • 18. Solutions: How much data to validate? • Requirements • Regulatory authorities may require 100% of your data be tested. • In other cases, 90% or 80% may be the goal. • Time, resource and scope driven • Release timeline • Available resources • Scope of authoring and executing tests • Risk Assessment • Business Acceptance Criteria – End users define their primary data use cases. • Critical Path – Validate the data that flows through the high-priority data endpoints within your system.
      \[ \frac{\text{Test authoring time total}}{\text{\# of resources} \times (\text{\# of hours per day authoring per resource})} = \text{\# of days} \]
      \[ \frac{\text{Test execution time total}}{\text{\# of resources} \times (\text{\# of hours per day executing per resource})} = \text{\# of days} \]
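The two estimates above share the same shape: total effort divided by daily team capacity. A small worked example, with made-up numbers (240 authoring hours, 60 execution hours, 3 testers), might look like this:

```python
def days_needed(total_hours, resources, hours_per_day_per_resource):
    """Total effort in hours divided by the team's capacity in hours per day."""
    if resources <= 0 or hours_per_day_per_resource <= 0:
        raise ValueError("resources and hours per day must be positive")
    return total_hours / (resources * hours_per_day_per_resource)

# Illustrative inputs only: 3 testers, 4 h/day authoring, 2 h/day executing.
authoring_days = days_needed(total_hours=240, resources=3, hours_per_day_per_resource=4)
execution_days = days_needed(total_hours=60, resources=3, hours_per_day_per_resource=2)
print(authoring_days, execution_days)
```

Comparing the two results against the release timeline gives a quick indication of whether the planned test scope fits, or whether scope or resources need to change.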
  • 19. Solutions Automation vs Manual • Recurrence • Avoid complicated single use test cases • Focus on repeatable testing paths • Ensure modularization of test data sets • Test Data Sets • Consider automation tool’s assigned hardware resources and performance which must be able to handle the load of the data set under test • Include time needed to prepare environments into your testing estimates • Database Performance • Set expectations on database hardware & responsiveness. • SQL query response time will factor into overall test run times
  • 20. Solutions: How do I test data in my cloud environment? • On-Prem vs Cloud o Follow the same testing methodologies but with considerations for cloud connections and scalability o If an automated solution is being pursued, confirm the tools involved allow for connectivity to your cloud environment • Hybrid-Cloud Mapping o Interface documentation o Define entry & exit points if applicable • Digital Transformation o Clearly defined conversion requirements and mappings • Environment Scalability • Define limitations on testing environment resources
  • 22. Data Validation Assessment What are the goals of a Data Validation assessment? • Receive an expert evaluation of your current data validation process • Provide recommendations on how to improve your process • Proposal for successful implementation of your goals
  • 23. Data Validation Assessment Components of the Assessment • Business analysis • Data architecture analysis • ETL testing process evaluation • DataOps & DevOps evaluation • Resource evaluation (optional) • Metrics evaluation • Risk assessment
  • 24. Data Validation Assessment Interview with Key Players • Business/Data Analysts create requirements • QA Testers develop and execute test plans and test cases • Architects set up environments • Developers create ETL code, perform unit tests • DBAs test for performance and stress • Business Users perform functional User Acceptance Tests
  • 25. Data Validation Assessment Process Review • Review Requirements & Mapping documentation • Testing Process Design • Analysis of tools and DevOps/DataOps • Reporting metrics evaluations
  • 26. Data Validation Assessment Deliverables • Detailed analysis report with recommendations for improvement • Presentation to your team on our findings • Proposal for successful implementation of your goals
  • 28. Data Testing - Developer & Tester • ETL Developer: Codes data movement based on Mapping Requirements • Data Tester: Tests data movement based on Mapping Requirements • BI Analyst extracts data for reports • Tester tests BI Reports [Diagram: Source Data and Big Data lake, ETL, Data Warehouse, ETL, Data Mart, BI & Analytics, with Testing Points #1 through #4 along the flow]
  • 29. Data Requirements = Mapping Document. Source-to-Target Map: It’s the critical element required to efficiently plan the target Data Stores. It also defines the Extract, Transform, Load (ETL) process. Intention: ✓ capture business rules ✓ data flow mapping ✓ data movement requirements. Mapping Doc specifies: ▪ Source input definition ▪ Target/output details ▪ Business & data transformation rules ▪ Absolute data quality requirements ▪ Optional data quality requirements.
  • 30. Data Testing Strategies: Testing Methods
      • Minus Queries – Create a SQL source query and a SQL target query. Using SQL, subtract the source query results from the target query results, and subtract the target query results from the source query results.
      • Visual Compare – View source data and target data and compare them manually.
      • Record Counts – Create SQL source and target queries that return record counts, and compare the values.
      • Automation – Use an automation tool to compare SQL source and target query results.
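The minus-query and record-count methods can be sketched in a few lines. This example uses Python's built-in sqlite3 module, where the set-difference operator is spelled EXCEPT (Oracle spells it MINUS). The tables `src_customer` and `tgt_customer` and their contents are hypothetical, and both tables live in one database only to keep the example self-contained.

```python
import sqlite3

def minus_both_ways(conn, source_sql, target_sql):
    """Return (rows only in source, rows only in target) via EXCEPT."""
    only_in_source = conn.execute(f"{source_sql} EXCEPT {target_sql}").fetchall()
    only_in_target = conn.execute(f"{target_sql} EXCEPT {source_sql}").fetchall()
    return only_in_source, only_in_target

def record_counts(conn, source_table, target_table):
    """Return (source row count, target row count). Table names are trusted input."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src, tgt

# Hypothetical source and target tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, name TEXT);
    CREATE TABLE tgt_customer (id INTEGER, name TEXT);
    INSERT INTO src_customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO tgt_customer VALUES (1, 'Ann'), (2, 'Bobby'), (4, 'Di');
""")

missing, extra = minus_both_ways(
    conn,
    "SELECT id, name FROM src_customer",
    "SELECT id, name FROM tgt_customer",
)
counts = record_counts(conn, "src_customer", "tgt_customer")
print("only in source:", sorted(missing))
print("only in target:", sorted(extra))
print("counts (source, target):", counts)
```

In this sample the record counts match (3 and 3) even though two rows differ in each direction, which shows why record counts alone are a weak check. Note also that EXCEPT compares distinct rows, so it will not flag a row that is merely duplicated in the target.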
  • 31. Data Maturity Model - Test Execution
      • Level 1, Sampling: Sampling a % of data by visually comparing data sets. Not repeatable.
      • Level 2, Excel, Ad Hoc Reporting: Using Excel or other homegrown method. Ad hoc reporting.
      • Level 3, Minus Queries: Utilizing SQL editor & minus queries to test data. More detailed reporting.
      • Level 4, Data Test Automation: Repeatable test automation, agreed-upon process, centralized reporting.
      • Level 5, Data Quality Optimizing: Full automation, tracking of ROI, predictive data issues, auditable results. Business value is fully understood/supported by management.
      On which Level should your process be?
  • 33. Case Study: CASE STUDY OVERVIEW. A company in the financial industry had a development and QA team assigned to their ETL process. But there were still issues: • They were still suffering from incorrect data fields populating their Business Intelligence (BI) reports • Development cycles were frequently delayed • Management was losing confidence in the BI reporting data
  • 34. Case Study
      Resource needs: Senior RTTS resources were brought in to assess the process
      • Interview key players
      • Review process documentation and tools
      Problem areas identified:
      • Minimal requirements
      • Ticketing system was not being implemented for traceability
      • Testing process of low-level maturity
          o Table row counts
          o Sampling
          o Excel comparisons
  • 35. Case Study: Recommendations for Improvement • Centralized mapping documentation o Linking requirements to work item tickets to test cases • Improve communications between team members; we recommended a new Data Analyst role • Narrowed focus of the stand-up meetings • Implemented automated solutions to expand coverage for larger data sets
  • 36. DEMO: Automating your data validation & testing
  • 37. Any questions? Creating a Data Validation & Testing Strategy