Revisiting the Notion of Diversity
in Software Testing
Lionel Briand
SBFT 2023 Keynote
http://www.lbriand.info
Why Diversity?
• Diverse test cases
  • Exercise the system to the largest extent possible within a budget
  • Increase the probability of fault detection
• While working with incomplete knowledge
  • Cost of acquiring information
  • Missing information
Example: Fuzzing with AFL
Diversity mechanisms: Mutation, coverage
Credits: Antonio Morales, https://github.com/antonio-morales/Fuzzing101
Aspects of Diversity
Inputs -> SUT -> Outputs
Execution (internal):
- Structural coverage
- Model coverage (e.g., states)
Questions
• What aspects of diversity to focus on?
  • Information access
  • Information cost, e.g., execution time
  • Context-dependent
• How to measure diversity?
  • Representation (e.g., inputs)
  • Distance measure, e.g., cosine, edit
  • Computational cost
  • Guidance, e.g., in search
• How to maximize diversity?
  • Mutation, metaheuristic search, symbolic execution …
  • Issues: cost, scalability, bias, effectiveness
Aspects of Diversity
• Inputs: No instrumentation, does not require the execution of the SUT
• Outputs: No instrumentation, execution required but directly characterizes the behavior of the SUT
• Internal SUT structure: Instrumentation, possibly modeling, additional execution cost and significant data storage
Example: Testing DNNs
• Redundant or invalid inputs
• Labeling cost is high
  • Domain-specific knowledge is required to manually label test inputs
• Cost of test execution can be high
• Coverage ineffective
• Test selection based on inputs
Aghababaeyan et al., 2023
Example: Testing DNNs
We want to test a DNN model with a fixed test budget.
• How can we automatically select a candidate test subset with high fault-revealing power to test DNNs?
• Black-box test selection based on input diversity.
Test inputs T -> Black-box test selection method -> Subset S ⊆ T
Example: Testing DNNs
• No model execution
• No access to model internals or training set
• Studies show that proposed coverage measures for DNNs are not associated with faults
• Solution: Geometric diversity of image features
Extracting Image Features
• VGG16 is a convolutional neural network trained on a subset of the ImageNet dataset, a collection of over 14 million images belonging to 22,000 categories.
Features:
- Activation values after the last convolutional layer
- Characterize semantic elements such as shapes and colors
Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V, the geometric diversity of a subset S ⊆ X is defined as the hyper-volume of the parallelepiped spanned by the rows of V_S, i.e., the feature vectors of the items in S. The larger the volume, the more diverse the feature space of S.
Aghababaeyan et al., 2023
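The volume in the definition above can be illustrated with a small sketch. One standard way to obtain the (squared) volume of the parallelepiped spanned by the rows of V_S is the determinant of the Gram matrix V_S V_S^T. The code below assumes that formulation and, as a further assumption made here for numerical stability, reports its logarithm. The function names and the toy feature vectors are hypothetical and are not taken from the slides.

```python
import math

def gram_matrix(vectors):
    """Return G = V V^T for a list of equal-length feature vectors (rows of V)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def log_abs_det(matrix, eps=1e-12):
    """Log of |det(matrix)| via Gaussian elimination with partial pivoting.

    Returns -inf when the matrix is numerically singular.
    """
    n = len(matrix)
    m = [row[:] for row in matrix]  # work on a copy
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < eps:
            return float("-inf")
        m[col], m[pivot] = m[pivot], m[col]
        log_det += math.log(abs(m[col][col]))
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return log_det

def geometric_diversity(vectors):
    """Assumed score: log det(V_S V_S^T), the log of the squared spanned volume."""
    return log_abs_det(gram_matrix(vectors))

# Toy 3-dimensional "features" (hypothetical values, not VGG16 activations).
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # orthogonal rows
similar = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]]   # nearly parallel rows
print(geometric_diversity(diverse), geometric_diversity(similar))
```

Nearly parallel feature vectors span a small volume and therefore score lower than orthogonal ones; a subset containing duplicated vectors spans zero volume and scores negative infinity.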
Measuring Diversity
• Representation and measure: Construct validity?
• Cost of computing diversity
• Guidance provided by diversity, e.g., test selection search
Example: Test Minimization
• Permanently remove redundant test cases in a test suite that are unlikely to detect new faults
• Black-box versus white-box techniques
• FAST-R: Quick and black-box, but low fault detection rates
• ATM: Abstract Syntax Tree (AST)-based Test case Minimizer
  • Motivation: Achieve a better trade-off between effectiveness and efficiency than FAST-R
  • Context: Minimization only applied to major releases
Example: ATM
• Representation: AST of pre-processed test code
• Tree similarity measures: top-down, bottom-up, combined, edit distance
• Common subtree isomorphism algorithms
• Top-down and bottom-up emphasize different aspects of similarity between ASTs
Pipeline: Test suite -> Pre-process test code -> Transform test code to ASTs -> Measure test case similarity (4 tree-based similarity measures) -> Run search algorithms (GA & NSGA-II) -> Minimized test suite
Pan et al., 2023
Example: ATM
• Alternatives evaluated in terms of Fault Detection Rate (FDR)
• Edit distance is expensive but offers good guidance
• Combined similarity not significantly different
• Multi-objective search more expensive
• Much higher fault detection than FAST-R, at the cost of significantly higher execution time, though still practical up to a point

Search    Similarity measure(s)            FDR    Time
GA        Top-Down                         0.78   70.87
GA        Bottom-Up                        0.74   67.05
GA        Combined                         0.80   72.75
GA        Tree Edit Distance               0.81   82.23
NSGA-II   Top-Down & Bottom-Up             0.78   235.41
NSGA-II   Combined & Tree Edit Distance    0.82   258.44
Example: Input Diversity in DNNs
• Alternative diversity measures: Geometric Diversity (GD), Normalized Compression Distance (NCD), standard deviation (STD)
• Construct validity?
• Analysis:
  • We study how diversity scores change while varying the number of classes or concepts inside the images of the input sets.
  • We assume that diversity scores should increase with the number of classes or concepts that are present in an input set.
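Of the measures listed above, NCD is the easiest to illustrate in a few lines. The sketch below shows the common pairwise form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Using zlib as the compressor is an assumption made here, and the byte strings are hypothetical stand-ins for inputs. How the measure is applied to whole input sets in the study is not shown on the slide and is not reproduced here.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Pairwise Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical inputs: a highly repetitive string and a string with no repeats
# inside its first 256 bytes.
a = b"abc" * 200
b = bytes(range(256)) * 3
print(ncd(a, a), ncd(a, b))
```

Concatenating an input with itself adds little to the compressed size, so the distance of an input to itself is small, while two inputs that share no structure yield a distance close to 1.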
Example: Input Diversity in DNNs
• Geometric Diversity shows a clear monotonic relationship with the number of classes in the input set
[Figure 8: Evolution of the diversity scores for input sets from Cifar-10 and MNIST. Each boxplot shows the distribution of diversity scores of 20 input sets of size 100. Panels: (a) GD, (b) STD, (c) NCD on Cifar-10; (d) GD, (e) STD, (f) NCD on MNIST.]
Source: Aghababaeyan et al., 2023, IEEE Transactions on Software Engineering (author's version), DOI 10.1109/TSE.2023.3243522
Applications
• Test minimization, selection, prioritization
• Mutation analysis
• Identify boundaries in the input space, e.g., safe vs unsafe
MASS: CPS Mutation Testing
Pipeline (steps 1 to 7 in the figure): Collect test data (code coverage) -> Create mutants -> Compile mutants (mutants successfully compiled) -> Remove equivalent/duplicate mutants based on compiler optimizations (unique mutants) -> Sample mutants (sampled mutants) -> Execute prioritized subset of test cases -> Evaluate mutation score's confidence
Outputs: Killed mutants, live mutants
Cornejo et al., 2021
• Selection and prioritization of test cases based on statement coverage
• Test suite prioritization:
  • Greedy algorithm
  • Select first the test case that differs most from the most similar, already selected test case
• Test suite reduction: exclude test cases with perfect similarity
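The greedy prioritization and the reduction rule described above can be sketched as follows. This is a minimal illustration, not the MASS implementation: the Jaccard distance over covered statement sets, the choice of the first test in sorted order as the starting point, and all names and data are assumptions made for the example.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def reduce_suite(coverage: dict, distance=jaccard_distance) -> list:
    """Drop every test whose coverage is identical (distance 0) to a kept test."""
    kept = []
    for name in sorted(coverage):
        if all(distance(coverage[name], coverage[k]) > 0.0 for k in kept):
            kept.append(name)
    return kept

def prioritize(coverage: dict, distance=jaccard_distance) -> list:
    """Greedy ordering: next pick the test farthest from its closest selected test."""
    remaining = sorted(coverage)       # deterministic order (assumption)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]       # arbitrary starting test (assumption)
    while remaining:
        best = max(
            remaining,
            key=lambda t: min(distance(coverage[t], coverage[s]) for s in ordered),
        )
        remaining.remove(best)
        ordered.append(best)
    return ordered

# Hypothetical statement coverage: test name -> set of covered statement ids.
coverage = {
    "t1": {1, 2, 3},
    "t2": {1, 2, 3, 4},
    "t3": {7, 8},
    "t4": {1, 2},
    "t5": {1, 2, 3},   # same coverage as t1
}
kept = reduce_suite(coverage)
order = prioritize({name: coverage[name] for name in kept})
print(kept, order)
```

In this toy suite, t5 is dropped because it covers exactly the same statements as t1, and t3 is scheduled right after the starting test because it shares no statements with it.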
MASS: CPS Mutation Testing
• Compare the sets of source code statements that have been covered by test cases: Jaccard and Ochiai
• Compare the number of times each statement has been covered by test cases: Euclidean, cosine
• Focus on functions in the source file where the mutated statement is located
• Best: Cosine distance
  • Difference in mutation score < 5%
  • Reduction in mutation analysis time > 70%
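The four distances named above have standard textbook definitions, sketched below. Jaccard and Ochiai operate on sets of covered statements, while Euclidean and cosine operate on per-statement execution counts. The data layout (sets and dictionaries keyed by statement id) and the handling of empty inputs are assumptions for illustration and are not taken from the MASS tooling.

```python
import math

def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b| over sets of covered statements."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def ochiai_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / sqrt(|a| * |b|) over sets of covered statements."""
    if not a or not b:
        return 0.0 if a == b else 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def euclidean_distance(u: dict, v: dict) -> float:
    """Euclidean distance between statement -> execution-count mappings."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))

def cosine_distance(u: dict, v: dict) -> float:
    """1 - cosine similarity between statement -> execution-count mappings."""
    dot = sum(count * v.get(k, 0) for k, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0 if norm_u == norm_v else 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical coverage data for two test cases.
set_a, set_b = {1, 2, 3}, {2, 3, 4}
counts_a, counts_b = {1: 2, 2: 1}, {1: 4, 2: 2}
print(jaccard_distance(set_a, set_b), ochiai_distance(set_a, set_b))
print(euclidean_distance(counts_a, counts_b), cosine_distance(counts_a, counts_b))
```

The last pair shows one property worth checking when choosing a measure: the second test executes the same statements in the same proportions, only twice as often, so its cosine distance to the first is zero while its Euclidean distance is not.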
Explanations for DNN Errors (SEDE)
Can we explain DNN failures on real-world images using simulator parameters?
[Figure: the DNN is trained and tested on simulator images (training set, test set), yielding a trained DNN; it is then fine-tuned and tested on real-world images (training set, test set), yielding a fine-tuned DNN and the real-world error-inducing images.]
SEDE
Step 1. Identify root-cause clusters (RCCs)
Step 2. Generate images associated to RCCs
  Step 2.1. Identify RCC prototype images
  Step 2.2. Generate a set of unsafe images belonging to the cluster
  Step 2.3. Generate one safe image for each unsafe image
[Figure elements: real-world error-inducing images, HUDD, RCCs, RCC prototype images, evolutionary algorithms (PaiR), simulator, configuration parameters, simulator images]
HUDD (Fahmy et al., 2021)
  Step 1. Heatmap-based clustering of error-inducing test set images into root-cause clusters (C1, C2, C3)
  Step 2. Inspection of a subset of cluster elements
  Cluster 1 (angle ~157.5): borderline cases
  Cluster 2 (near closed eyes): incomplete training set
23
Synthetic
images
Parameters-based Description
Improved
DNN
model
Retraining: +18.6%
Real-World
Images
HeadPosex > 10
& HeadPosey > 50.34
Real-world images
Diverse simulator images, within the cluster
Diverse failing simulator images, close to these images
Passing simulator images, close to failing ones in cluster
S1
S2
S3
Cluster
Process: Simulator-based Explanations for DNN Errors (SEDE)
Real-world
Error-inducing images HUDD
Evolutionary
Algorithms
Simulator
Simulator
images
Configuration
Parameters
RCC Prototype Images
Step 1. Identify root-cause clusters (RCCs)
Step 2. Generate images associated to RCCs
RCCs
Step 2.1. Identify RCC Prototype Images
Step 2.2. Generate a set of unsafe images belonging to the cluster
Step 2.3. Generate one safe image for each unsafe image
PaiR
Fahmy et al. 2022
WCET for Critical Tasks
• Real-time systems
• Schedulability analysis verifies time constraints for critical tasks
• Early schedulability analysis and design decisions require early task Worst Case Execution Time (WCET) estimates
• Challenges: Tasks not fully implemented, worst-case inputs unknown
• Goal: Estimating Probabilistic Safe WCET Ranges at Design Stages
SAFE: WCET boundaries
• Safe WCET boundaries: implementation objectives, evaluate design options
• Iterative, distance-based sampling of WCET values within ranges
Phase 1. Worst-case task arrivals analysis: Task descriptions -> Search -> Worst-case sequences of task arrivals
Phase 2. Safe WCET computation: Training dataset -> Learning -> Safe vs. unsafe regions (axes: WCET T1, WCET T2)
SAFE: Safe WCET Analysis method For real-time task schEdulability (Lee et al., 2022)
Use of Language Models
• Code (e.g., test) or trace vector representation (encoding)
• Benefit from pre-trained language models, e.g., CodeBERT
• Example: test case prioritization (Test2Vec, Jabbar et al., 2022), minimization
  • Embedding test execution traces with fine-tuned CodeBERT
  • Fine-tuned with pass/fail labels for past test cases in a system
  • Test2Vec maps test execution traces, i.e., sequences of method calls with their inputs and return values, to fixed-length, numerical vectors
  • Heuristic: Similarity to previous failing test cases in the same project
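The similarity heuristic above can be sketched once every test case is available as a fixed-length vector. The code below ranks candidate tests by their highest cosine similarity to any previously failing test. The embeddings are small hand-made vectors used purely for illustration; no CodeBERT model is involved, and the function names are hypothetical rather than part of Test2Vec.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prioritize_by_failure_similarity(candidates: dict, failing: list) -> list:
    """Order test names by decreasing similarity to the closest failing embedding."""
    def score(name):
        return max(
            (cosine_similarity(candidates[name], f) for f in failing),
            default=0.0,
        )
    return sorted(candidates, key=score, reverse=True)

# Hypothetical 2-dimensional embeddings of test execution traces.
candidates = {"test_a": [1.0, 0.0], "test_b": [0.0, 1.0], "test_c": [1.0, 1.0]}
previously_failing = [[0.0, 1.0]]
ranking = prioritize_by_failure_similarity(candidates, previously_failing)
print(ranking)
```

Taking the maximum over failing tests is one possible design choice; averaging the similarities would instead favor tests that resemble many past failures rather than a single one.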
Test2Vec Architecture
Preprocessing (abstraction) -> Embedding -> Prediction (prioritization)
Conclusions
• Many applications of diversity in testing
• Various aspects warrant different solutions: information access, execution cost, instrumentation cost
• Trade-off between representations (test cases) and distance/similarity measures: computation cost, guidance
• Determining the best solution can only be done empirically, in a well-defined (application) context
• Check assumptions and properties of distance/similarity measures, e.g., desired sensitivity to change
• Scalability is usually the stumbling block for many applications
Selected References
• Cornejo et al., "Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results in the Space Domain", IEEE Transactions on Software Engineering, 2021
• Fahmy et al., "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning", IEEE Transactions on Reliability, Special Section on Quality Assurance of Machine Learning Systems, 2021
• Fahmy et al., "Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-based Safety-critical Systems", ACM TOSEM, 2022
• Attaoui et al., "Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering", ACM TOSEM, 2022
• Pan et al., "ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolutionary Search", IEEE/ACM ICSE, 2023
• Lee et al., "Estimating Probabilistic Safe WCET Ranges of Real-time Systems at Design Stages", ACM TOSEM, 2022
• Jabbar et al., "Test2Vec: An Execution Trace Embedding for Test Case Prioritization", arXiv, 2022
• Aghababaeyan et al., "Black-Box Testing of Deep Neural Networks through Test Case Diversity", IEEE Transactions on Software Engineering, 2023
• Aghababaeyan et al., "DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks", https://arxiv.org/abs/2303.04878
Looking for Postdocs!
