A Tree Kernel Based Approach for Clone Detection
Anna Corazza (1), Sergio Di Martino (1), Valerio Maggio (1), Giuseppe Scanniello (2)
(1) University of Naples Federico II
(2) University of Basilicata

Outline
► Background
○ Clone detection definition
○ State of the Art Techniques Taxonomy
► Our Abstract Syntax Tree based Proposal
○ A Tree Kernel based approach for clone detection
► A preliminary evaluation

Code Clones
► Two code fragments form a clone if they are similar enough
according to a given measure of similarity (I.D. Baxter, 1998)
► Similarity can be based on Program Text or on “Semantics”
► Program Text clones can be further distinguished by their degree of similarity [1]
○ Type 1 Clone: Exact Copy
○ Type 2 Clone: Parameter Substituted Clone
○ Type 3 Clone: Modified/Structure Substituted Clone
[1] R. Tiarks, R. Koschke, and R. Falke,
An assessment of type-3 clones as detected by state-of-the-art tools

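The three clone types can be illustrated with a minimal sketch. It assumes Python fragments and Python's standard `ast` module as a stand-in parser; the helper names (`normalized_dump`, `clone_type`) are hypothetical and are not part of the tool presented in these slides.

```python
import ast


class _AbstractIdentifiers(ast.NodeTransformer):
    """Replace every variable name and literal with a fixed placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)


def normalized_dump(source, abstract=False):
    """Dump the AST of `source`; layout and comments never reach the AST."""
    tree = ast.parse(source)
    if abstract:
        tree = _AbstractIdentifiers().visit(tree)
    return ast.dump(tree)


def clone_type(a, b):
    """Return 1 or 2 for Type 1 / Type 2 clones, None otherwise."""
    if normalized_dump(a) == normalized_dump(b):
        return 1  # exact copy, up to layout and comments
    if normalized_dump(a, abstract=True) == normalized_dump(b, abstract=True):
        return 2  # same structure, names and literals substituted
    return None  # a Type 3 candidate: needs a similarity measure


original = "x += i + 2"
reformatted = "x  +=  i + 2   # same statement, different layout"
renamed = "total += k + 5"
modified = "total += k + 5 * n"

print(clone_type(original, reformatted))
print(clone_type(original, renamed))
print(clone_type(original, modified))
```

Type 1 and Type 2 reduce to an equality check after normalization. Type 3 cannot be decided this way, which is why a graded similarity measure is needed.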
State of the Art Techniques
► Classified in terms of Program Text representation [2]
○ String, token, syntax tree, control structures, metric vectors
► String/Token based Techniques
► Abstract Syntax Tree (AST) Techniques
► ...
► Combined Techniques (a.k.a. Hybrid)
○ Combine different representations
○ Combine different techniques
○ Combine different sources of information
● Tree Kernel based approach (our approach)
[2] Roy, Cordy, and Koschke, Comparison and Evaluation of Clone Detection Tools and Techniques, 2009

The Proposed Approach
The Goal
► Define an AST based technique able to detect clones up to Type 3
► The Key Ideas:
○ Enrich the information carried by ASTs by also adding
lexical information
○ Define a proper measure to compute similarities among (sub)trees,
exploiting such information
► As a measure we propose the use of a (Tree) Kernel Function

Kernels for Structured Data
► Kernels are a class of functions with many appealing features:
○ They are based on the idea that a complex object can be described in terms of
its constituent parts
○ They can be easily tailored to a specific domain
► There exist different classes of Kernels:
○ String Kernels
○ Graph Kernels
○ …
○ Tree Kernels
● Applied to NLP Parse Trees (Collins and Duffy 2004)

Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the
specification of:
(1) A set of features to annotate the nodes of the compared trees
(2) A (primitive) Kernel Function to measure the similarity of each pair of nodes
(3) A proper Kernel Function to compare subparts of trees

(1) The defined features
► We annotate each AST node with 4 features:
○ Instruction Class
● e.g. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL, ...
○ Instruction
● e.g. FOR, WHILE, IF, RETURN, CONTINUE, ...
○ Context
● Instruction class of the statement in which the node is enclosed
○ Lexemes
● Lexical information within the code

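The first two features can be sketched as a lookup over parser node types. This is an illustration only: it uses Python's standard `ast` module as the parser, and the grouping of instructions into classes in `INSTRUCTION_CLASS` is an assumption, since the slides list the class names and the instruction names but not which instruction belongs to which class. The function name `classify` is hypothetical.

```python
import ast

# Assumed grouping of instructions into instruction classes (not from the slides).
INSTRUCTION_CLASS = {
    "FOR": "LOOP",
    "WHILE": "LOOP",
    "IF": "CONDITIONAL CONTROL",
    "RETURN": "CONTROL FLOW CONTROL",
    "CONTINUE": "CONTROL FLOW CONTROL",
}


def classify(node):
    """Return (instruction class, instruction) for an AST node, or None."""
    instruction = type(node).__name__.upper()  # e.g. ast.While -> "WHILE"
    instruction_class = INSTRUCTION_CLASS.get(instruction)
    if instruction_class is None:
        return None  # node type not covered by this small table
    return instruction_class, instruction


source = """
def f(xs):
    for x in xs:
        if x < 0:
            continue
    return xs
"""

labels = set()
for node in ast.walk(ast.parse(source)):
    label = classify(node)
    if label is not None:
        labels.add(label)

for label in sorted(labels):
    print(label)
```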
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
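The context feature for the three fragments above can be computed by remembering the instruction class of the nearest enclosing statement during a tree walk. The sketch below is an assumption-laden illustration: the fragments are transliterated to Python so that the standard `ast` module can parse them, and `CLASS_OF` and `contexts` are hypothetical names.

```python
import ast

LOOP, CONDITIONAL = "LOOP", "CONDITIONAL CONTROL"

# Assumed instruction classes for the enclosing statements used in the example.
CLASS_OF = {ast.For: LOOP, ast.While: LOOP, ast.If: CONDITIONAL}


def contexts(source):
    """Return the context of every augmented assignment found in `source`."""
    found = []

    def visit(node, context):
        if isinstance(node, ast.AugAssign):
            found.append(context)
        # Children inherit the class of this node if it has one.
        own = CLASS_OF.get(type(node), context)
        for child in ast.iter_child_nodes(node):
            visit(child, own)

    visit(ast.parse(source), None)
    return found


in_for = "for i in range(10):\n    x += i + 2\n"
in_if = "if i < 10:\n    x += i + 2\n"
in_while = "while i < 10:\n    x += i + 2\n"

print(contexts(in_for), contexts(in_if), contexts(in_while))
```

Under this reading, the assignment inside `for` and the one inside `while` share the LOOP context, while the one inside `if` does not, so the first pair would count as more similar.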
Lexemes Feature
► For leaf nodes:
○ It is the lexeme associated with the node
► For internal nodes:
○ It is the set of lexemes recursively propagated from the
subtrees with minimum height

Lexemes Propagation
[Figure: AST with nodes block, while, <, x, 0, block, %=, x, y, return, y. Lexemes are propagated step by step from the leaves to the internal nodes, which end up annotated with the sets {x, 0}, {x, y}, {x, y}, {x, 0, while}, {y, return}, {y, return}.]

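One possible reading of the propagation rule can be sketched as follows. Two things are assumptions inferred from the lexeme sets shown on the slide, not statements from the authors: the shape of the example tree, and the rule that keyword nodes (`while`, `return`) add their own lexeme while operator and block nodes do not. The names `Node`, `height` and `lexemes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    keyword: bool = False  # assumption: keyword nodes contribute their own lexeme


def height(node):
    """Height of a subtree; a leaf has height 0."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def lexemes(node):
    """Leaf: its own lexeme. Internal node: union over minimum-height children."""
    if not node.children:
        return {node.label}
    lowest = min(height(child) for child in node.children)
    result = set()
    for child in node.children:
        if height(child) == lowest:
            result |= lexemes(child)
    if node.keyword:
        result.add(node.label)
    return result


# Assumed tree shape: block(while(<(x, 0), block(%=(x, y))), return(y))
condition = Node("<", [Node("x"), Node("0")])
update = Node("%=", [Node("x"), Node("y")])
loop = Node("while", [condition, Node("block", [update])], keyword=True)
ret = Node("return", [Node("y")], keyword=True)
root = Node("block", [loop, ret])

for node in (condition, update, loop, ret, root):
    print(node.label, sorted(lexemes(node)))
```

With these assumptions the sketch yields the same sets as the slide: the `while` node takes its lexemes from the condition (the shallower child) rather than from the body, and the outer block takes them from the `return` statement.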
(2) Applying features in a Kernel
We exploit these features to compute the similarity between pairs of
nodes, as follows:
► Instruction Class filters comparable nodes
○ We compare only nodes with the same Instruction Class
► Instruction, Context and Lexemes are used to define a
similarity value between the compared nodes

(Primitive) Kernel Function between nodes
s(n1, n2) =
  1.0   if the two nodes have the same values for all features
  0.8   if the two nodes differ in lexemes (same instruction and context)
  0.7   if the two nodes share lexemes and are the same instruction
  0.5   if the two nodes share lexemes and are enclosed in the same context
  0.25  if the two nodes have at least one feature in common
  0.0   no match

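The case table for s(n1, n2) can be written as a chain of checks evaluated from the highest score down. The sketch below rests on interpretations that the slide does not spell out: "share lexemes" is taken to mean that the two lexeme sets overlap, nodes with different instruction classes score 0.0 because they are filtered out beforehand, and "at least one feature" refers to instruction, context or lexemes. The `Features` record is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    instruction_class: str
    instruction: str
    context: str
    lexemes: frozenset


def s(n1, n2):
    """Primitive similarity between two annotated nodes (assumed reading)."""
    if n1.instruction_class != n2.instruction_class:
        return 0.0  # filtered: only same-class nodes are comparable
    same_instruction = n1.instruction == n2.instruction
    same_context = n1.context == n2.context
    same_lexemes = n1.lexemes == n2.lexemes
    share_lexemes = bool(n1.lexemes & n2.lexemes)
    if same_instruction and same_context and same_lexemes:
        return 1.0
    if same_instruction and same_context:
        return 0.8
    if share_lexemes and same_instruction:
        return 0.7
    if share_lexemes and same_context:
        return 0.5
    if same_instruction or same_context or share_lexemes:
        return 0.25
    return 0.0


a = Features("LOOP", "WHILE", "LOOP", frozenset({"x", "0"}))
b = Features("LOOP", "WHILE", "LOOP", frozenset({"y", "1"}))
c = Features("LOOP", "WHILE", "CONDITIONAL CONTROL", frozenset({"x"}))
d = Features("LOOP", "FOR", "LOOP", frozenset({"x"}))

print(s(a, a), s(a, b), s(a, c), s(a, d))
```

Ordering the checks from most to least specific guarantees that each pair receives the highest score it qualifies for.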
(3) Tree Kernel: Kernel on entire Tree Structures
► We apply the node comparison recursively to compute the
similarity between subtrees
► We aim to identify the maximum isomorphic
tree/subtree

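The slides do not give the formula of the tree kernel, so the following is only a simplified positional similarity that shows how a node-level score can be applied recursively. It is not the authors' kernel: children are paired by position, no search for the maximum isomorphic subtree is performed, and `node_score` is a stand-in primitive that only compares labels. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def node_score(a, b):
    """Stand-in primitive kernel: 1.0 for equal labels, 0.0 otherwise."""
    return 1.0 if a.label == b.label else 0.0


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def matched(a, b):
    """Sum node scores over the two trees, pairing children by position."""
    score = node_score(a, b)
    if score == 0.0:
        return 0.0  # stop descending when the roots do not match
    return score + sum(matched(x, y) for x, y in zip(a.children, b.children))


def similarity(a, b):
    """Matched score normalized by the size of the larger tree, in [0, 1]."""
    return matched(a, b) / max(size(a), size(b))


def loop_tree(constant):
    """Build while(<(x, constant), block(%=(x, y)))."""
    condition = Node("<", [Node("x"), Node(constant)])
    body = Node("block", [Node("%=", [Node("x"), Node("y")])])
    return Node("while", [condition, body])


t1, t2 = loop_tree("0"), loop_tree("1")
print(similarity(t1, t1), similarity(t1, t2))
```

Normalizing by the larger tree keeps the value comparable across method bodies of different sizes, which matters once a single threshold is applied to all pairs.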
Overall Process
1. Preprocessing
2. Extraction
3. Match Detection
4. Aggregation

A Preliminary Evaluation
Evaluation Description
► We considered a small Java software system
○ We chose to identify clones at method level
► We checked the system for the presence of clones up to Type 3
○ We removed all detected clones through refactoring operations
► We manually and randomly injected a set of artificially created clones
○ One set for each type of clone
► We applied our prototype and CloneDigger* to the mutated systems
► We evaluated performance in terms of Precision, Recall and F1
* http://clonedigger.sourceforge.net/

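Precision, Recall and F1 for an injected-clone experiment can be computed from two sets of method pairs: the pairs reported by a tool and the pairs that were injected. The sketch below uses the standard definitions; the function name and the method identifiers in the example are hypothetical and are not data from the evaluation.

```python
def precision_recall_f1(detected, injected):
    """Compare reported clone pairs against the injected (reference) pairs."""
    detected, injected = set(detected), set(injected)
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy pairs for illustration only.
injected = {("A.m1", "B.m2"), ("C.m3", "D.m4")}
detected = {("A.m1", "B.m2"), ("E.m5", "F.m6")}

print(precision_recall_f1(detected, injected))
```

Precision penalizes reported pairs that were never injected, recall penalizes injected pairs that were missed, and F1 is their harmonic mean.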
Results (1)
► Type 1 and Type 2 Clones:
○ We were able to detect all clones without any false
positives
○ CloneDigger obtained the same result
○ Both tools showed the potential of AST-based
approaches

Results (2)
► Type 3 clones:
○ We classified results as “true Type 3 clones” according to
different thresholds on similarity values
○ We measured performance at each threshold
► We obtained the best results with a threshold of 0.70

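Selecting a threshold amounts to classifying every scored pair at each candidate value and keeping the value with the best F1. The sketch below shows that loop on toy numbers chosen only for illustration; the scores, pairs and candidate thresholds are hypothetical and are not the evaluation data, and the function names are assumptions.

```python
def report(scores, threshold):
    """Pairs whose similarity reaches the threshold are reported as clones."""
    return {pair for pair, score in scores.items() if score >= threshold}


def f1(detected, reference):
    """Harmonic mean of precision and recall against the reference pairs."""
    true_positives = len(detected & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, reference, thresholds):
    """Return the candidate threshold with the highest F1."""
    return max(thresholds, key=lambda t: f1(report(scores, t), reference))


# Toy similarity scores and reference clones, for illustration only.
scores = {("a", "b"): 0.9, ("c", "d"): 0.65, ("e", "f"): 0.55, ("g", "h"): 0.4}
reference = {("a", "b"), ("c", "d")}

print(best_threshold(scores, reference, [0.5, 0.6, 0.8]))
```

A low threshold admits false positives and a high one misses true clones, so F1 typically peaks at an intermediate value.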
Conclusions and Future Work
► Measure performance on real systems and projects
○ Bellon's Benchmark
○ Investigate the best results obtained with 0.7 as threshold
○ Measure time performance
► Improve the scalability of the approach
○ Avoid comparing all pairs
► Improve similarity computation
○ Avoid manual weighting of features
► Extend Supported Languages
○ Currently we support Java, C, Python

Thank you for listening.
Questions?
