International Journal of Computer Applications Technology and Research
Volume 4– Issue 10, 728 - 736, 2015, ISSN: 2319–8656
www.ijcat.com 728
Duplicate Code Detection using Control Statements
Sudhamani M
Department of Computer Science
University of Mysore
Mysore 570006, India
Lalitha Rangarajan
Department of Computer Science
University of Mysore
Mysore 570006, India
Abstract: Code clone detection is an important area of research, as reusability is a key factor in software evolution. Duplicate code degrades the design and structure of software and software qualities such as readability, changeability and maintainability. Code clones increase maintenance cost, as incorrect changes in copied code may lead to more errors. In this paper we address structural code similarity detection and propose new methods to detect structural clones using the structure of control statements. By structure we mean the order of control statements used in the source code. We consider two orders of control structures: (i) the sequence of control statements as they appear, and (ii) the execution flow of control statements.
Keywords: Control statements; Control structure; Execution flow; Similarity value; Structural similarity.
1. INTRODUCTION
Duplicate codes are identical or similar code fragments present in a software program. Two code fragments are similar if they have a similar structure of control statements and a similar control flow between control lines [1, 15].
The different types of code clones are [15]:
Type 1: Identical code fragments except for white space and comments, as shown in the example below.
Ex 1:
Segment 1:
if(n>0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if ( n > 0 )
{
n = n * 1; //multiply by +1
}
else
n = n * -1; // multiply by -1
Type 2: Syntactically similar code fragments except for changes in variable, literal and function names.
Ex 2:
Segment 1:
if (n>0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if ( m > 0 )
{
m = m * 1; //multiply by +1
}
else
m = m * -1; // multiply by -1
Type 3: Similar code fragments with slight modifications like
reordering/addition/deletion of some statements from already
existing or copied code fragments.
Segment 1:
if (n > 0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if (n > 0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
x=5; //newly added statement
In the above example a new statement x=5 is added.
Type 4: Functionally similar code fragments. The example below shows a non-recursive and a recursive way of finding the factorial of n (the same program implemented in two ways).
Ex:
Segment 1:
int i, j=1, n;
for (i=1; i<=n; i++)
j=j*i;
Segment 2:
int fact(int n)
{
if (n == 0) return 1 ;
else return n * fact(n-1) ;
}
The output of a program depends on the execution flow of its effective source lines, and the execution flow depends on the control lines used in the program. The control lines considered here are iterative statements (for, while and do-while), conditional statements (if, if-else and switch-case), and function calls. Here we propose two approaches to find structural similarity. Approach 1 considers the order of control statements present in the code segments, and approach 2 depends on the execution flow of control lines in the program. Figure 1 shows three different ways of writing a bubble sort program. To find the similarity of these programs we compute control structure metrics.
Fig 1: Different versions of bubble sort program
The rest of the paper is organized as follows. Section 2 covers key literature, section 3 describes the proposed methods and results, and section 4 concludes the work with suggestions on possible future work.
2. RELATED WORK
Duplicate code detection mainly consists of two phases: transformation and comparison. In the transformation phase, source code is transformed into an Internal Code Format (ICF). Depending on the ICF used for comparison, match detection techniques are classified as follows [15].
i. String Based: In these techniques source code is treated as a sequence of characters, strings or lines, and string matching techniques are used to detect duplicate code [2]. The Dup tool compares lexemes for string matching and finds partial matches [2, 3, 4]. Ducasse et al. [3] proposed a dynamic matching technique to detect code clones. String based techniques are simple and language independent, and detect type I clones [13, 14, 15, 16].
ii. Token Based: In the token based approach source code is transformed into a sequence of tokens using a lexer/parser. These token sequences are then compared to find duplicate code. This technique detects both type I and type II clones. CCFinder by Kamiya et al. [5] transforms a source file into a set of tokens, forms a single token sequence from them, and uses a suffix tree substring matching algorithm to detect code clones. CP-Miner applies a frequent substring matching algorithm to tokenized statements to find replicated code. SIM compares token sequences using a dynamic programming string alignment technique. Winnowing and JPlag are token based plagiarism detection tools [13, 14, 15, 16].
iii. Tree Based: Source text is parsed with an appropriate parser to obtain an Abstract Syntax Tree (AST) or parse tree. Tree matching techniques are then used to find similar sub trees. This approach efficiently detects type I, type II and type III clones [5, 6]. As the AST does not address data flow between controls, it fails to detect type IV clones. Baxter et al.'s CloneDR finds resemblance between programs by matching sub trees of the corresponding source programs [15].
iv. Graph Based: The source program is converted into a Program Dependency Graph (PDG), which contains the data flow and control flow information of the program [6]. Isomorphic sub graph detection algorithms are then used to find duplicate code. This technique efficiently identifies all types of clones. However, generating the PDG and finding isomorphic sub graphs is NP hard [8]. Komondoor and Horwitz's PDG-DUP uses program slicing to find isomorphic sub graphs, and Krinke uses an iterative approach to detect maximal similar sub graphs. GPLAG is a graph based plagiarism detection tool [11, 16].
v. Metric Based: In this technique different metrics are computed for code fragments and the metric values are compared to find duplicate code [9, 10, 11, 12]. An AST/PDG representation can be used to calculate metrics such as the number of nodes and the number of control edges present in the graph. Other common metrics are the number of source lines, the number of function calls, the number of local and global variables, and McCabe's cyclomatic complexity. eMetric, Covert and Moss are metric based tools [15, 16]. Kontogiannis et al. [17] built an abstract pattern matching tool that identifies probable matches, using Markov models to measure the similarity between two programs.
3. PROPOSED METHOD
Here we propose two approaches to find duplicate code. The stages of the proposed method are preprocessing, metric computation, difference matrix computation and similarity value calculation. The architecture of the proposed method is shown in figure 2, and each stage is explained below.
Preprocessing and template conversion
In the preprocessing stage, extra spaces and comments are removed and the input source program is transformed into its standard intermediate template form. Figure 3 shows the template forms of the versions of the sort program in figure 1. This template is used to compute the control structure metrics.
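The comment and white space removal described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `normalize` and both regular expressions are hypothetical, the stripping is naive (comment markers inside string literals are not protected), and the template conversion itself is not covered because its format is not specified in the text. The two inputs are the Type 1 segments of Ex 1.

```python
import re

# Hypothetical helper, not the authors' implementation.
# Naive on purpose: comment markers inside string literals are not handled.
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source: str) -> str:
    """Drop C-style comments, then re-join the tokens with single spaces."""
    without_comments = COMMENT.sub(" ", source)
    return " ".join(TOKEN.findall(without_comments))


segment_1 = """if(n>0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1"""

segment_2 = """if ( n > 0 )
{
n = n * 1; //multiply by +1
}
else
n = n * -1; // multiply by -1"""

print(normalize(segment_1))
print(normalize(segment_1) == normalize(segment_2))
```

After this step the two Type 1 segments become the same string, which is why white space and comments do not influence the later stages.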
Fig 2: Architecture of proposed method
Fig 3: Templates of sort programs in figure 1
Note that the order and structure of control statements differ across versions. Some versions have function calls and some do not. Yet the proposed approaches can detect duplicates with high accuracy.
3.1 Approach 1 – Computation of similarity using Control Structure Tables (CSTs)
Control Structure Table (CST): The Control Structure Table contains information about the order of the control lines embedded in the program [11]. The CSTs of sort program 1 and sort program 2 in figure 1 are shown in tables 1 and 2.
Table 1. Control structure table for sort program 1
Sl.No  Type of control statement  Loop  Condition
1      Loop                       0     0
2      Loop                       0     0
3      Loop                       1     1
4      Loop                       0     1
5      Condition                  0     0
6      Loop                       0     0
Table 2. Control structure table for sort program 2
Sl.No  Type of control statement  Loop  Condition
1      Loop                       1     1
2      Loop                       0     1
3      Condition                  0     0
4      Loop                       0     0
5      Loop                       0     0
Difference Matrix (D) computation: The difference matrix is calculated from two CSTs and shows the difference between all pairs of control statements. The difference matrix calculated from tables 1 and 2 is shown in table 3. Each row of program 1 (corresponding to a control statement) is compared with every row of program 2. Rows i and j of the two programs are compared using the city block distance |Ri1-Rj1| + |Ri2-Rj2|.
For example, the first row of table 1 is compared with the second row of table 2 by computing |0-0| + |0-1| = 1, which is entered at (1, 2) of the distance matrix (table 3). From this table we can find the similar control lines present in the two programs. A zero in a position corresponding to control statements of the same type indicates structural similarity of those control statements in the two programs. For example, the zero at (3, 1) in table 3 implies that iterative statement 3 of program 1 and iterative statement 1 of program 2 are probably similar, whereas the zero at (5, 3) is not comparable because the control statements are of different types (the fifth control statement of program 1 is conditional and the third control statement of program 2 is iterative). The zeros that contribute to similarity are highlighted.
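The difference matrix computation can be sketched as below. The row encoding `(type, loop, condition)` and the function name `difference_matrix` are assumptions made for illustration; the values are copied from tables 1 and 2.

```python
# CST rows encoded as (type, loop, condition); "L" = loop, "C" = condition.
# Values are copied from tables 1 and 2; the encoding itself is an assumption.
CST_1 = [("L", 0, 0), ("L", 0, 0), ("L", 1, 1), ("L", 0, 1), ("C", 0, 0), ("L", 0, 0)]
CST_2 = [("L", 1, 1), ("L", 0, 1), ("C", 0, 0), ("L", 0, 0), ("L", 0, 0)]


def difference_matrix(cst_a, cst_b):
    """City block distance between every row of cst_a and every row of cst_b."""
    return [
        [abs(a[1] - b[1]) + abs(a[2] - b[2]) for b in cst_b]
        for a in cst_a
    ]


D = difference_matrix(CST_1, CST_2)
for row in D:
    print(row)
```

The entry `D[0][1]` corresponds to position (1, 2) of the distance matrix in the worked example above.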
Table 3. Distance matrix computed from table 1 and 2
Control lines  Loop (L)  Loop (L)  Loop (L)  Cond (C)  Loop (L)
Loop (L)       2         1         0         0         0
Loop (L)       2         1         0         0         0
Loop (L)       0         1         2         2         2
Loop (L)       1         0         1         1         1
Cond (C)       2         1         0         0         0
Loop (L)       2         1         0         0         0
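Similarity between codes is found using the formula
\[
s = \begin{cases} n, & \text{if } r_1 = r_2 \\ \dfrac{n}{|r_1 - r_2|}, & \text{otherwise} \end{cases} \qquad (1)
\]
where n is the number of similar control lines (the zeros of the difference matrix that contribute to similarity) and r1 and r2 are the numbers of control lines in the two programs. From table 3, n = 9 and |r1 - r2| = |6 - 5| = 1, so s = 9/1 = 9.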
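A minimal sketch of formula (1) follows, assuming that n counts the pairs of control statements that have the same type and a city block distance of zero. The function name `similarity` and the row encoding are hypothetical. Statement types are taken from tables 1 and 2; the header of table 3 places the conditional column in a different position, but the count is 9 under either reading.

```python
# CST rows encoded as (type, loop, condition), copied from tables 1 and 2.
CST_1 = [("L", 0, 0), ("L", 0, 0), ("L", 1, 1), ("L", 0, 1), ("C", 0, 0), ("L", 0, 0)]
CST_2 = [("L", 1, 1), ("L", 0, 1), ("C", 0, 0), ("L", 0, 0), ("L", 0, 0)]


def similarity(cst_a, cst_b):
    """Formula (1): s = n if r1 == r2, otherwise n / |r1 - r2|."""
    # n: pairs with the same statement type and zero city block distance
    # (an assumed reading of "zeros that contribute to similarity").
    n = sum(
        1
        for a in cst_a
        for b in cst_b
        if a[0] == b[0] and abs(a[1] - b[1]) + abs(a[2] - b[2]) == 0
    )
    r1, r2 = len(cst_a), len(cst_b)
    return n if r1 == r2 else n / abs(r1 - r2)


print(similarity(CST_1, CST_2))
```

Note that when r1 and r2 differ by exactly one, s equals n, so a pair of programs of nearly equal size is not penalized relative to a pair of exactly equal size.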
We conducted experiments using data set 1 of 5
distinct programs and 15 variants and similarity
values are shown in table 4.
Table 4. Similarity table for data set 1 (s=n/|r1-r2|)
We may observe in table 4 that every program shows its highest similarity only with its own variants.
3.2 Approach 2: Computation of similarity using execution flow of control statements
In the preprocessing stage all functions are placed above the main function. The Function Information Table (FIT) and the CST are generated in a single scan of the program.
Function Information Table (FIT): The FIT gives the starting and ending positions of each function in the CST. Here function calls are considered as control lines. The FITs of sort programs 2 and 3 are shown in tables 5a and 5b. The CSTs of these programs are shown in tables 6a and 6b.
Table 5a. Function Information Table (FIT) for sort
program 2
Sl. No Function name Start position End position
1 Sort 1 3
2 Print 4 4
3 main 5 8
Table 5b. Function Information Table (FIT) for sort
program 3
Sl. No Function name Start position End position
1 Sort 1 2
2 Print 3 3
3 Main 4 8
Line 1 (the first control statement) of program 2 belongs to the function 'sort'. As this is the beginning of the function, its name and start position are entered in the FIT (table 5a). The control statements scanned from line 1 onwards are recorded sequentially in the CST (table 6a) until the end of the function. The end of the function, namely line 3, is then recorded in the FIT. Thus the FIT and CST are generated in one scan.
Execution Flow Control Structure Table (EFCST) is
computed using CST and FIT by replacing the function calls
by control lines of that particular function.
Table 4 (values). Similarity values for data set 1 using CSTs and s=n/|r1-r2|
Programs P1v1 P1v2 P1v3 P1v4 P2v1 P2v2 P2v3 P3v1 P3v2 P3v3 P4v1 P4v2 P4v3 P5v1 P5v2
P1v1 0.00 37.00 37.00 37.00 2.29 2.29 2.29 2.23 2.23 2.23 1.11 1.11 1.11 4.22 4.20
P1v2 37.00 0.00 37.00 37.00 2.29 2.29 2.29 2.23 2.23 2.23 1.11 1.11 1.11 4.22 4.20
P1v3 37.00 37.00 0.00 37.00 2.29 2.29 2.29 2.23 2.23 2.23 1.11 1.11 1.11 4.22 4.20
P1v4 37.00 37.00 37.00 0.00 2.29 2.29 2.29 2.23 2.23 2.23 1.11 1.11 1.11 4.22 4.20
P2v1 2.29 2.29 2.29 2.29 0.00 8.00 8.00 0.83 0.83 0.83 2.50 2.50 2.50 1.13 1.18
P2v2 2.29 2.29 2.29 2.29 8.00 0.00 8.00 0.83 0.83 0.83 2.50 2.50 2.50 1.13 1.18
P2v3 2.29 2.29 2.29 2.29 8.00 8.00 0.00 0.83 0.83 0.83 2.50 2.50 2.50 1.13 1.18
P3v1 2.23 2.23 2.23 2.23 0.83 0.83 0.83 0.00 199.00 199.00 0.61 0.61 0.61 10.92 13.83
P3v2 2.23 2.23 2.23 2.23 0.83 0.83 0.83 199.00 0.00 199.00 0.61 0.61 0.61 10.92 13.83
P3v3 2.23 2.23 2.23 2.23 0.83 0.83 0.83 199.00 199.00 0.00 0.61 0.61 0.61 10.92 13.83
P4v1 1.11 1.11 1.11 1.11 2.50 2.50 2.50 0.61 0.61 0.61 0.00 4.00 4.00 0.83 0.89
P4v2 1.11 1.11 1.11 1.11 2.50 2.50 2.50 0.61 0.61 0.61 4.00 0.00 4.00 0.83 0.89
P4v3 1.11 1.11 1.11 1.11 2.50 2.50 2.50 0.61 0.61 0.61 4.00 4.00 0.00 0.83 0.89
P5v1 4.22 4.22 4.22 4.22 1.13 1.13 1.13 10.92 10.92 10.92 0.83 0.83 0.83 0.00 161.00
P5v2 4.20 4.20 4.20 4.20 1.18 1.18 1.18 13.83 13.83 13.83 0.89 0.89 0.89 161.00 0.00
Table 6a. Control Structure Table (Order) for
program 2 in figure 1
Sl. no Control statement Loop Condition
1 Loop 1 1
2 Loop 0 1
3 Condition 0 0
4 Loop 0 0
5 Loop 0 0
6 Print 0 0
7 Sort 0 0
8 Print 0 0
Table 6b. Control Structure Table (Order) for
program 3 in figure 1
Sl. no Control statement Loop Condition
1 Loop 0 1
2 Condition 0 0
3 Loop 0 0
4 Loop 0 0
5 Print 0 0
6 Loop 0 0
7 Sort 0 0
8 Print 0 0
The EFCST is built by following the execution flow, which starts in 'main'. For program 2, the FIT shows that 'main' starts at line 5 and ends at line 8 of the CST. The entries in these lines are copied to the EFCST. However, if an entry is a function call, the FIT is consulted and the control lines of the called function, from its start position to its end position, are copied to the EFCST in place of the call. The EFCSTs of programs 1, 2 and 3 in figure 1 are shown in table 7.
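The expansion of function calls can be sketched as follows for sort program 2, using the values of tables 5a and 6a. The dictionary layout, the tuple encoding and the function name `expand` are assumptions made for illustration, and the sketch does not guard against recursive functions, which would make it loop indefinitely.

```python
# FIT of sort program 2 (table 5a): function name -> (start, end), 1-based CST positions.
FIT = {"Sort": (1, 3), "Print": (4, 4), "main": (5, 8)}

# CST of sort program 2 (table 6a): (control statement, loop, condition).
CST = [
    ("Loop", 1, 1), ("Loop", 0, 1), ("Condition", 0, 0),   # lines 1-3: Sort
    ("Loop", 0, 0),                                         # line 4: Print
    ("Loop", 0, 0), ("Print", 0, 0), ("Sort", 0, 0), ("Print", 0, 0),  # lines 5-8: main
]


def expand(function, fit, cst):
    """Copy the control lines of `function`, replacing calls by the callee's lines."""
    start, end = fit[function]
    rows = []
    for kind, loop, cond in cst[start - 1:end]:
        if kind in fit:          # the entry is a call to a known function
            rows.extend(expand(kind, fit, cst))
        else:                    # an ordinary loop or condition
            rows.append((kind, loop, cond))
    return rows


EFCST = expand("main", FIT, CST)
for row in EFCST:
    print(row)
```

For this input the result has six rows and agrees with table 7.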
Table 7. EFCST of program 1, 2 and 3
Sl. no Control statement Loop Condition
1 Loop 0 0
2 Loop 0 0
3 Loop 1 1
4 Loop 0 1
5 Condition 0 0
6 Loop 0 0
Difference matrix is computed using two EFCSTs as in
section 3.1 and similarity value is computed using formula
1.
We conducted experiments on data set 1 and the results are shown in table 8.
Table 8. EFCST and s=n/|r1-r2|
P1v1 P1v2 P1v3 P1v4 P2v1 P2v2 P2v3 P3v1 P3v2 P3v3 P4v1 P4v2 P4v3 P5v1 P5v2
P1v1 0.00 36.00 36.00 36.00 2.14 2.14 2.14 2.14 2.14 2.14 1.00 1.00 1.00 3.55 3.30
P1v2 36.00 0.00 36.00 36.00 2.14 2.14 2.14 2.14 2.14 2.14 1.00 1.00 1.00 3.55 3.30
P1v3 36.00 36.00 0.00 36.00 2.14 2.14 2.14 2.14 2.14 2.14 1.00 1.00 1.00 3.55 3.30
P1v4 36.00 36.00 36.00 0.00 2.14 2.14 2.14 2.14 2.14 2.14 1.00 1.00 1.00 3.55 3.30
P2v1 2.14 2.14 2.14 2.14 0.00 7.00 7.00 0.76 0.76 0.76 2.00 2.00 2.00 1.00 0.88
P2v2 2.14 2.14 2.14 2.14 7.00 0.00 7.00 0.76 0.76 0.76 2.00 2.00 2.00 1.00 0.88
P2v3 2.14 2.14 2.14 2.14 7.00 7.00 0.00 0.76 0.76 0.76 2.00 2.00 2.00 1.00 0.88
P3v1 2.14 2.14 2.14 2.14 0.76 0.76 0.76 0.00 196.00 196.00 0.58 0.58 0.58 13.91 9.83
P3v2 2.14 2.14 2.14 2.14 0.76 0.76 0.76 196.00 0.00 196.00 0.58 0.58 0.58 13.91 9.83
P3v3 2.14 2.14 2.14 2.14 0.76 0.76 0.76 196.00 196.00 0.00 0.58 0.58 0.58 13.91 9.83
P4v1 1.00 1.00 1.00 1.00 2.00 2.00 2.00 0.58 0.58 0.58 0.00 3.00 3.00 0.75 0.63
P4v2 1.00 1.00 1.00 1.00 2.00 2.00 2.00 0.58 0.58 0.58 3.00 0.00 3.00 0.75 0.63
P4v3 1.00 1.00 1.00 1.00 2.00 2.00 2.00 0.58 0.58 0.58 3.00 3.00 0.00 0.75 0.63
Here also all programs show high similarity only with versions of the same program.
3.3 Similarity computation using CSTs, EFCSTs and Control Metric Table (CMT)
Control Metric Table (CMT): We compute a control metric table, which contains the total number of iterative and conditional statements present in each program [11]. Table 9 shows the CMT of data set 1 used for our experiment.
Table 9. Control Metric Table for data set 1 (CMT)
Sl. No  Programs        1        2        3        4
                        L   C    L   C    L   C    L   C
1       Beam search     10  2    10  2    10  2    10  2
2       Bubble sort     4   1    4   1    4   1    -   -
3       Min Max         15  19   15  19   15  19   -   -
4       Linear search   2   1    2   1    -   -    -   -
5       Queue           3   18   3   18   3   18   -   -
Computation of similarity value (s): Here the similarity computation is based on the CMT as well as the CST/EFCST. First we generate the CMT and CST for each program. The difference matrix (D) is computed from the respective CSTs as explained in sub sections 3.1 and 3.2.
We compute the similarity between two programs only if they are comparable in terms of the number of loops and conditional statements. When duplicates are created, a variation of more than 20 % in the number of control statements is unlikely. Hence a threshold of 20 % variation in these numbers is fixed for the computation of similarity. Suppose program 1 has x loops and y conditional statements. Program 2 is comparable with program 1 if its numbers of loops and conditional statements are in the ranges [x – 20 % (x), x + 20 % (x)] and [y – 20 % (y), y + 20 % (y)] respectively. Tables 10a and 10b show the similarity values computed with this additional consideration of the CMT.
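The 20 % comparability check can be sketched as below. The function names `within` and `comparable` are hypothetical, the (loops, conditions) pairs are the first-version entries of table 9, and the check is taken relative to program 1 exactly as the range is stated above, so it is not necessarily symmetric.

```python
# (loops, conditions) per program, first-version entries of table 9.
CMT = {
    "Beam search": (10, 2),
    "Bubble sort": (4, 1),
    "Min Max": (15, 19),
    "Linear search": (2, 1),
    "Queue": (3, 18),
}


def within(value, reference, tolerance=0.20):
    """True if value lies in [reference - 20%, reference + 20%]."""
    return reference * (1 - tolerance) <= value <= reference * (1 + tolerance)


def comparable(program_1, program_2, tolerance=0.20):
    """Program 2 is comparable with program 1 if both counts are within range."""
    (x, y), (loops, conds) = program_1, program_2
    return within(loops, x, tolerance) and within(conds, y, tolerance)


print(comparable(CMT["Bubble sort"], CMT["Bubble sort"]))
print(comparable(CMT["Bubble sort"], CMT["Linear search"]))
```

Pairs that fail this check are given a similarity of 0 without computing the difference matrix, which is consistent with the zero entries outside the diagonal blocks of tables 10a and 10b.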
Table 10a. CST, CMT and s=n/|r1-r2|
P1v1 P1v2 P1v3 P1v4 P2v1 P2v2 P2v3 P3v1 P3v2 P3v3 P4v1 P4v2 P4v3 P5v1 P5v2
P1v1 0 37 37 37 0 0 0 0 0 0 0 0 0 0 0
P1v2 37 0 37 37 0 0 0 0 0 0 0 0 0 0 0
P1v3 37 37 0 37 0 0 0 0 0 0 0 0 0 0 0
P1v4 37 37 37 0 0 0 0 0 0 0 0 0 0 0 0
P2v1 0 0 0 0 0 8 8 0 0 0 0 0 0 0 0
P2v2 0 0 0 0 8 0 8 0 0 0 0 0 0 0 0
P2v3 0 0 0 0 8 8 0 0 0 0 0 0 0 0 0
P3v1 0 0 0 0 0 0 0 0 199 199 0 0 0 0 0
P3v2 0 0 0 0 0 0 0 199 0 199 0 0 0 0 0
P3v3 0 0 0 0 0 0 0 199 199 0 0 0 0 0 0
P4v1 0 0 0 0 0 0 0 0 0 0 0 4 4 0 0
P4v2 0 0 0 0 0 0 0 0 0 0 4 0 4 0 0
P4v3 0 0 0 0 0 0 0 0 0 0 4 4 0 0 0
P5v1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 161
P5v2 0 0 0 0 0 0 0 0 0 0 0 0 0 161 0
Table 8 (continued). Rows P5v1 and P5v2
P5v1 3.55 3.55 3.55 3.55 1.00 1.00 1.00 13.91 13.91 13.91 0.75 0.75 0.75 0.00 125.00
P5v2 3.30 3.30 3.30 3.30 0.88 0.88 0.88 9.83 9.83 9.83 0.63 0.63 0.63 125.00 0.00
Table 10b. EFCST, CMT and s= n/|r1-r2|
P1v1 P1v2 P1v3 P1v4 P2v1 P2v2 P2v3 P3v1 P3v2 P3v3 P4v1 P4v2 P4v3 P5v1 P5v2
P1v1 0 36 36 36 0 0 0 0 0 0 0 0 0 0 0
P1v2 36 0 36 36 0 0 0 0 0 0 0 0 0 0 0
P1v3 36 36 0 36 0 0 0 0 0 0 0 0 0 0 0
P1v4 36 36 36 0 0 0 0 0 0 0 0 0 0 0 0
P2v1 0 0 0 0 0 7 7 0 0 0 0 0 0 0 0
P2v2 0 0 0 0 7 0 7 0 0 0 0 0 0 0 0
P2v3 0 0 0 0 7 7 0 0 0 0 0 0 0 0 0
P3v1 0 0 0 0 0 0 0 0 196 196 0 0 0 0 0
P3v2 0 0 0 0 0 0 0 196 0 196 0 0 0 0 0
P3v3 0 0 0 0 0 0 0 196 196 0 0 0 0 0 0
P4v1 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0
P4v2 0 0 0 0 0 0 0 0 0 0 3 0 3 0 0
P4v3 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0
P5v1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 125
P5v2 0 0 0 0 0 0 0 0 0 0 0 0 0 125 0
In the above tables similarity is seen only with versions of the same program. All others are 0’s.
3.4 Experimental Results
The data set of five programs and 15 versions described in earlier sections was created in our lab, and the experimental results with the two approaches have been discussed in detail in sections 3.1 to 3.3.
For thorough testing of the proposed approaches we downloaded programs from 'sourceforge.net' (www.sourceforge.net) and 'f1sourcecode' (www.f1sourcecode.com) and created many versions by changing loop statements, reordering control lines and refactoring. These were added to the sample data set of the earlier sections, giving a data set of 26 distinct programs and 100 versions. To find whether only versions of the same program show higher similarity than other programs, we clustered the similarity values using the k-means clustering algorithm with k=2. The clustering is done on the set of similarity values corresponding to one version of a program (available in a column). The error in duplicate detection of a program 'j' is the ratio of the number of misclassifications to the total number of programs (inclusive of versions). The total misclassification for program 'j' includes the false positives and the false negatives. When a version of program 'j' is clustered with any other program it is a false negative, whereas when a version of program 'i' is clustered with program 'j' it is a false positive.
The average error is computed as the sum of the detection errors of each program divided by the number of distinct programs. Table 11 shows the average error of the two approaches, with and without the CMT, for the sample data sets. Also shown in the table are the similarity measurements using the formula s=n/D, where n is the number of similar control lines and 'D' is the maximum dissimilarity [11].
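The clustering step can be sketched for a single column as below. Several details are assumptions, since the text does not fix them: the two centroids are initialized at the minimum and maximum value, the version's similarity with itself is left out, and the error is computed for one column only rather than aggregated over all versions of a program. The names `two_means` and `detection_error` are hypothetical; the data is the P1v1 column of table 4.

```python
def two_means(values, iterations=100):
    """1-D k-means with k=2; returns True for members of the high-value cluster."""
    low, high = min(values), max(values)  # assumed initialization
    labels = []
    for _ in range(iterations):
        labels = [abs(v - high) < abs(v - low) for v in values]
        high_vals = [v for v, is_high in zip(values, labels) if is_high]
        low_vals = [v for v, is_high in zip(values, labels) if not is_high]
        new_low = sum(low_vals) / len(low_vals) if low_vals else low
        new_high = sum(high_vals) / len(high_vals) if high_vals else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return labels


def detection_error(target, names, column):
    """Misclassified entries of one column divided by the total number of programs."""
    others = [(name, value) for name, value in zip(names, column) if name != target]
    labels = two_means([value for _, value in others])
    program = target.split("v")[0]
    wrong = sum(
        1
        for (name, _), is_high in zip(others, labels)
        if is_high != (name.split("v")[0] == program)
    )
    return wrong / len(names)


names = ["P1v1", "P1v2", "P1v3", "P1v4", "P2v1", "P2v2", "P2v3", "P3v1",
         "P3v2", "P3v3", "P4v1", "P4v2", "P4v3", "P5v1", "P5v2"]
p1v1_column = [0.00, 37.00, 37.00, 37.00, 2.29, 2.29, 2.29, 2.23,
               2.23, 2.23, 1.11, 1.11, 1.11, 4.22, 4.20]

print(detection_error("P1v1", names, p1v1_column))
```

An entry counts as wrong when it falls in the high-similarity cluster without being a version of the same program (false positive), or when it is a version of the same program but falls in the low-similarity cluster (false negative).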
Table 11. Error table for sample data sets.
Approaches   Data structure used   Data set 1                 Data set 2
                                   S=n/D     S=n/|r1-r2|      S=n/D      S=n/|r1-r2|
Approach 1   Only CST              0.1465    0.0375           0.5794     0.1038
             CST and CMT           0         0                0.00923    0.00577
Approach 2   Only EFCSTs           0.04      0.0375           0.0866     0.009615
             EFCST and CMT         0         0                0.009615   0.00808
3.5 Time Complexity
Suppose two programs have n1 and n2 source lines and L1 and L2 control statements. Note that the number of control statements in a program is far less than the number of source lines (L << n). Table 12 shows the major steps in the computation of similarity and the corresponding complexities.
Table 12. Time complexity table
Hence the total time complexity is max(θ(n), O(L²)), which is polynomial.
3.6 Performance Evaluation
Experiments were also done with three available tools, Duplo (which uses a string matching technique), PMD (which compares tokens) and CloneDR (AST based), and the results obtained on data set 1 are shown in table 13. The PMD tool reports similarity between user defined function calls and inbuilt functions. The control lines for and while from figure 1 are not shown as similar. CloneDR is sometimes sensitive to a change in the type of loop statement.
We divided data set 2, which is used in section 3.4, into two data sets. The first data set has 15 distinct programs and 50 variants. This data set has variation in the sequence of control statements (independent control lines only) in versions of the same program. The second data set has 11 distinct programs and 50 variants. In this data set the contents of control lines are replaced by function calls (refer fig 1).
Experiments were conducted on the two data sets using the two approaches. Tables 14a and 14b show the performance analysis for the proposed methods.
Table 13. Performance analysis table
Sl. no  Method                              Error    Remarks
1       Duplo                               1.8666   All versions of beam search show some similarity with all versions of minmax; bubble sort programs are not shown as similar programs.
2       PMD                                 1.6      All versions of beam search show some similarity with all versions of minmax; queue programs are not shown as similar programs.
3       Clone DR                            1.8666   All versions of beam search show some similarity with all versions of minmax; queue programs are not shown as similar programs.
4       Proposed approaches: Only CST       0.14658  Linear search and beam search programs show similarity with versions of other programs
        Proposed approaches: Only EFCST     0.04     Linear search program shows similarity with bubble sort programs
        Proposed approaches: CST & CMT      0        Similarity exists with its versions only
        Proposed approaches: EFCST & CMT    0
Table 12 (contents). Steps and their complexity
Steps                   Complexity
Preprocessing           θ(n1) + θ(n2)
CST / EFCST             θ(n1) + θ(n2)
Difference matrix       θ(L1 x L2)
Similarity computation  O(L1 x L2)
Table 14a. Performance analysis table (without considering CMT)
Without CMT
Data structure and similarity measure used  Data set 1  Data set 2  Data set 3
CST & s=n/d                                 0.14658     0.34        0.292727
CST & s=n/|r1-r2|                           0.0375      0.0866      0.092727
EFCST & s=n/|r1-r2|                         0.0375      0.0373      0.049
Fig 4: Error graph for proposed approaches without
considering CMT
Table 14b. Performance analysis table (considering CMT)
With CMT
Data structure and similarity measure used  Data set 1  Data set 2  Data set 3
CST & s=n/d                                 0           0.0133      0.0436
CST & s=n/|r1-r2|                           0           0.00933     0.02
EFCST & s=n/|r1-r2|                         0           0           0
Fig 5: Error graph for proposed approaches considering CMT
4. CONCLUSION AND FUTURE WORK
We have proposed two approaches, based on the Control Structure Table (CST) and the Execution Flow Control Structure Table (EFCST), to detect duplicate code. We also suggested using the Control Metric Table (CMT) before computing the similarity measure. Adding the CMT reduced the detection error to zero on data set 1 and lowered it on the other data sets (tables 14a and 14b).
The time complexity is max(θ(n), O(L²)), where 'n' is the total number of source lines and 'L' is the total number of control statements in the program. This is far less than the time complexity of methods based on AST and PDG. The method also identifies all four types of clones.
The proposed algorithms do not take into consideration the statements inside control structures. The current similarity measure can be refined to consider these statements together with their operators and operands, which may significantly decrease the errors currently observed.
5. REFERENCES
[1] Baker S., "A Program for Identifying Duplicated Code," Computing Science and Statistics, vol. 24, 1992.
[2] Johnson J. H., "Substring Matching for Clone Detection and Change Tracking," in Proceedings of the International Conference on Software Maintenance, 1994.
[3] Ducasse S., Rieger M., and Demeyer S., "A Language Independent Approach for Detecting Duplicated Code," in Proceedings of the IEEE International Conference on Software Maintenance, 1999.
[4] Zhang Q., et al., "Efficient Partial-Duplicate Based on Sequence Matching," 2010.
[5] Sadowski C. and Levin G., "SimHash: Hash-Based Similarity Detection," 2007.
[6] Jiang L. and Glondu S., "Deckard: Scalable and Accurate Tree-Based Detection of Code Clones."
[7] Baxter I. D., Yahin I., Moura L., Anna M. S., and Bier L., "Clone Detection Using Abstract Syntax Trees," in Proceedings of ICSM, IEEE, 1998.
[8] Krinke J., "Identifying Similar Code with Program Dependency Graphs," in Proc. Eighth Working Conference on Reverse Engineering, 2001.
[9] Vidya K. and Thirukumar K., "Identifying Functional Clones between Java Directory using Metric Based Systems," International Journal of Computer Communication and Information System (IJCCI), vol. 3, ISSN: 2277-128x, August 2013.
[10] Mayrand J., Leblanc C., and Ettore Merlo M., "Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics," in Proc. of ICSM, 1996.
[11] Sudhamani M. and Rangarajan L., "Structural Similarity Detection using Structure of Control Statements," in Proc. of International Conference on Information and Communication Technology, vol. 46 (2015), 892-899.
[12] Kodhai E., Perumal A., and Kanmani S., "Clone Detection using Textual and Metric Analysis to figure out all Types of Clones," International Journal of Computer Communication and Information System (IJCCI), vol. 2, no. 1, ISSN: 0976-1349, July-Dec 2010.
[13] www.research.cs.queensu.ca.
[14] Bellon S., Koschke R., Antoniol G., Krinke J., and Merlo E., "Comparison and Evaluation of Clone Detection Tools," IEEE Transactions on Software Engineering, September 2007.
[15] Roy C. K. and Cordy J. R., "A Survey on Software Clone Detection Research," Tech. rep. TR 2007-541, School of Computing, Queen's University at Kingston, Ontario, Canada, 2007.
[16] Roy C. K., Cordy J. R., and Koschke R., "Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach," Science of Computer Programming, 74 (2009), 470-495.
[17] Kontogiannis K., "Evaluation Experiments on the Detection of Programming Patterns using Software Metrics," in Proceedings of the 3rd Working Conference on Reverse Engineering (WCRE'97), pp. 44-54, Amsterdam, the Netherlands, October 1997.

PDF
Policies for Green Computing and E-Waste in Nigeria
PDF
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
PDF
Optimum Location of DG Units Considering Operation Conditions
PDF
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
PDF
Web Scraping for Estimating new Record from Source Site
PDF
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
PDF
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
PDF
A Strategy for Improving the Performance of Small Files in Openstack Swift
PDF
Integrated System for Vehicle Clearance and Registration
PDF
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
PDF
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Advancements in Structural Integrity: Enhancing Frame Strength and Compressio...
Maritime Cybersecurity: Protecting Critical Infrastructure in The Digital Age
Leveraging Machine Learning for Proactive Threat Analysis in Cybersecurity
Leveraging Topological Data Analysis and AI for Advanced Manufacturing: Integ...
Leveraging AI and Principal Component Analysis (PCA) For In-Depth Analysis in...
The Intersection of Artificial Intelligence and Cybersecurity: Safeguarding D...
Leveraging AI and Deep Learning in Predictive Genomics for MPOX Virus Researc...
Text Mining in Digital Libraries using OKAPI BM25 Model
Green Computing, eco trends, climate change, e-waste and eco-friendly
Policies for Green Computing and E-Waste in Nigeria
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Optimum Location of DG Units Considering Operation Conditions
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Web Scraping for Estimating new Record from Source Site
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
A Strategy for Improving the Performance of Small Files in Openstack Swift
Integrated System for Vehicle Clearance and Registration
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*

Recently uploaded (20)

PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PPT
Teaching material agriculture food technology
PDF
KodekX | Application Modernization Development
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Advanced Soft Computing BINUS July 2025.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
Dropbox Q2 2025 Financial Results & Investor Presentation
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
Teaching material agriculture food technology
KodekX | Application Modernization Development
Network Security Unit 5.pdf for BCA BBA.
NewMind AI Weekly Chronicles - August'25 Week I
Advanced Soft Computing BINUS July 2025.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Chapter 3 Spatial Domain Image Processing.pdf
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
Spectral efficient network and resource selection model in 5G networks
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
NewMind AI Monthly Chronicles - July 2025

Duplicate Code Detection using Control Statements

  • 1. International Journal of Computer Applications Technology and Research, Volume 4, Issue 10, 728-736, 2015, ISSN: 2319-8656, www.ijcat.com

Duplicate Code Detection using Control Statements

Sudhamani M, Department of Computer Science, University of Mysore, Mysore 570006, India
Lalitha Rangarajan, Department of Computer Science, University of Mysore, Mysore 570006, India

Abstract: Code clone detection is an important area of research, as reusability is a key factor in software evolution. Duplicate code degrades the design and structure of software and software qualities such as readability, changeability and maintainability. Code clones increase the maintenance cost, as incorrect changes in copied code may lead to more errors. In this paper we address structural code similarity detection and propose new methods to detect structural clones using the structure of control statements. By structure we mean the order of control statements used in the source code. We have considered two orders of control structures: (i) the sequence of control statements as it appears, and (ii) the execution flow of control statements.

Keywords: Control statements; Control structure; Execution flow; Similarity value; Structural similarity.

1. INTRODUCTION
Duplicate codes are identical or similar code fragments present in a software program. Two code fragments are similar if they are similar in their structure of control statements and have similar control flow between control lines [1, 15].
Different types of code clones are [15]:
Type 1: Exactly similar code fragments, except for white space and comments, as shown in the example below.
Ex 1:
Segment 1:
if(n>0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if ( n > 0 )
{
n = n * 1; //multiply by +1
}
else
n = n * -1; // multiply by -1
Type 2: Syntactically similar code fragments, except for changes in variable, literal and function names.
Ex 2:
Segment 1:
if (n>0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if ( m > 0 )
{
m = m * 1; //multiply by +1
}
else
m = m * -1; // multiply by -1
Type 3: Similar code fragments with slight modifications such as reordering, addition or deletion of some statements in already existing or copied code fragments.
Ex 3:
Segment 1:
if (n > 0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
Segment 2:
if (n > 0)
{
n=n*1; //multiply by plus 1
}
else
n=n*-1; // multiply by minus 1
x=5; //newly added statement
In the above example a new statement x=5 is added.
Type 4: Functionally similar code fragments. The example below shows a non-recursive and a recursive way of finding the factorial of n (the same program implemented in two ways).
Ex 4:
Segment 1:
int i, j=1, n;
for (i=1; i<=n; i++)
j=j*i;
Segment 2:
int fact(int n)
{
if (n == 0)
return 1 ;
else
return n * fact(n-1) ;
}
The output of a program depends on the execution flow of its effective source lines. The execution flow of source lines depends on the control lines used in the program. Control lines considered here are iterative statements (for, while and do-while), conditional statements (if, if-else and switch-case), and
  • 2. function calls.

Fig 1: Different versions of bubble sort program

Here we propose two approaches to find structural similarity. Approach 1 considers the order of control statements present in the code segments, and approach 2 depends on the execution flow of control lines in the program. Figure 1 shows three different ways of writing a bubble sort program. To find the similarity of these programs we compute control structure metrics.
The rest of the paper is organized as follows. Section 2 covers key literature, section 3 describes the proposed methods and results, and section 4 concludes the work with suggestions on possible future work.

2. RELATED WORK
Duplicate code detection mainly consists of two phases: transformation and comparison. In the transformation phase, source code is transformed into an Internal Code Format (ICF). Depending on the ICF, match detection techniques are classified as follows [15].
i. String Based: In these techniques source code is considered as an arrangement of characters, strings or lines, and string matching techniques are used to detect duplicate code [2]. The Dup tool compares lexemes instead of plain strings and finds partial matches [2, 3, 4]. Ducasse et al [5] proposed a dynamic matching technique to detect code clones. String based techniques are simple and language independent, and detect type I clones [13, 14, 15, 16].
ii. Token Based: In the token based approach source code is transformed into a sequence of tokens using a lexer or parser. These token sequences are then compared to find duplicate code. This technique detects both type I and type II clones. Kamiya et al's [5] CCFinder transforms the source file into a set of tokens, derives a single token sequence from them, and uses a suffix tree substring matching algorithm to detect code clones.
CP-Miner uses a frequent substring matching algorithm on tokenized statements to find replicated code. SIM compares token sequences using a dynamic programming string alignment technique. Winnowing and JPlag are token based plagiarism detection tools [13, 14, 15, 16].
iii. Tree Based: Source text is parsed with an appropriate parser to obtain an Abstract Syntax Tree (AST) or parse tree. Tree matching techniques are then used to find similar subtrees. This approach efficiently detects type I, type II and type III clones [5, 6]. As the AST does not capture data flow between control statements, it fails to detect type IV clones. Baxter et al's CloneDR finds resemblance between programs by matching subtrees of the corresponding source programs [15].
iv. Graph Based: The source program is converted into a Program Dependency Graph (PDG), which contains the data flow and control flow information of the program [6]. Isomorphic subgraph detection algorithms are then used to find duplicate code. This technique efficiently identifies all types of clones. However, generating the PDG is costly and finding isomorphic subgraphs is NP-hard [8]. Komondoor and Horwitz's PDG-DUP uses program slicing to find isomorphic subgraphs, and Krinke uses an iterative approach to detect maximal similar subgraphs. GPLAG is a graph based plagiarism detection tool [11, 16].
v. Metric Based: In this technique different metrics are computed for code fragments and these metric values are compared to find duplicate code [9, 10, 11, 12]. An AST or PDG representation can be used to calculate metrics such as the number of nodes or the number of control edges present in the graph. Other common metrics are the number of source lines, the number of function calls, the number of local and global variables, and McCabe's cyclomatic complexity. eMetric, Covert and
  • 3. Moss are metric based tools [15, 16]. Kontogiannis et al. [17] built an abstract pattern matching tool that identifies probable matches using Markov models to measure the similarity between two programs.

3. PROPOSED METHOD
Here we propose two approaches to find duplicate code. The stages of the proposed method are preprocessing, metric computation, difference matrix computation and similarity value calculation. The architecture of the proposed method is shown in figure 2, and each stage is explained subsequently.

Preprocessing and template conversion
In the preprocessing stage extra spaces and comments are removed and the input source program is transformed into its standard intermediate template form. Figure 3 shows the template form of the versions of the sort program in figure 1. This template is used to compute control structure metrics.

Fig 2: Architecture of proposed method
Fig 3: Templates of sort programs in figure 1

Note that the order and structure of control statements differ across versions. Some versions have function calls and some do not. Yet the proposed approaches can detect duplicates with high accuracy.

3.1 Approach 1: Computation of similarity using Control Structure Tables (CSTs)
Control Structure Table (CST): The Control Structure Table contains information about the order of the nested control lines used in the program [11]. The CSTs of sort program 1 and sort program 2 in figure 1 are shown in tables 1 and 2.

Table 1. Control structure table for sort program 1
Sl.No  Type of control statement  Loop  Condition
1      Loop                       0     0
2      Loop                       0     0
3      Loop                       1     1
4      Loop                       0     1
5      Condition                  0     0
6      Loop                       0     0

Table 2. Control structure table for sort program 2
Sl.No  Type of control statement  Loop  Condition
1      Loop                       1     1
2      Loop                       0     1
3      Condition                  0     0
4      Loop                       0     0
5      Loop                       0     0

Difference Matrix (D) computation: The difference matrix is calculated using two CSTs.
The difference matrix calculated from tables 1 and 2 is shown in table 3. It shows the difference between all pairs of control statements. The difference matrix (D) is computed from the respective control structure tables: a row of program 1 (corresponding to a control statement) is compared with every row of program 2. Row i of program 1 and row j of program 2 are compared using the city block distance |Ri1 - Rj1| + |Ri2 - Rj2|, where the subscripts 1 and 2 denote the Loop and Condition columns of the CST. For example, the first row of table 1 is compared with the second row of table 2 by computing |0-0| + |0-1| = 1, which is entered at position (1, 2) of the difference matrix (table 3). From this table we can find similar control lines present in the two programs. A zero in a position corresponding to control statements of the same type indicates structural similarity of those control statements in the two programs. For example, the zero at (3, 1) in table 3 implies that iterative statement 3 of program 1 and iterative statement 1 of program 2 are probably similar. A zero at a position whose two control statements are of different types is not comparable; going by tables 1 and 2, the zero at (5, 4) is such a case, because the fifth control statement of program 1 is conditional and the fourth control statement of program 2 is iterative. The zeros that contribute to similarity are highlighted.
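The difference matrix computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each CST row is encoded as a (type, loop, condition) tuple copied from tables 1 and 2, and the names cst1, cst2, difference_matrix and count_similar are hypothetical.

```python
# Sketch only: a CST row is (statement type, Loop column, Condition column).
# Values are copied from tables 1 and 2; all names here are hypothetical.
cst1 = [("L", 0, 0), ("L", 0, 0), ("L", 1, 1),
        ("L", 0, 1), ("C", 0, 0), ("L", 0, 0)]
cst2 = [("L", 1, 1), ("L", 0, 1), ("C", 0, 0), ("L", 0, 0), ("L", 0, 0)]


def difference_matrix(a, b):
    """City block distance between every row of a and every row of b."""
    return [[abs(ra[1] - rb[1]) + abs(ra[2] - rb[2]) for rb in b] for ra in a]


def count_similar(a, b, d):
    """Count the zeros of d whose two control statements have the same type."""
    return sum(1
               for i, ra in enumerate(a)
               for j, rb in enumerate(b)
               if d[i][j] == 0 and ra[0] == rb[0])


D = difference_matrix(cst1, cst2)
n = count_similar(cst1, cst2, D)
for row in D:
    print(row)
print("similar control lines:", n)
```

With these inputs the first row of D is 2 1 0 0 0 and the third row is 0 1 2 2 2, and the number of type-matched zeros is 9.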
  • 4. Table 3. Distance matrix computed from tables 1 and 2 (rows: control lines of program 1; columns: control lines of program 2, in the order of table 2)
Control lines  Loop (L)  Loop (L)  Cond (C)  Loop (L)  Loop (L)
Loop (L)       2         1         0         0         0
Loop (L)       2         1         0         0         0
Loop (L)       0         1         2         2         2
Loop (L)       1         0         1         1         1
Cond (C)       2         1         0         0         0
Loop (L)       2         1         0         0         0

Similarity between codes is found using the formula

s = \begin{cases} n & \text{if } r_1 = r_2 \\ \dfrac{n}{|r_1 - r_2|} & \text{otherwise} \end{cases} \qquad (1)

where n is the number of similar control lines (the zeros that contribute to similarity) and r1 and r2 are the numbers of control lines in the two programs. From table 3, n = 9, r1 = 6 and r2 = 5, so s = 9/1 = 9. We conducted experiments using data set 1, which has 5 distinct programs and 15 variants; the similarity values are shown in table 4.

Table 4. Similarity table for data set 1 (s=n/|r1-r2|)

We may observe in table 4 that every program shows its highest similarity only with its own variants.

3.2 Approach 2: Computation of similarity using execution flow of control statements
In the preprocessing stage all functions are placed above the main function. The Function Information Table (FIT) and the CST are generated in a single scan of the program.
Function Information Table (FIT): The FIT gives the positions in the CST where a particular function begins and ends. Here function calls are considered as control lines. The FITs of sort programs 2 and 3 are shown in tables 5a and 5b. The CSTs of these programs are shown in tables 6a and 6b.

Table 5a. Function Information Table (FIT) for sort program 2
Sl. No  Function name  Start position  End position
1       Sort           1               3
2       Print          4               4
3       main           5               8

Table 5b. Function Information Table (FIT) for sort program 3
Sl. No  Function name  Start position  End position
1       Sort           1               2
2       Print          3               3
3       Main           4               8

Line 1 (the first control statement) of program 2 marks the beginning of function 'sort', so the function name and start position are entered in the FIT (table 5a). The control statements scanned from line 1 onwards are recorded sequentially in the CST (table 6a) until the end of the function.
The end of the function, namely line 3, is then recorded in the FIT. Thus the FIT and CST are generated in one scan. The Execution Flow Control Structure Table (EFCST) is computed from the CST and FIT by replacing each function call with the control lines of the called function.

Table 4 data (similarity values for data set 1, s=n/|r1-r2|):
Programs  P1v1  P1v2  P1v3  P1v4  P2v1  P2v2  P2v3  P3v1  P3v2  P3v3  P4v1  P4v2  P4v3  P5v1  P5v2
P1v1  0.00  37.00  37.00  37.00  2.29  2.29  2.29  2.23  2.23  2.23  1.11  1.11  1.11  4.22  4.20
P1v2  37.00  0.00  37.00  37.00  2.29  2.29  2.29  2.23  2.23  2.23  1.11  1.11  1.11  4.22  4.20
P1v3  37.00  37.00  0.00  37.00  2.29  2.29  2.29  2.23  2.23  2.23  1.11  1.11  1.11  4.22  4.20
P1v4  37.00  37.00  37.00  0.00  2.29  2.29  2.29  2.23  2.23  2.23  1.11  1.11  1.11  4.22  4.20
P2v1  2.29  2.29  2.29  2.29  0.00  8.00  8.00  0.83  0.83  0.83  2.50  2.50  2.50  1.13  1.18
P2v2  2.29  2.29  2.29  2.29  8.00  0.00  8.00  0.83  0.83  0.83  2.50  2.50  2.50  1.13  1.18
P2v3  2.29  2.29  2.29  2.29  8.00  8.00  0.00  0.83  0.83  0.83  2.50  2.50  2.50  1.13  1.18
P3v1  2.23  2.23  2.23  2.23  0.83  0.83  0.83  0.00  199.00  199.00  0.61  0.61  0.61  10.92  13.83
P3v2  2.23  2.23  2.23  2.23  0.83  0.83  0.83  199.00  0.00  199.00  0.61  0.61  0.61  10.92  13.83
P3v3  2.23  2.23  2.23  2.23  0.83  0.83  0.83  199.00  199.00  0.00  0.61  0.61  0.61  10.92  13.83
P4v1  1.11  1.11  1.11  1.11  2.50  2.50  2.50  0.61  0.61  0.61  0.00  4.00  4.00  0.83  0.89
P4v2  1.11  1.11  1.11  1.11  2.50  2.50  2.50  0.61  0.61  0.61  4.00  0.00  4.00  0.83  0.89
P4v3  1.11  1.11  1.11  1.11  2.50  2.50  2.50  0.61  0.61  0.61  4.00  4.00  0.00  0.83  0.89
P5v1  4.22  4.22  4.22  4.22  1.13  1.13  1.13  10.92  10.92  10.92  0.83  0.83  0.83  0.00  161.00
P5v2  4.20  4.20  4.20  4.20  1.18  1.18  1.18  13.83  13.83  13.83  0.89  0.89  0.89  161.00  0.00
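The similarity values in table 4 come from formula (1). A minimal Python sketch of that formula follows; the function name similarity is an assumption for illustration, and the sketch does not special-case self-comparison, which table 4 reports as 0.00 on the diagonal.

```python
def similarity(n, r1, r2):
    """Formula (1): s = n if r1 == r2, otherwise n / |r1 - r2|.

    n  -- number of similar control lines (contributing zeros)
    r1 -- number of control lines in program 1
    r2 -- number of control lines in program 2
    """
    if r1 == r2:
        return float(n)
    return n / abs(r1 - r2)


# Sort programs 1 and 2 from tables 1 and 2: n = 9, r1 = 6, r2 = 5.
print(similarity(9, 6, 5))
```

Note that the measure is not normalized to a fixed range: it grows with the number of matching control lines, which is why larger programs reach values such as 199 in table 4.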
  • 5. Table 6a. Control Structure Table (Order) for program 2 in figure 1
Sl. no  Control statement  Loop  Condition
1       Loop               1     1
2       Loop               0     1
3       Condition          0     0
4       Loop               0     0
5       Loop               0     0
6       Print              0     0
7       Sort               0     0
8       Print              0     0

Table 6b. Control Structure Table (Order) for program 3 in figure 1
Sl. no  Control statement  Loop  Condition
1       Loop               0     1
2       Condition          0     0
3       Loop               0     0
4       Loop               0     0
5       Print              0     0
6       Loop               0     0
7       Sort               0     0
8       Print              0     0

The Execution Flow Control Structure Table (EFCST) of program 2 is given in table 7. Execution flow starts in 'main'. From the FIT we see that the flow starts at line 5 and ends at line 8. The entries in these lines are copied to the EFCST. However, if a function call is present, the FIT is consulted and the control lines of the called function, between its start and end positions, are copied to the EFCST instead. The EFCSTs of programs 1, 2 and 3 in figure 1 are shown in table 7.

Table 7. EFCST of program 1, 2 and 3
Sl. no  Control statement  Loop  Condition
1       Loop               0     0
2       Loop               0     0
3       Loop               1     1
4       Loop               0     1
5       Condition          0     0
6       Loop               0     0

The difference matrix is computed from two EFCSTs as in section 3.1, and the similarity value is computed using formula (1). We conducted experiments on data set 1 and the results are shown in table 8.
Table 8.
EFCST and s=n/|r1-r2|
Programs  P1v1  P1v2  P1v3  P1v4  P2v1  P2v2  P2v3  P3v1  P3v2  P3v3  P4v1  P4v2  P4v3  P5v1  P5v2
P1v1  0.00  36.00  36.00  36.00  2.14  2.14  2.14  2.14  2.14  2.14  1.00  1.00  1.00  3.55  3.30
P1v2  36.00  0.00  36.00  36.00  2.14  2.14  2.14  2.14  2.14  2.14  1.00  1.00  1.00  3.55  3.30
P1v3  36.00  36.00  0.00  36.00  2.14  2.14  2.14  2.14  2.14  2.14  1.00  1.00  1.00  3.55  3.30
P1v4  36.00  36.00  36.00  0.00  2.14  2.14  2.14  2.14  2.14  2.14  1.00  1.00  1.00  3.55  3.30
P2v1  2.14  2.14  2.14  2.14  0.00  7.00  7.00  0.76  0.76  0.76  2.00  2.00  2.00  1.00  0.88
P2v2  2.14  2.14  2.14  2.14  7.00  0.00  7.00  0.76  0.76  0.76  2.00  2.00  2.00  1.00  0.88
P2v3  2.14  2.14  2.14  2.14  7.00  7.00  0.00  0.76  0.76  0.76  2.00  2.00  2.00  1.00  0.88
P3v1  2.14  2.14  2.14  2.14  0.76  0.76  0.76  0.00  196.00  196.00  0.58  0.58  0.58  13.91  9.83
P3v2  2.14  2.14  2.14  2.14  0.76  0.76  0.76  196.00  0.00  196.00  0.58  0.58  0.58  13.91  9.83
P3v3  2.14  2.14  2.14  2.14  0.76  0.76  0.76  196.00  196.00  0.00  0.58  0.58  0.58  13.91  9.83
P4v1  1.00  1.00  1.00  1.00  2.00  2.00  2.00  0.58  0.58  0.58  0.00  3.00  3.00  0.75  0.63
P4v2  1.00  1.00  1.00  1.00  2.00  2.00  2.00  0.58  0.58  0.58  3.00  0.00  3.00  0.75  0.63
P4v3  1.00  1.00  1.00  1.00  2.00  2.00  2.00  0.58  0.58  0.58  3.00  3.00  0.00  0.75  0.63
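The EFCST construction used for these results can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the CST and FIT values are copied from tables 6a and 5a for sort program 2, the names cst, fit and expand are hypothetical, and the recursion guard is an addition that the paper does not discuss.

```python
# Sketch only. CST rows of sort program 2 (table 6a). A row is either a
# control statement ("Loop"/"Condition", loop count, condition count) or a
# call to a user function ("Print"/"Sort", 0, 0). Names are hypothetical.
cst = [
    ("Loop", 1, 1), ("Loop", 0, 1), ("Condition", 0, 0),  # sort: lines 1-3
    ("Loop", 0, 0),                                        # print: line 4
    ("Loop", 0, 0), ("Print", 0, 0), ("Sort", 0, 0), ("Print", 0, 0),  # main: 5-8
]

# FIT of sort program 2 (table 5a): name -> (start, end), 1-based, inclusive.
fit = {"sort": (1, 3), "print": (4, 4), "main": (5, 8)}


def expand(cst, fit, name="main", active=()):
    """Return the EFCST rows of function `name`, inlining called functions."""
    if name in active:  # assumed guard against recursive calls
        return []
    start, end = fit[name]
    rows = []
    for row in cst[start - 1:end]:
        callee = row[0].lower()
        if callee in fit:
            # Replace the call by the control lines of the called function.
            rows.extend(expand(cst, fit, callee, active + (name,)))
        else:
            rows.append(row)
    return rows


efcst = expand(cst, fit)
for row in efcst:
    print(row)
```

For program 2 this reproduces the six rows of table 7. The sketch copies the Loop and Condition counts unchanged; when a call sits inside a loop, as in program 3, the counts of the enclosing loop would presumably need to be updated as well, which this sketch does not attempt.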
  • 6. Here also all programs show high similarity only with versions of the same program.

3.3 Similarity computation using CSTs, EFCSTs and Control Metric Table (CMT)
Control Metric Table (CMT): We compute a control metric table, which contains the total number of iterative and conditional statements present in each program [11]. Table 9 shows the CMT of data set 1 used for our experiment.

Table 9. Control Metric Table (CMT) for data set 1 (columns 1 to 4 are versions; L: loops, C: conditional statements)
Sl. No  Programs       1:L  1:C  2:L  2:C  3:L  3:C  4:L  4:C
1       Beam search    10   2    10   2    10   2    10   2
2       Bubble sort    4    1    4    1    4    1    -    -
3       Min Max        15   19   15   19   15   19   -    -
4       Linear search  2    1    2    1    -    -    -    -
5       Queue          3    18   3    18   3    18   -    -

Computation of similarity value (s): Here the similarity computation is based on the CMT as well as the CST or EFCST. First we generate the CMT and CST for each program. The difference matrix (D) is computed from the respective CSTs as explained in sub sections 3.1 and 3.2. We compute similarity between programs only if the programs are comparable in terms of their numbers of loops and conditional statements. When duplicates are created, more than 20 % variation in the number of control statements is unlikely. Hence a threshold of 20 % variation in these numbers is fixed for the computation of similarity. Suppose program 1 has x loops and y conditional statements. Program 2 is comparable with program 1 if its numbers of loops and conditional statements are in the ranges [x - 20 % (x), x + 20 % (x)] and [y - 20 % (y), y + 20 % (y)] respectively. Tables 10a and 10b show the similarity values computed with this additional consideration of the CMT.
Table 10a.
CST, CMT and s=n/|r1-r2|
Programs  P1v1  P1v2  P1v3  P1v4  P2v1  P2v2  P2v3  P3v1  P3v2  P3v3  P4v1  P4v2  P4v3  P5v1  P5v2
P1v1  0  37  37  37  0  0  0  0  0  0  0  0  0  0  0
P1v2  37  0  37  37  0  0  0  0  0  0  0  0  0  0  0
P1v3  37  37  0  37  0  0  0  0  0  0  0  0  0  0  0
P1v4  37  37  37  0  0  0  0  0  0  0  0  0  0  0  0
P2v1  0  0  0  0  0  8  8  0  0  0  0  0  0  0  0
P2v2  0  0  0  0  8  0  8  0  0  0  0  0  0  0  0
P2v3  0  0  0  0  8  8  0  0  0  0  0  0  0  0  0
P3v1  0  0  0  0  0  0  0  0  199  199  0  0  0  0  0
P3v2  0  0  0  0  0  0  0  199  0  199  0  0  0  0  0
P3v3  0  0  0  0  0  0  0  199  199  0  0  0  0  0  0
P4v1  0  0  0  0  0  0  0  0  0  0  0  4  4  0  0
P4v2  0  0  0  0  0  0  0  0  0  0  4  0  4  0  0
P4v3  0  0  0  0  0  0  0  0  0  0  4  4  0  0  0
P5v1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  161
P5v2  0  0  0  0  0  0  0  0  0  0  0  0  0  161  0

Table 10b. EFCST, CMT and s=n/|r1-r2|
Programs  P1v1  P1v2  P1v3  P1v4  P2v1  P2v2  P2v3  P3v1  P3v2  P3v3  P4v1  P4v2  P4v3  P5v1  P5v2
P1v1  0  36  36  36  0  0  0  0  0  0  0  0  0  0  0
P1v2  36  0  36  36  0  0  0  0  0  0  0  0  0  0  0
P1v3  36  36  0  36  0  0  0  0  0  0  0  0  0  0  0
P1v4  36  36  36  0  0  0  0  0  0  0  0  0  0  0  0

Table 8 (continued): EFCST and s=n/|r1-r2|
P5v1  3.55  3.55  3.55  3.55  1.00  1.00  1.00  13.91  13.91  13.91  0.75  0.75  0.75  0.00  125.00
P5v2  3.30  3.30  3.30  3.30  0.88  0.88  0.88  9.83  9.83  9.83  0.63  0.63  0.63  125.00  0.00
  • 7. Table 10b (continued): EFCST, CMT and s=n/|r1-r2|
P2v1  0  0  0  0  0  7  7  0  0  0  0  0  0  0  0
P2v2  0  0  0  0  7  0  7  0  0  0  0  0  0  0  0
P2v3  0  0  0  0  7  7  0  0  0  0  0  0  0  0  0
P3v1  0  0  0  0  0  0  0  0  196  196  0  0  0  0  0
P3v2  0  0  0  0  0  0  0  196  0  196  0  0  0  0  0
P3v3  0  0  0  0  0  0  0  196  196  0  0  0  0  0  0
P4v1  0  0  0  0  0  0  0  0  0  0  0  3  3  0  0
P4v2  0  0  0  0  0  0  0  0  0  0  3  0  3  0  0
P4v3  0  0  0  0  0  0  0  0  0  0  3  3  0  0  0
P5v1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  125
P5v2  0  0  0  0  0  0  0  0  0  0  0  0  0  125  0

In the above tables similarity is seen only with versions of the same program. All other values are 0.

3.4 Experimental Results
The data set of 5 programs and 15 versions described in the earlier sections was created in our lab, and the experimental results with the two approaches have been discussed in detail in sections 3.1 to 3.3. For thorough testing of the proposed approaches we downloaded programs from 'sourceforge.net' (www.sourceforge.net) and 'f1sourcecode' (www.f1sourcecode.com) and created many versions by changing loop statements, reordering control lines and refactoring. These were added to the sample data set of the earlier sections. Thus we created a data set of 26 distinct programs and 100 versions. To find whether versions of the same program show higher similarity with each other than with other programs, we clustered the similarity values using the k-means clustering algorithm with k=2. The clustering is done on the set of similarity values corresponding to one version of a program (available in a column). The error in duplicate detection of a program 'j' is the ratio of the number of misclassifications to the total number of programs (inclusive of versions). The total misclassification for program 'j' includes the number of false positives and the number of missed versions (false negatives).
When a version of program 'j' is clustered with any other program it counts as a missed version (a false negative), whereas when a version of program 'i' is clustered with program 'j' it is a false positive. The average error is computed by dividing the total detection error over all programs by the number of distinct programs. Table 11 shows the average error of the two approaches, with and without the CMT, for the sample data sets. The table also shows results for the similarity measure s=n/D, where n is the number of similar control lines and 'D' is the maximum dissimilarity [11].

Table 11. Error table for sample data sets
Approaches  Data structure used  Data set 1: s=n/D  Data set 1: s=n/|r1-r2|  Data set 2: s=n/D  Data set 2: s=n/|r1-r2|
Approach 1  Only CST             0.1465             0.0375                   0.5794             0.1038
Approach 1  CST and CMT          0                  0                        0.00923            0.00577
Approach 2  Only EFCST           0.04               0.0375                   0.0866             0.009615
Approach 2  EFCST and CMT        0                  0                        0.009615           0.00808
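The "CST and CMT" and "EFCST and CMT" rows of table 11 rely on the comparability filter of section 3.3. A minimal Python sketch of that filter is given below, assuming each CMT entry is a (loops, conditions) pair as in table 9; the names comparable and gated_similarity are hypothetical, and returning 0 for non-comparable pairs is inferred from the zeros in tables 10a and 10b.

```python
def comparable(p1, p2, tolerance=0.20):
    """True if the loop and condition counts of p2 lie within +/- 20 % of p1.

    p1 and p2 are (loops, conditions) pairs taken from the CMT.
    """
    return all(abs(b - a) <= tolerance * a for a, b in zip(p1, p2))


def gated_similarity(p1, p2, s):
    """Keep the similarity value s only for comparable programs (assumed rule)."""
    return s if comparable(p1, p2) else 0.0


# CMT entries from table 9.
beam_search, bubble_sort, linear_search = (10, 2), (4, 1), (2, 1)
print(comparable(beam_search, beam_search))
print(comparable(bubble_sort, linear_search))
```

As written, the range is taken relative to the first program, following the wording of section 3.3, so the check is not symmetric in general.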
  • 8. 3.5 Time Complexity
Suppose two programs have n1 and n2 source lines and L1 and L2 control statements. Note that the number of control statements in a program is far less than the number of source lines (L << n). Table 12 shows the major steps in the computation of similarity and the corresponding complexities.

Table 12. Time complexity table

Hence the total time complexity is max(θ(n), O(L^2)), which is polynomial.

3.6 Performance Evaluation
The experiments were done with three available tools, Duplo (uses a string matching technique), PMD (uses tokens to compare) and CloneDR (AST based), and the results obtained on data set 1 are shown in table 13. PMD reports similarity between a user defined function call and an inbuilt function, whereas the for and while control lines in figure 1 are not reported as similar. CloneDR is sometimes sensitive to a change in the type of loop statement.
We divided data set 2, which is used in section 3.4, into two data sets. The first data set has 15 distinct programs and 50 variants. It has variation in the sequence of control statements (independent control lines only) in versions of the same program. The second data set has 11 distinct programs and 50 variants. In this data set the contents of control lines are replaced by function calls (refer to fig 1). Experiments were conducted on the two data sets using the two approaches. Tables 14a and 14b show the performance analysis for the proposed methods.

Table 13. Performance analysis table
Sl. no  Method  Error   Remarks
1       Duplo   1.8666  All versions of beam search show some similarity with all versions of minmax; bubble sort programs are not shown as similar programs.
2       PMD     1.6     All versions of beam search show some similarity with all versions of minmax; queue programs are not shown as similar programs.
3       CloneDR                             1.8666   All versions of beam search show some similarity with all versions of minmax; queue programs are not shown as similar programs.
4       Proposed approaches: Only CST       0.14658  Linear search and beam search programs show similarity with versions of other programs.
        Proposed approaches: Only EFCST     0.04     Linear search program shows similarity with bubble sort programs.
        Proposed approaches: CST & CMT      0        Similarity exists with its versions only.
        Proposed approaches: EFCST & CMT    0        Similarity exists with its versions only.

Table 12 (steps and complexities):
Steps                   Complexity
Preprocessing           θ(n1) + θ(n2)
CST / EFCST             θ(n1) + θ(n2)
Difference matrix       θ(L1 x L2)
Similarity computation  O(L1 x L2)
  • 9. Table 14a. Performance analysis table (without considering CMT)
Data structure and similarity measure used  Data set 1  Data set 2  Data set 3
CST & s=n/d                                 0.14658     0.34        0.292727
CST & s=n/|r1-r2|                           0.0375      0.0866      0.092727
EFCST & s=n/|r1-r2|                         0.0375      0.0373      0.049

Fig 4: Error graph for proposed approaches without considering CMT

Table 14b. Performance analysis table (considering CMT)
Data structure and similarity measure used  Data set 1  Data set 2  Data set 3
CST & s=n/d                                 0           0.0133      0.0436
CST & s=n/|r1-r2|                           0           0.00933     0.02
EFCST & s=n/|r1-r2|                         0           0           0

Fig 5: Error graph for proposed approaches considering CMT

4. CONCLUSION AND FUTURE WORK
We have proposed two approaches, based on the Control Structure Table (CST) and the Execution Flow Control Structure Table (EFCST), to detect duplicate code. We also suggested consulting the Control Metric Table (CMT) before computing the similarity measure. Adding the CMT reduced the error to zero on data set 1 and to below 0.05 on the other data sets (tables 14a and 14b). The time complexity is max(θ(n), O(L^2)), where 'n' is the total number of source lines and 'L' is the total number of control statements in the program. This is far less than the time complexity of methods based on AST and PDG. The method also identifies all four types of clones. The proposed algorithms do not take into consideration the statements inside control structures. The current similarity measure can be extended to consider these statements together with their operators and operands, which may significantly reduce the errors observed currently.

5. REFERENCES
[1] Baker B S., "A Program for Identifying Duplicated Code," Computing Science and Statistics, vol. 24, 1992.
[2] Johnson J H., "Substring matching for clone detection and change tracking," in Proceedings of the International Conference on Software Maintenance, 1994.
[3] Ducasse S., Rieger M., and Demeyer S., "A Language Independent Approach for Detecting Duplicated Code," in Proceedings of the IEEE International Conference on Software Maintenance, 1999.
[4] Zhang Q., et al., "Efficient Partial-Duplicate Based on Sequence Matching," 2010.
[5] Sadowski C., and Levin G., "SimHash: Hash-Based Similarity Detection," 2007.
[6] Jiang L., and Glondu S., "Deckard: Scalable and Accurate Tree-Based Detection of Code Clones."
[7] Baxter I. D., Yahin I., Moura L., Anna M. S., and Bier L., "Clone Detection Using Abstract Syntax Trees," in Proceedings of ICSM, IEEE, 1998.
[8] Krinke J., "Identifying Similar Code with Program Dependency Graphs," Proc. Eighth Working Conference on Reverse Engineering, 2001.
[9] Vidya K. and Thirukumar K., "Identifying Functional Clones between Java Directory using Metric Based Systems," International Journal of Computer Communication and Information System (IJCCI), Vol. 3, ISSN: 2277-128x, August 2013.
[10] Mayrand J., Leblanc C., and Ettore Merlo M., "Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics," Proc. of ICSM Conference, 1996.
[11] Sudhamani M. and Rangarajan L., "Structural Similarity Detection using Structure of Control Statements," Proc. of International Conference on Information and Communication Technology, vol. 46 (2015) 892-899.
[12] Kodhai E., Perumal A., and Kanmani S., "Clone Detection using Textual and Metric Analysis to figure out all Types of Clones," International Journal of Computer Communication and Information System (IJCCI), Vol. 2, No. 1, ISSN: 0976-1349, July-Dec 2010.
[13] www.research.cs.queensu.ca.
[14] Bellon S., Koschke R., Antoniol G., Krinke J., and Merlo E., "Comparison and evaluation of clone detection tools," IEEE Transactions on Software Engineering, September 2007.
[15] Roy C. K. and Cordy J. R., "A survey on software clone detection research," Tech. rep. TR 2007-541, School of Computing, Queen's University at Kingston, Ontario, Canada, 2007.
[16] Roy C. K., Cordy J. R., and Koschke R., "Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach," Science of Computer Programming, 74 (2009) 470-495.
[17] Kontogiannis K., "Evaluation Experiments on the Detection of Programming Patterns using Software Metrics," in Proceedings of the 3rd Working Conference on Reverse Engineering (WCRE'97), pp. 44-54, Amsterdam, the Netherlands, October 1997.