An {Execution-Semantic, Content-and-Context}-Based Code-Clone {Detection, Analysis}
Toshihiro Kamiya
Future University Hakodate
kamiya@fun.ac.jp
Toshihiro Kamiya: An Execution-Semantic and Content-and-Context-Based Code-Clone Detection and Analysis,
Proceedings of the 9th IEEE International Workshop on Software Clones (IWSC'15), pp. 1-7 (2015).
TOC
● Problem/Motivation
● Outline of proposed method
● Example
● Algorithm of clone detection
● Visualization
● Implementation
● Preliminary experiment
The problems / Motivation
● In functional PLs, developers can define their own control structures.
– Analyzing only the pre-defined control statements is no longer sufficient to represent code patterns.
– E.g., if (C) A; else B; ⇔ myIf(C, lambdaA, lambdaB);
→ inter-procedural analysis is needed
● Dynamic dispatching makes inter-procedural analysis difficult.
– Esp. in functional + OO + dynamically typed PLs (no explicit type declarations → hard to analyze dispatches statically)
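The equivalence in the "E.g." line can be made concrete with a small Python sketch. The names my_if, sign_with_statement, and sign_with_my_if are hypothetical and are not taken from the slides; the point is only that a statement-level analysis sees an if statement in one function and an ordinary call in the other.

```python
def my_if(condition, then_branch, else_branch):
    # User-defined control structure: run exactly one of the two thunks.
    return then_branch() if condition else else_branch()


def sign_with_statement(x):
    # Built-in control statement.
    if x >= 0:
        return "non-negative"
    else:
        return "negative"


def sign_with_my_if(x):
    # Same behavior, but the branching happens inside my_if().
    return my_if(x >= 0, lambda: "non-negative", lambda: "negative")
```

Recognizing that the two functions behave alike requires looking inside my_if() and the lambdas, which is the inter-procedural analysis the slide refers to.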
Idea
Detect clones from an execution trace!
● In a trace, dispatches and control structures have already been expanded (resolved).
● The detected clones are inter-procedural, Type-3 clones.
Outline of proposed method
● Execution trace → Call tree → Contents and context (for each node)
● [Figure: call tree of the example program. Node labels: main(), os.listdir(), print_extensions_w_for_stmt(), print_extensions_w_map_func(), os.path.splitext(), print, str.join(), get_extensions(), map(), lambda() at line 8. Annotations: contents, context, clone detection, clone analysis.]
Example code
[Slide code not reproduced: two functions and a helper function.]
These two functions are a semantic clone. They have the same functionality: each finds the extensions of the given files and prints them out.
Shared items and differences:
● Distinct loops: for vs. map
● In one function, all shared items are contained in the function itself.
● In the other, the shared items are spread across functions.
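The slide's source code was an image and did not survive extraction. The following is a hypothetical reconstruction, consistent with the function names in the call tree and the values in the execution trace, not the author's actual listing. One deliberate difference: main() here takes the file names as a parameter instead of calling os.listdir(), so the sketch does not depend on the file system.

```python
import os.path


def print_extensions_w_for_stmt(filenames):
    # Explicit loop: os.path.splitext() and print are called directly here.
    for name in filenames:
        _root, ext = os.path.splitext(name)
        print(ext)


def get_extensions(filenames):
    # Helper function: the loop is hidden inside map() and a lambda.
    return list(map(lambda name: os.path.splitext(name)[1], filenames))


def print_extensions_w_map_func(filenames):
    # Same functionality, but os.path.splitext() is reached via the helper.
    print("\n".join(get_extensions(filenames)))


def main(filenames):
    print_extensions_w_for_stmt(filenames)
    print_extensions_w_map_func(filenames)


if __name__ == "__main__":
    main(["about.txt", "pygoat.data", "greeting.md"])
```

Both print functions reach os.path.splitext() and print, one directly and one through get_extensions(), map(), and the lambda.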
Detection steps
Input: a call tree (← execution trace ← target program)
1. Extract the contents and context of each node
2. Identify sets of contents-sharing nodes
3. Remove redundant nodes (filtering with contexts)
Input
…
call __main__//<module> runpy//_run_code 69
:
load_const __main__//<module> 0
load_const __main__//<module> 12
load_const __main__//<module> 21
load_const __main__//<module> 30
load_const __main__//<module> 39
call __main__//main __main__//<module> 63
:
call __main__//print_extensions_w_for_stmt __main__//main 24
: <list>
call posixpath//splitext __main__//print_extensions_w_for_stmt 25
: 'about.txt'
call genericpath//_splitext posixpath//splitext 18
: 'about.txt' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139
: * 'about' '.txt'
return posixpath//splitext 21
: * 'about' '.txt'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32
: '.txt'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33
: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25
: 'pygoat.data'
call genericpath//_splitext posixpath//splitext 18
: 'pygoat.data' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139
: * 'pygoat' '.data'
return posixpath//splitext 21
: * 'pygoat' '.data'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32
: '.data'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33
: '\n'
return pygoat.hook/Out/write 15
call posixpath//splitext __main__//print_extensions_w_for_stmt 25
: 'greeting.md'
call genericpath//_splitext posixpath//splitext 18
: 'greeting.md' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139
: * 'greeting' '.md'
return posixpath//splitext 21
: * 'greeting' '.md'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32
: '.md'
return pygoat.hook/Out/write 15
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 33
[Figure: Program → Execution trace → Call tree. The call tree is rooted at main() and has the same node labels as the outline figure.]
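How a trace becomes a call tree can be sketched with a stack. This is a minimal sketch under an assumed reading of the trace format shown on the slide: a line "call <callee> <caller> <number>" opens a node, a line "return <callee> <number>" closes it, and lines starting with ":" (values) or "load_const" are skipped. The function name build_call_tree and the nested-dict representation are hypothetical.

```python
SAMPLE_TRACE = """\
call __main__//main __main__//<module> 63
:
call __main__//print_extensions_w_for_stmt __main__//main 24
: <list>
call posixpath//splitext __main__//print_extensions_w_for_stmt 25
: 'about.txt'
call genericpath//_splitext posixpath//splitext 18
: 'about.txt' '/' None '.'
load_const genericpath//_splitext 0
return genericpath//_splitext 139
: * 'about' '.txt'
return posixpath//splitext 21
: * 'about' '.txt'
call pygoat.hook/Out/write __main__//print_extensions_w_for_stmt 32
: '.txt'
return pygoat.hook/Out/write 15
""".splitlines()


def build_call_tree(trace_lines):
    # Artificial root so that the first call has a parent to attach to.
    root = {"label": "<root>", "children": []}
    stack = [root]
    for line in trace_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "call":
            # fields[1] is the callee; it becomes a child of the open call.
            node = {"label": fields[1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif fields[0] == "return" and len(stack) > 1:
            stack.pop()
        # Value lines (":") and load_const lines are ignored.
    return root
```

Calls that never return within the excerpt, such as main here, simply stay open on the stack; their subtrees are still complete up to the last line read.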
Step 1. Extract the contents and context of each node
[Figure: call tree of the example program, rooted at main().]
Node: main()
  Context: (root of the tree)
  Contents: get_extensions(), map(), lambda() at line 8, os.listdir(), os.path.splitext(), print, print_extensions_w_for_stmt(), print_extensions_w_map_func(), str.join()
Node: print_extensions_w_for_stmt()
  Context: main()
  Contents: os.path.splitext(), print
Node: print_extensions_w_map_func()
  Context: main()
  Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()
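Step 1 can be sketched as a single traversal of the call tree. The sketch assumes that the contents of a node are the labels of all nodes below it and that its context is the labels of all nodes above it, which matches the lists on this slide. The tuple representation, the CALL_TREE literal (read off the figure), and the name contents_and_context are illustrative assumptions.

```python
# (label, children) tuples; structure as read from the call-tree figure.
CALL_TREE = ("main()", [
    ("os.listdir()", []),
    ("print_extensions_w_for_stmt()", [
        ("os.path.splitext()", []),
        ("print", []),
    ]),
    ("print_extensions_w_map_func()", [
        ("get_extensions()", [
            ("map()", [
                ("lambda() at line 8", [
                    ("os.path.splitext()", []),
                ]),
            ]),
        ]),
        ("str.join()", []),
        ("print", []),
    ]),
])


def contents_and_context(node, ancestors=()):
    """Return a list of (label, context, contents), this node's entry first."""
    label, children = node
    contents = set()
    entries = []
    for child in children:
        child_entries = contents_and_context(child, ancestors + (label,))
        # The child's own entry is first; its contents are also ours.
        contents.add(child[0])
        contents |= child_entries[0][2]
        entries.extend(child_entries)
    return [(label, set(ancestors), contents)] + entries
```

For print_extensions_w_for_stmt() this yields the context {main()} and the contents {os.path.splitext(), print}.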
Step 2. Identify sets of contents-sharing nodes
The contents of main(), print_extensions_w_for_stmt(), and print_extensions_w_map_func() all include os.path.splitext() and print, so these three nodes form a set of contents-sharing nodes.
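Step 2 can be illustrated on the three nodes above. The prototype uses Apriori frequent item-set mining; the sketch below is a brute-force stand-in, not Apriori: it intersects the contents of every group of two or more nodes, which is only practical for a handful of nodes. The names BALLOONS and content_sharing_sets and the threshold min_items are hypothetical.

```python
from itertools import combinations

# Contents of each node, as listed for Step 1.
BALLOONS = {
    "main()": {
        "get_extensions()", "map()", "lambda() at line 8", "os.listdir()",
        "os.path.splitext()", "print", "print_extensions_w_for_stmt()",
        "print_extensions_w_map_func()", "str.join()",
    },
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}


def content_sharing_sets(balloons, min_items=2):
    """Map each group of nodes to the content items all of them share."""
    shared = {}
    labels = sorted(balloons)
    for size in range(2, len(labels) + 1):
        for group in combinations(labels, size):
            common = set.intersection(*(balloons[n] for n in group))
            if len(common) >= min_items:
                shared[group] = common
    return shared
```

On this input the group of all three nodes shares exactly os.path.splitext() and print.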
Step 3. Remove redundant nodes (filtering with contexts)
main() is included in the context of all the other nodes in the set ⇒ redundant. It is removed, leaving print_extensions_w_for_stmt() and print_extensions_w_map_func().
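The redundancy filter can be sketched as follows, under one reading of the rule on this slide: a node is redundant when its label appears in the context of every other node in the set. The names CONTEXTS and remove_redundant are hypothetical, and the paper may define the filter more precisely.

```python
# Context of each node in the candidate set, as listed for Step 1.
CONTEXTS = {
    "main()": set(),
    "print_extensions_w_for_stmt()": {"main()"},
    "print_extensions_w_map_func()": {"main()"},
}


def remove_redundant(group, contexts):
    """Drop nodes that are in the context of all other nodes of the group."""
    kept = []
    for node in group:
        others = [m for m in group if m != node]
        # Assumed rule: a node above every other member adds no information.
        if others and all(node in contexts[m] for m in others):
            continue
        kept.append(node)
    return kept


GROUP = ["main()", "print_extensions_w_for_stmt()", "print_extensions_w_map_func()"]
CLONE_CLASS = remove_redundant(GROUP, CONTEXTS)
```

Here main() is dropped because both other nodes have it in their context, and the two print functions remain as the clone class.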
Detection result
A clone class:
{ print_extensions_w_map_func(), print_extensions_w_for_stmt() }
Shared items:
{ os.path.splitext(), print }
Node: print_extensions_w_for_stmt()
  Context: main()
  Contents: os.path.splitext(), print
Node: print_extensions_w_map_func()
  Context: main()
  Contents: get_extensions(), map(), lambda() at line 8, os.path.splitext(), print, str.join()
The clone class is displayed as a graph that is dagified (merged) by label (DAG = directed acyclic graph), showing both the context and the contents.
[Figure: DAG with nodes main(), print_extensions_w_for_stmt(), print_extensions_w_map_func(), get_extensions(), print, map(), lambda() at line 8, os.path.splitext().]
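Merging by label can be sketched as collecting one edge per distinct (caller label, callee label) pair, so that nodes with the same label collapse into one. The tree literal follows the nodes shown in the figure; the name dagify and the edge-set representation are assumptions made for illustration.

```python
# (label, children) tuples; only the nodes shown in the figure.
TREE = ("main()", [
    ("print_extensions_w_for_stmt()", [
        ("os.path.splitext()", []),
        ("print", []),
    ]),
    ("print_extensions_w_map_func()", [
        ("get_extensions()", [
            ("map()", [
                ("lambda() at line 8", [
                    ("os.path.splitext()", []),
                ]),
            ]),
        ]),
        ("print", []),
    ]),
])


def dagify(node, edges=None):
    """Return the set of (parent label, child label) edges of the tree."""
    if edges is None:
        edges = set()
    label, children = node
    for child in children:
        # Identical labels produce identical edges, which merges the nodes.
        edges.add((label, child[0]))
        dagify(child, edges)
    return edges
```

The two os.path.splitext() nodes and the two print nodes each become a single node with two incoming edges. Note that merging by label alone would introduce a cycle if the traced program were recursive; this sketch does not handle that case.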
Content-and-context analysis for triaging
● A clone class (a), its shared items (b), and its distinct contents, or gap, (c)
● The distinct contents (c) share the same set of (sub-)contents (d) → (c) is another clone class.
● If (c) is merged before (a), (c) is no longer a gap of (a).
[Figure: labels (a) to (d) mark a clone class detected in markdown2's code (described later).]
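The terms "shared items" and "gap" can be illustrated with set operations. The markdown2 figure is not available here, so the sketch uses the running example instead; CONTENTS and shared_and_gaps are hypothetical names.

```python
# Contents of the two members of the clone class in the running example.
CONTENTS = {
    "print_extensions_w_for_stmt()": {"os.path.splitext()", "print"},
    "print_extensions_w_map_func()": {
        "get_extensions()", "map()", "lambda() at line 8",
        "os.path.splitext()", "print", "str.join()",
    },
}


def shared_and_gaps(contents):
    """Return the shared items and, per node, its distinct contents (gap)."""
    shared = set.intersection(*contents.values())
    gaps = {node: items - shared for node, items in contents.items()}
    return shared, gaps


SHARED, GAPS = shared_and_gaps(CONTENTS)
```

Here the for-statement version has an empty gap, while the map version's gap consists of get_extensions(), map(), the lambda, and str.join(). When the gaps of several members themselves share contents, they form the further clone class (c) described above.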
Tool prototype
Detection:
  Target program, inputs / test cases
  → Execution (Python interpreter; debugging / profiling APIs)
  → Execution trace extraction → Execution trace
  → String balloon generation → String balloons (Step 1)
  → Frequent item-set mining (Apriori) → Similar sets of contents (Step 2)
  → Redundant context removal → Code clones (Step 3)
Analysis:
  Visualization, metrics calculation
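The slides do not say which debugging / profiling API the prototype uses. As one possibility, the standard library's sys.setprofile can record call and return events of Python functions; the sketch below is an illustration of that API, not the prototype's tracer. The name record_calls and the event tuples are hypothetical.

```python
import sys


def record_calls(func, *args):
    """Run func(*args) and return a list of ("call" | "return", name)."""
    events = []

    def profiler(frame, event, arg):
        # "call" and "return" are reported for Python-level functions;
        # events for C functions ("c_call" etc.) are ignored in this sketch.
        if event in ("call", "return"):
            events.append((event, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        func(*args)
    finally:
        sys.setprofile(None)  # always uninstall the hook
    return events
```

A real tracer would also record module names, argument values, and return values, as the trace on the "Input" slide does.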
● Input: Python source code
● Uses a frequent item-set mining algorithm / implementation
– Apriori (www.borgelt.net/apriori.html)
● Heuristics / optimizations
– Max. depth of contents from a target node (default 5)
– Max. number of content items of a candidate node (default 25)
  ● Filters out the nodes with large contents, i.e., nodes near the root of the call tree
– Removal of basic, primitive functions
– ...
[Figure: content-and-context clone on a call graph]
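The two numeric heuristics can be sketched as follows, assuming that "max. depth" bounds how far below a node its contents are collected and that "max. number of content items" discards nodes whose contents exceed the limit. The default values 5 and 25 come from the slide; the function names and the (label, children) representation are hypothetical.

```python
def limited_contents(node, max_depth=5):
    """Collect labels of descendants at most max_depth levels below node."""
    _label, children = node
    contents = set()
    if max_depth > 0:
        for child in children:
            contents.add(child[0])
            contents |= limited_contents(child, max_depth - 1)
    return contents


def is_candidate(contents, max_items=25):
    # Nodes near the root tend to have large contents and are filtered out.
    return len(contents) <= max_items
```

Whether the prototype treats the limit of 25 as inclusive is not stated on the slide; the comparison above is one choice.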
Preliminary experiment
Detection was run for each value of the parameter “Max. number of content items of a candidate node”: 10, 15, …, 30.

Target product   Collection of execution sequences        # function calls   # unique labels
markdown2        Running 144 unit tests                   227K               1128
wxPython         Invoking a sample program “pySketch”     483K               1058
Results
[Figure: result charts not reproduced.]
● Exponential in the number of contents
● Too “peaky” for practical use
Summary
● Code-clone detection from dynamic information (an execution trace)
– Aimed at functional / dynamically typed PLs
● Content-and-context analysis for triage
● Algorithm, implementation, heuristics
● Preliminary experiment
– Targets: markdown2 and wxPython
– Results are peaky and sensitive to the parameter “Max. number of content items of a candidate node” → needs refinement
Omitted here; refer to the paper:
● Threats to validity
● Future plan
