Supporting program comprehension with source code summarization

Sonia Haiduc, Jairo Aponte, Andrian Marcus
Presented By: Mohammad Masudur Rahman

Contents
• Why Code Summarization?
• Thesis Statement
• Research Questions about summary
• Research Questions about tool
• Automatic Code Summarization
• Evaluation
• Experiments Conducted
• Pyramid Method
• Important Findings
• My Observation & Future Works

Why Code Summarization?
• Program comprehension accounts for 50% of all maintenance work
• Two extreme approaches: skimming the code and reading it thoroughly
• Skimming leads to misunderstanding
• Reading thoroughly is time consuming
• An intermediate solution: source code entities with a comprehensive textual description

Thesis Statement
• New idea: code summarization to help in program comprehension (PC)
• Applying text retrieval (TR) methods such as Latent Semantic Indexing (LSI) to source code summarization
• Combining structural information with the retrieved code summary to make it effective for realistic purposes

Research Questions of Code Summarization
• Summary should be automatically generated
• Generate summaries at different granularity levels: class, method, package, etc.
• Shorter than the source code
• Capture and preserve code semantics and structure: text as well as structure from the code
• Consistent structure: important items first

Research Questions of Code Summarization
• Summary should reflect the developer's understanding of the code
• Tool should allow the user to change the summary and should remember the user's choices in future summaries
• Tool should rebuild the summary if the code changes or developers provide feedback

Research Questions about Summarizer Tool
• Which summarization technique works best for source code?
• What type of structural information is necessary in a summary?
• Will the summary differ for different types of maintenance tasks?
• How long should it be?
• How closely will it resemble an actual summary?
• How do developers generate summaries?

Automatic Code Summarization
• Generate an extractive summary: the most important information extracted from the document

• Two types of information are extracted: lexical and structural
• Lexical information: identifiers and comments are extracted
• Common English words and programming language (PL) keywords are removed
• Identifiers are split into constituent words and stemming is performed
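
The lexical steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' tool: the stop-word and keyword lists are abbreviated assumptions, and `naive_stem` is a rough stand-in for a real stemmer such as Porter's.

```python
import re

# Assumed, abbreviated lists; a real tool would use complete ones.
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and"}
JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "new", "class", "if", "else", "for", "while", "null", "this"}


def split_identifier(identifier):
    """Split camelCase and snake_case identifiers into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        # An acronym run (e.g. "XML") or a capitalized/lowercase word.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [word.lower() for word in words]


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(identifiers_and_comments):
    """Split, filter and stem the lexical content of a code entity."""
    terms = []
    for text in identifiers_and_comments:
        for word in split_identifier(text):
            if word in ENGLISH_STOP_WORDS or word in JAVA_KEYWORDS:
                continue
            terms.append(naive_stem(word))
    return terms


print(extract_terms(["getAudioFileName", "the name of the playing file"]))
```

The identifier and comment passed in the last line are made-up inputs; the resulting word list is what a TR method would then index.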

• The extracted lexical information forms the text corpus of the code, on which TR methods (e.g. LSI) are used to get the n most important words
• Once retrieved, the n words are combined with structural information such as class name, method name, package name, parameter names and types, etc.
• How to apply structural information to the auto-generated summary is an important part

• A method name reflects the description of what the method does
• If the method name is ignored by TR, the tool can introduce it automatically
• Additional information can be added, such as user tags
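
One possible way to combine the retrieved words with structural information is sketched below. The `MethodSummary` record, the `build_summary` helper and the example package name are hypothetical, and putting the re-introduced method name words first is an assumption, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class MethodSummary:
    package_name: str
    class_name: str
    method_name: str
    parameters: list  # (name, type) pairs
    terms: list       # words selected by the TR step
    user_tags: list = field(default_factory=list)


def build_summary(top_terms, method_name_words, package_name, class_name,
                  method_name, parameters, user_tags=()):
    """Attach structural information to the words returned by TR."""
    # Words of the method name that the TR step did not return.
    missing = [w for w in method_name_words if w not in top_terms]
    # Assumption: re-introduced words go first, as the most important items.
    terms = missing + list(top_terms)
    return MethodSummary(package_name, class_name, method_name,
                         list(parameters), terms, list(user_tags))


summary = build_summary(
    top_terms=["audio", "file", "path"],
    method_name_words=["get", "file"],
    package_name="net.example.player",  # hypothetical package
    class_name="AudioFile",
    method_name="getFile",
    parameters=[("index", "int")],
)
print(summary.terms)
```

Here "get" is missing from the TR output, so it is added back, while "file" is already present and is not duplicated.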

Evaluation
• Two types: intrinsic and extrinsic
• Intrinsic: content evaluation, i.e. how closely the summary depicts the document or how close it is to a manually generated summary
• Metrics: precision, recall, pyramid method
• Extrinsic: how much utility and usability the summary has in supporting SE tasks such as concept location, impact analysis, software reuse, traceability link recovery, etc.
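
For the intrinsic metrics, precision and recall can be computed by treating the automatic and the manual summary as sets of terms. The sketch below assumes exactly that set-based reading; the term lists are made-up examples.

```python
def precision_recall(automatic_terms, manual_terms):
    """Compare an automatic summary against a manually generated one."""
    automatic = set(automatic_terms)
    manual = set(manual_terms)
    shared = automatic & manual
    # Precision: share of automatic terms that also appear in the manual summary.
    precision = len(shared) / len(automatic) if automatic else 0.0
    # Recall: share of manual terms that the automatic summary recovered.
    recall = len(shared) / len(manual) if manual else 0.0
    return precision, recall


automatic = ["play", "song", "file", "list"]
manual = ["play", "song", "player", "stop", "volume"]
print(precision_recall(automatic, manual))
```

Two of the four automatic terms occur in the manual summary, and two of the five manual terms are recovered.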

Experiments Conducted
• Pyramid method
• ATunes open source project, 12 methods
• 6 developers from different demographic locations: undergraduate students with 3 years of Java programming experience
• Developers were given a list of terms and had to choose the 5 terms that best suit each method; 60 minutes in total

• The corpus contains the whole code vocabulary
• Each method is a separate document
• LSI indexes the corpus against each method's terms
• The cosine measure between each corpus word and the method is computed, and the corpus words are ranked
• The top 5 words from the corpus are chosen
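
The ranking pipeline above can be illustrated with a simplified stand-in. The sketch below does not implement LSI: it uses a plain vector space model with tf-idf weights, where the cosine between a word's axis and a method vector reduces to the word's weight divided by the vector norm. Unlike LSI, it can only rank words that occur in the method itself. The `methods` data is invented for illustration.

```python
import math
from collections import Counter


def top_terms(methods, method_name, n=5):
    """Rank the words of one method by tf-idf cosine and return the n best."""
    num_docs = len(methods)
    # Document frequency: in how many methods each word occurs.
    doc_freq = Counter(w for words in methods.values() for w in set(words))
    term_freq = Counter(methods[method_name])
    weights = {
        word: count * math.log(num_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    # Cosine between a word's axis and the method vector is weight / norm.
    scores = {word: w / norm for word, w in weights.items()}
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:n]


# Each method is one document made of its preprocessed words.
methods = {
    "play": ["play", "song", "player", "song", "volume"],
    "stop": ["stop", "player", "song"],
    "save": ["save", "playlist", "file", "song"],
}
print(top_terms(methods, "play", n=2))
```

A word such as "song" that occurs in every method gets zero weight, so it falls to the bottom of the ranking even though it is frequent.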

Pyramid method
• Pyramid score = (sum of A's score) / (total score A could make)
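
A minimal sketch of the score, under assumptions the slide does not spell out: each term is weighted by the number of developers who chose it, A is the automatic summary, and the total A could make is the sum of the highest weights for a summary of the same size. The developer choices are invented for illustration.

```python
from collections import Counter


def pyramid_score(automatic_terms, developer_choices):
    """Score an automatic summary against the terms chosen by developers."""
    # Weight of a term = number of developers who selected it.
    weights = Counter(t for choice in developer_choices for t in set(choice))
    chosen = set(automatic_terms)
    achieved = sum(weights[t] for t in chosen)
    # Best possible total for a summary with the same number of terms.
    best = sum(sorted(weights.values(), reverse=True)[: len(chosen)])
    return achieved / best if best else 0.0


developer_choices = [
    ["play", "song", "volume"],
    ["play", "song", "stop"],
    ["play", "file", "volume"],
]
print(pyramid_score(["play", "file", "list"], developer_choices))
```

In this example the automatic summary earns 3 for "play", 1 for "file" and 0 for "list", against a best possible total of 3 + 2 + 2 = 7.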

Important Findings
• Pyramid scores were between 0.1 and 0.5, which was considered encouraging
• Words chosen by developers: 98.7% in method names, 88.9% in class names and 84.6% in parameter names
• Automatic summary terms: 20% in method names, 12.9% in class names and 30.7% in parameter names
• Structural information should be considered properly in automatic summaries
• Comment text was not included in the summary

My Observation & Future Works
• The corpus development technique is not well specified: nothing is said about protection against redundancy
• LSI focuses on term frequency rather than structural information, which produces poor scores
• During cosine measurement, the structural information of a term in the method could be considered to get better results
• There should be some heuristic measure for structural information
