An Insight into the Unresolved Questions at Stack Overflow

AN INSIGHT INTO THE UNRESOLVED QUESTIONS AT STACK OVERFLOW
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan
Presented By: Ripon K. Saha
12th Working Conference on Mining Software Repositories (MSR 2015) (Challenge Track)
Florence, Italy

RESEARCH PROBLEM: HIGHER RATE OF UNRESOLVED QUESTIONS
• Unresolved question: a question for which none of the answers was accepted as a solution.
• Exponential increase over the last 6 years.
• 2.4M (27%) unresolved out of 8.8M questions at SO (Feb 2015).
RQ1: Why do questions at Stack Overflow remain unresolved for a long time?
RQ2: Can we predict the questions for which none of the answers might be accepted as solutions?

ASPECTS OF STUDY
• Comparative analysis (RQ1) between resolved and unresolved questions using four aspects:
• Lexical Analysis
    • Code Readability (CR)
    • Text Readability (TR)
• Semantic Analysis
    • Topic Similarity (TS)
    • Topic Entropy (TE)
• User Behaviour Analysis
    • Answer Rejection Ratio (ARR)
    • Last Access Delay (LAD)
• Popularity Analysis
    • Votes for Questions (V)
    • Reputation of Question Owners (R)
Dataset Used
• 3,956 unresolved questions and 4,101 resolved questions
• Each question has at least 10 answers.

CODE & TEXT READABILITY
• Existing readability tools used: Buse and Weimer (TSE, 2010) and readability grade levels (Ponzanelli et al., ICSME, 2014).
• [Figure: distribution fitting curves of readability]
• No significant difference in readability between the two types of questions.
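The slides do not say which grade-level formula was used for text readability. As an illustration only, the sketch below computes the Automated Readability Index (ARI), one widely used grade-level measure. The function name and the simple whitespace and punctuation tokenisation are assumptions made for this example, not the authors' tooling.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return the ARI grade level of `text`.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43,
    where characters are letters and digits only.
    Tokenisation here is deliberately naive (an assumption of this sketch).
    """
    words = text.split()
    # Treat runs of '.', '!' or '?' as sentence boundaries.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(ch.isalnum() for ch in text)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    question_body = "How do I parse a date? I tried strptime but it fails."
    print(round(automated_readability_index(question_body), 2))
```

Scoring the text of resolved and unresolved questions with a function like this yields two score distributions that can then be compared, which is the kind of comparison the slide reports.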

TOPIC SIMILARITY & TOPIC ENTROPY
• Mallet (McCallum, 2002) used for topic modeling.
• Topic Similarity (Fig. a) between questions and their corresponding answers is identical for both question types.
• Topic Entropy (i.e., topic uncertainty) (Fig. b) is higher for unresolved questions: they are less specific about the topics of their requirements.
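A minimal sketch of the two semantic metrics, assuming each post is represented by a topic-probability vector such as the per-document topic proportions a topic model produces. Topic entropy is computed here as Shannon entropy. The slides do not name the similarity measure, so cosine similarity is used purely as an assumption. Both function names are hypothetical.

```python
import math
from typing import Sequence


def topic_entropy(topic_weights: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a topic distribution.

    Weights are normalised first; zero-weight topics contribute nothing.
    Higher entropy means the post is spread over more topics.
    """
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    probs = [w / total for w in topic_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two topic vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    focused_question = [0.90, 0.05, 0.03, 0.02]  # mostly one topic
    vague_question = [0.25, 0.25, 0.25, 0.25]    # spread over all topics
    print(topic_entropy(focused_question) < topic_entropy(vague_question))
```

Under this reading, a question whose probability mass is spread evenly over many topics has high entropy, which matches the slide's interpretation of high-entropy questions as less specific.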

USER BEHAVIOUR ANALYSIS
• [Figure: distribution fitting curves of rejection ratio]
• Owners of unresolved questions have a greater answer rejection ratio.
• Owners of unresolved questions visit Stack Overflow less frequently.
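The slides name the two behaviour metrics but do not define them. The sketch below uses assumed definitions for illustration only: ARR as the fraction of an owner's answered questions that have no accepted answer, and LAD as the number of days between the owner's last access and a reference date such as the data dump date. The field name `has_accepted_answer` and both function names are hypothetical.

```python
from datetime import date
from typing import Dict, List


def answer_rejection_ratio(answered_questions: List[Dict[str, bool]]) -> float:
    """Fraction of an owner's answered questions with no accepted answer.

    ASSUMED definition; the slides do not spell out how ARR is computed.
    """
    if not answered_questions:
        return 0.0
    rejected = sum(1 for q in answered_questions if not q["has_accepted_answer"])
    return rejected / len(answered_questions)


def last_access_delay(last_access: date, reference: date) -> int:
    """Days between the owner's last access and a reference date.

    ASSUMED definition; a larger value means a less frequent visitor.
    """
    return (reference - last_access).days


if __name__ == "__main__":
    history = [
        {"has_accepted_answer": True},
        {"has_accepted_answer": False},
        {"has_accepted_answer": False},
        {"has_accepted_answer": False},
    ]
    print(answer_rejection_ratio(history))
    print(last_access_delay(date(2015, 1, 1), date(2015, 2, 1)))
```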

POPULARITY ANALYSIS
• Used question votes and user reputation.
• Unresolved questions are less popular than resolved questions.
• Owners of unresolved questions have lower reputation.

PREDICTION MODELS (RQ2)
Algorithm            Metrics                 Overall Accuracy   Precision (Unresolved)   Recall (Unresolved)
J48                  {TE, ARR, LAD, V, R}    78.11%             78.70%                   76.10%
J48                  {ARR, LAD, V}           77.90%             79.60%                   73.90%
Logistic Regression  {TE, ARR, LAD, V, R}    73.58%             72.60%                   74.20%
Logistic Regression  {ARR, LAD, V}           73.28%             71.70%                   75.20%
Naïve Bayes          {TE, ARR, LAD, V, R}    71.69%             69.50%                   75.50%
Naïve Bayes          {ARR, LAD, V}           74.48%             80.00%                   64.00%
• Three prediction models used from WEKA with 10-fold cross-validation.
• Best overall accuracy: 78.11% (J48 with all five metrics), with 78.70% precision and 76.10% recall for unresolved questions.
• The identified features are satisfactorily predictive.
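The evaluation measures reported above can be sketched as follows. This is not the WEKA pipeline used in the study: it only illustrates how k-fold index splitting works and how overall accuracy, precision and recall are computed when "unresolved" is treated as the positive class. The function names are hypothetical, and the folds here are contiguous and unshuffled, a simplification of this sketch.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds over n items.

    Fold sizes differ by at most one; every item is tested exactly once.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def evaluate(actual: Sequence[str], predicted: Sequence[str],
             positive: str = "unresolved") -> Tuple[float, float, float]:
    """Return (overall accuracy, precision, recall) for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    actual = ["unresolved", "unresolved", "resolved", "resolved"]
    predicted = ["unresolved", "resolved", "resolved", "unresolved"]
    print(evaluate(actual, predicted))
```

In a full run, a classifier would be trained on each fold's training indices, its predictions on the test indices collected across all folds, and `evaluate` applied once to the pooled predictions.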

TAKE-HOME MESSAGE
• 27% of SO questions are unresolved, and their number is increasing almost exponentially.
• Unresolved questions are ambiguous, less focused and less popular.
• Owners of unresolved questions have lower reputation and visit SO less frequently.
• Identified features can satisfactorily separate unresolved from resolved questions.
• Findings can assist in question quality management at SO.
