AN INSIGHT INTO THE UNRESOLVED
QUESTIONS AT STACK OVERFLOW
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan
Presented By: Ripon K. Saha
12th Working Conference on Mining Software
Repositories (MSR 2015) (Challenge Track)
Florence, Italy
RESEARCH PROBLEM: HIGHER RATE OF UNRESOLVED QUESTIONS
• Unresolved question: none of the answers was accepted as a solution.
• Exponential increase over the last 6 years.
• 2.4m (27%) unresolved out of 8.8m questions at SO (Feb, 2015)
RQ1: Why do questions at Stack Overflow remain unresolved for a long time?
RQ2: Can we predict the questions for which none of the answers might be accepted as solutions?
ASPECTS OF STUDY
• Comparative analysis (RQ1) between questions using four aspects:
  • Lexical Analysis
    • Code Readability (CR)
    • Text Readability (TR)
  • Semantic Analysis
    • Topic Similarity (TS)
    • Topic Entropy (TE)
  • User Behaviour Analysis
    • Answer Rejection Ratio (ARR)
    • Last Access Delay (LAD)
  • Popularity Analysis
    • Votes for Questions (V)
    • Reputation of Question Owners (R)
Dataset Used
• 3,956 unresolved questions & 4,101 resolved questions
• Each question has at least 10 answers.
CODE & TEXT READABILITY
• Existing readability tools used: Buse and Weimer (TSE, 2010) and Readability Grade levels (Ponzanelli et al., ICSME, 2014)
• Distribution Fitting Curves of readability
• No significant difference in readability between the two types of questions.
TOPIC SIMILARITY & TOPIC ENTROPY
• Mallet (McCallum, 2002) for topic modeling
• Topic Similarity (Fig-a) between questions and corresponding answers is identical for both question types.
• Topic Entropy (i.e., topic uncertainty) (Fig-b) is higher for unresolved questions: unresolved questions are less specific about the topics of their requirements.
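The slide does not spell out how topic entropy is computed. A minimal sketch, assuming it is the Shannon entropy of a question's topic proportions as produced by a topic model such as Mallet, might look like this. The function name and the two example distributions are hypothetical and chosen only for illustration.

```python
import math


def topic_entropy(topic_probs):
    """Shannon entropy (in bits) of a topic distribution.

    Assumption: topic_probs holds the non-negative topic proportions
    of one question, as a topic model would report them.
    """
    total = sum(topic_probs)
    if total <= 0:
        raise ValueError("topic_probs must contain positive mass")
    entropy = 0.0
    for p in topic_probs:
        q = p / total  # normalize so the proportions sum to 1
        if q > 0:      # a zero-probability topic contributes nothing
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical distributions: one question dominated by a single topic,
# one spread evenly over four topics.
focused = [0.85, 0.05, 0.05, 0.05]
diffuse = [0.25, 0.25, 0.25, 0.25]

print(topic_entropy(focused))
print(topic_entropy(diffuse))
```

Under this assumption, a question spread evenly over many topics scores higher than one dominated by a single topic, which matches the reading of higher entropy as less topical focus.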
USER BEHAVIOUR ANALYSIS
• Distribution Fitting Curves of rejection ratio.
• Owners of unresolved questions have a greater answer rejection ratio.
• Owners of unresolved questions visit Stack Overflow less frequently.
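The slides name the Answer Rejection Ratio (ARR) but do not define it. The sketch below assumes one plausible reading: the fraction of an owner's answered questions for which no answer was accepted. The record layout and field names are hypothetical, not taken from the study.

```python
def answer_rejection_ratio(questions):
    """Fraction of an owner's answered questions with no accepted answer.

    Assumption: each question is a dict with the hypothetical keys
    'answer_count' (int) and 'accepted_answer_id' (None if nothing
    was accepted).
    """
    answered = [q for q in questions if q["answer_count"] > 0]
    if not answered:
        return 0.0  # nothing to reject if no question received an answer
    rejected = sum(1 for q in answered if q["accepted_answer_id"] is None)
    return rejected / len(answered)


# Hypothetical posting history of a single question owner.
history = [
    {"answer_count": 3, "accepted_answer_id": None},
    {"answer_count": 1, "accepted_answer_id": 101},
    {"answer_count": 5, "accepted_answer_id": None},
    {"answer_count": 2, "accepted_answer_id": None},
    {"answer_count": 0, "accepted_answer_id": None},  # unanswered, ignored
]

print(answer_rejection_ratio(history))
```

Unanswered questions are left out of the denominator here, since an owner cannot reject answers that were never posted; a different definition would change that choice.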
POPULARITY ANALYSIS
• Used Question Votes and User Reputation
• Unresolved questions are less popular than resolved questions.
• Owners of unresolved questions have lower reputation.
PREDICTION MODELS (RQ2)
Algorithm            Metrics                 Overall Accuracy   Precision (Unresolved)   Recall (Unresolved)
J48                  {TE, ARR, LAD, V, R}    78.11%             78.70%                   76.10%
J48                  {ARR, LAD, V}           77.90%             79.60%                   73.90%
Logistic Regression  {TE, ARR, LAD, V, R}    73.58%             72.60%                   74.20%
Logistic Regression  {ARR, LAD, V}           73.28%             71.70%                   75.20%
Naïve Bayes          {TE, ARR, LAD, V, R}    71.69%             69.50%                   75.50%
Naïve Bayes          {ARR, LAD, V}           74.48%             80.00%                   64.00%
• Three prediction models from WEKA, evaluated with 10-fold cross-validation.
• Best overall result (J48 with all five features): 78.11% prediction accuracy with 78.70% precision and 76.10% recall.
• The identified features are satisfactorily predictive.
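For readers unfamiliar with the three reported measures, the sketch below shows how overall accuracy and the precision and recall of the "unresolved" class can be derived from predicted and actual labels. It is not the WEKA pipeline used in the study; the function name and the toy labels are hypothetical.

```python
def evaluate(actual, predicted, positive="unresolved"):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # unresolved question correctly flagged
        elif p == positive:
            fp += 1  # resolved question wrongly flagged as unresolved
        elif a == positive:
            fn += 1  # unresolved question that was missed
        else:
            tn += 1  # resolved question correctly left alone
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only; these are not the study's data.
actual = ["unresolved"] * 4 + ["resolved"] * 4
predicted = ["unresolved", "unresolved", "unresolved", "resolved",
             "unresolved", "resolved", "resolved", "resolved"]

print(evaluate(actual, predicted))
```

In a 10-fold cross-validation such as the one named on the slide, each question is predicted once while it is held out of training, and measures like these are then summarized over all held-out predictions.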
TAKE-HOME MESSAGE
• 27% of SO questions are unresolved, and their number is increasing almost exponentially.
• Unresolved questions are ambiguous, less focused and less popular.
• Owners of unresolved questions have lower reputation and visit SO less frequently.
• The identified features can satisfactorily separate unresolved from resolved questions.
• Findings can assist in question quality management at SO.
THANK YOU!!

Editor's Notes

  • #2: Introduce yourself + introductory statements. Today, I am going to talk about the findings on unresolved questions from Stack Overflow.
  • #3: First, let's clarify what unresolved questions are. We call a question unresolved if it was posted at least 6 months ago but none of its answers has been accepted as a solution. Right now, 27% of SO questions are unresolved, and their number has increased almost exponentially over the last 6 years. So, in this paper we answer two research questions: Why do questions at Stack Overflow remain unresolved for a long time? Can we develop a model that predicts unresolved questions?
  • #4: To answer RQ1, we conduct a comparative study between unresolved and resolved questions (an answer accepted as the solution) from SO. We collect about 4K questions of each type, and compare them using four different analyses: Lexical analysis, which checks the readability of the code and text in the questions. Semantic analysis, which focuses on question-answer topic similarity and topic entropy. User behaviour analysis, which focuses on certain activities of the question owners. Popularity analysis, which compares question votes and user reputation for both types of questions.
  • #5: This slide shows the readability comparison between unresolved and resolved questions. Green refers to the readability distribution fit for resolved questions, and red means the same for unresolved questions. We find no significant difference in readability between the two types of questions.
  • #6: However, we got an interesting finding in the case of question topics. Using topic modeling and information theory, we calculate topic entropy (analogous to information entropy) for both resolved and unresolved questions. We found that topic entropy is higher for unresolved questions, which suggests that unresolved questions are less specific about their requirements, that is, less focused, which probably prevents them from receiving satisfactory answers.
  • #7: In the case of user behaviour analysis, we found that owners of unresolved questions are relatively reluctant to accept answers as solutions, which suggests they are either careless or skeptical. Our analysis also shows that they visit SO less frequently.
  • #8: In case of popularity analysis, we found that unresolved questions are less popular than resolved questions, and owners of unresolved questions are generally less reputed than the owners of resolved questions.
  • #9: Now, in order to answer RQ2, we use the features identified in RQ1 and collect them for both question types (8K). We then develop 3 prediction models using J48, Logistic Regression and Naïve Bayes from WEKA, and apply 10-fold cross-validation. We found an overall classification accuracy of 78.11%, which is impressive. In the case of unresolved questions, we found 78.70% precision and 76.10% recall, which suggests that the identified features are quite predictive.
  • #10: So, here are the take-home messages: 27% of SO questions are unresolved, and their number is increasing almost exponentially. Unresolved questions are ambiguous, less focused and less popular. Owners of unresolved questions have lower reputation and visit SO less frequently. The features identified in this study are quite predictive for unresolved questions, so they can be used for question quality management.
  • #11: Thanks for your time. Questions!!