Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
Introduction
Given a project's details (description and source code), the aim is to find functionally similar projects
Finding functionally similar projects is important for:
  Application/project recommendation
  Code re-use and rapid prototyping
  Discovering code plagiarism
CS@UVa 2
How do developers search for similar projects?
General-Purpose Search (Google)
Query: android browser
Tries to find applications relevant to the query
Not intended for searching source code
GitHub Search: android browser
Mostly keyword-based search over textual content
  Project name, description, etc.
Open and analyze jar, class, apk, etc.
Might rank irrelevant projects at the top
Limited textual content
Use source code content
  Augment the content with method, class, and API names
Model Workflow
GitHub Projects
Data Preprocessing (per feature): Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score-based word filtering
Feature Extraction: Description, Readme, Method & Class Name, API Package Name, API Class Name
Document Generation: all features combined
Search Interface
Candidate Project Documents
Query Project Documents
Document Similarity Computation: Word2Vec, WMD
Search Result: ranked list of similar projects
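The preprocessing stage named in the workflow might look roughly like the Python sketch below. It is an illustration only, not the authors' pipeline: the stopword list, the camelCase splitting of identifiers, the choice to drop low-scoring tokens, the threshold value, and all function names are assumptions. Stemming is left out because the standard library has no stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"a", "an", "the", "for", "to", "of", "and", "in", "on", "is"}


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase identifiers apart."""
    # Insert a space at lower-to-upper boundaries: "loadImage" -> "load Image".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_filter(docs, min_score):
    """Keep, per document, only the tokens whose TF-IDF score is at least min_score."""
    n_docs = len(docs)
    # Number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        kept = []
        for token in doc:
            tf = counts[token] / len(doc)
            idf = math.log(n_docs / doc_freq[token])
            if tf * idf >= min_score:
                kept.append(token)
        filtered.append(kept)
    return filtered


raw = [
    "Image gallery app for Lollipop, see loadImage",
    "Android photo viewer app",
]
docs = [remove_stopwords(tokenize(text)) for text in raw]
print(docs)
print(tfidf_filter(docs, min_score=0.01))
```

In this toy corpus the token "app" occurs in both documents, so its IDF is zero and the filter removes it, while the more distinctive tokens survive.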
How to measure document similarity?
Keyword-based cosine similarity over a bag of words (BOW)
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
No common keyword!
Cosine similarity = 0
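The zero score can be checked with a few lines of Python. This is a minimal sketch of bag-of-words cosine similarity using raw word counts; the function name and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = "image gallery app for lollipop".split()
doc2 = "android photo viewer".split()
print(cosine_similarity(doc1, doc2))  # 0.0: the documents share no word
```

Because the dot product only counts exact word matches, two descriptions of the same kind of app score 0 whenever their vocabularies do not overlap.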
How to measure document similarity?
Word Embedding
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
[Figure: words of Document 1 linked to words of Document 2 with weights w1, w2, w3, w4]
Word Embedding
“You shall know a word by the company it keeps” (J. R. Firth, 1957)
Example contexts for the word "upgrade":
  Open source upgrade path for Odoo/OpenERP
  Plugin to check for obvious upgrade points on the path to 3.0
  Codes related to upgrade project
  Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
Learn the word vector for "upgrade" from its surrounding words
Word2Vec
upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec
Input: Text corpus
Training: Word2Vec model
Output: Word vectors (word embedding), e.g. upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec Model
Document: image gallery app for android
Skip-gram
[Figure: skip-gram model over the words image, gallery, app, for, android]
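Skip-gram trains the model to predict the words around a given center word. The sketch below only generates the (center, context) training pairs for the example document; it does not train a model. The window size of 1 and the function name are assumptions chosen to keep the output short.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "image gallery app for android".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

A larger window produces more pairs per word, so each word is trained against a wider context.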
Example Word Embedding
In the embedding space, words with similar meanings cluster together
[Figure: word clusters in the embedding space: image, photo, picture, figure; sample, example, demo, illustration; upgrade, update, modify, change; install, setup, launch; dimension, size, height, length, range]
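One common way to see such clustering is to rank words by the cosine similarity of their vectors. The sketch below uses hand-picked 2-D vectors, which are purely hypothetical; real Word2Vec vectors are learned from a corpus and have many more dimensions. The function names are also assumptions.

```python
import math

# Hypothetical 2-D vectors chosen for illustration only.
VECTORS = {
    "image": (0.9, 0.1),
    "photo": (0.8, 0.2),
    "picture": (0.85, 0.15),
    "install": (0.1, 0.9),
    "setup": (0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`, most similar first."""
    scores = [(other, cosine(VECTORS[word], vec))
              for other, vec in VECTORS.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


print(most_similar("image"))
print(most_similar("install"))
```

With these toy vectors, "image" ranks "picture" and "photo" first, and "install" ranks "setup" first, which mirrors the clusters in the figure.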
Embedding for each word
How to get document/sentence-level similarity?
Word Mover’s Distance (WMD)
Word Mover’s Distance (WMD)
D1: image gallery app Lollipop
D2: android photo viewer
[Figure: animation over four steps, each showing three distances between words of D1 and D2]
  Step 1: 0.1, 0.5, 0.7
  Step 2: 0.35, 0.2, 0.6
  Step 3: 0.35, 0.15, 0.2
  Step 4: 0.4, 0.3, 0.1
Final step keeps the distances 0.15, 0.2, 0.1, 0.1
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
Smaller score means more similar
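The arithmetic on the slide can be reproduced with a short Python sketch. It reads the figure as "for each word in D1, keep the distance to its closest word in D2, then add these up", which is an assumption about the figure. This is a simplified nearest-word sum, not the full WMD: the full measure is formulated as an optimal transport problem over normalized word frequencies, which this sketch does not solve. The distance values echo the numbers on the slides, but which word pair each number belongs to is also an assumption.

```python
# Hypothetical word-to-word distances. The values echo the numbers on the
# slides; the assignment of each value to a word pair is an assumption.
DISTANCES = {
    "image":    {"android": 0.7,  "photo": 0.1,  "viewer": 0.5},
    "gallery":  {"android": 0.6,  "photo": 0.35, "viewer": 0.2},
    "app":      {"android": 0.15, "photo": 0.35, "viewer": 0.2},
    "lollipop": {"android": 0.1,  "photo": 0.4,  "viewer": 0.3},
}


def nearest_word_distance(doc_a, doc_b, distances):
    """Sum, over the words of doc_a, the distance to the closest word of doc_b."""
    return sum(min(distances[a][b] for b in doc_b) for a in doc_a)


d1 = ["image", "gallery", "app", "lollipop"]
d2 = ["android", "photo", "viewer"]
score = nearest_word_distance(d1, d2, DISTANCES)
print(round(score, 2))  # 0.55, matching the sum on the slide
```

In practice the word-to-word distances would come from the Word2Vec vectors, for example as Euclidean distances between them, rather than from a hand-written table.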
Preliminary Results
Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser
Summary
We proposed a model for finding functionally similar projects on GitHub
Used textual and source code content to construct a document for each project
Measured similarity between documents using Word Mover’s Distance
Leveraged Word2Vec word embeddings
References
Word2Vec: Gensim Python library
  https://radimrehurek.com/gensim/models/word2vec.html
WMD
  https://github.com/mkusner/wmd
Wikipedia dump
  https://dumps.wikimedia.org/enwiki/
GitHub projects data: the GHTorrent project
  http://ghtorrent.org/
Questions?


Editor's Notes

  • #2: Hello everyone, I am Masudur Rahman, a PhD student in the Department of Computer Science at the University of Virginia. I will present our work on finding similar projects in GitHub, in which we used Word Mover's Distance and Word2Vec word embeddings.
  • #3: Finding functionally similar projects is very important for app recommendation, code re-use, rapid prototyping, and plagiarism checking.
  • #4: There is no convenient way to search for similar projects using all of the project information (source code, description, readme, etc.). No surprise! Google tries to find applications that match the query; it is not intended to do project-level search for source code. We might augment the query to get more meaningful results for the developer, but the intent of these general-purpose search engines remains the same: they will try to find applications, not the source code that a developer might want to use.
  • #5: There is no convenient way to search for similar projects using all of the project information (source code, description, readme, etc.). We will see how we incorporated method, class, and API names to augment the textual information.
  • #8: Let's consider these two documents. They share no common keyword, so keyword-based cosine similarity gives 0, which would mean they are totally dissimilar. But they are not; they actually have the same meaning. In project documentation, developers often use different words for the same thing. Although these two documents are similar in meaning, plain keyword-based similarity cannot capture that.
  • #9: If we look closely, android and Lollipop are similar in meaning, and the same holds for the other keywords. Now, instead of matching words exactly, can we assign a value to a pair of words that indicates how similar they are in meaning? Yes, we can: learn a weight w, where a higher weight means strongly similar and a lower weight means less similar.
  • #10: Intuition: similar words tend to have the same context words. One of the most effective ways of exploiting this is Word2Vec.
  • #14 to #19: How can we convert this word-level information into sentence-level similarity? WMD leverages the word embeddings to compute sentence-level similarity.