using git metadata to
predict code bug risk
git risky
J. Henry Hinnefeld
hhinnefeld@civisanalytics.com
hinnefe2
DrJSomeday
Outline
1. intro to git
2. the model
3. git tips
intro to git & git metadata
Civis Analytics | Proprietary and Confidential
what is git?
1. a version control tool
2. a source of interesting
metadata about the history
of your codebase
what is a git commit?
git records changes to your code in units called ‘commits’.
● a single commit can contain changes to multiple files
● each commit gets assigned a unique identifier
● each commit also has metadata attached to it.
getting at git metadata
three git commands are particularly useful for extracting git
metadata:
● git log shows the history of changes to a project.
different arguments expose different data, e.g.
○ the commit message
○ which files and lines were changed
○ who made the change
● git diff shows the changes between two commits
● git blame identifies the last commit to modify a
particular line of code
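A minimal sketch of pulling `git log` metadata into Python. The field layout, the unit-separator delimiter, and the helper names `parse_log` and `read_log` are assumptions for illustration, not something shown in the talk.

```python
import subprocess

# Assumed field layout: full hash, author name, author date (ISO 8601), subject.
# %x1f emits an ASCII unit separator, which is unlikely to appear in a message.
LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"
FIELDS = ("hash", "author", "date", "subject")


def parse_log(text):
    """Turn `git log --pretty=format:<LOG_FORMAT>` output into a list of dicts."""
    commits = []
    for line in text.splitlines():
        parts = line.split("\x1f")
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        commits.append(dict(zip(FIELDS, parts)))
    return commits


def read_log(repo_path="."):
    """Run git log in repo_path and return one dict per commit, newest first."""
    out = subprocess.check_output(
        ["git", "log", f"--pretty=format:{LOG_FORMAT}"],
        cwd=repo_path,
        text=True,
    )
    return parse_log(out)
```

Calling `read_log()` inside a repository returns the commit history as plain dicts, which are easy to load into a dataframe later.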
building a code bug risk model
what can you do with git metadata?
build a model of code bug risk!
1. identify and label commits which introduced bugs
2. build commit-level features
3. train a binary classifier on the features and labels
4. score each new commit with the model using git hooks
step 1: identify commits which introduced bugs
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
● identify the lines which were changed in that commit
using git diff
● find the last commit to modify those lines using git blame
● label that commit as having introduced a bug
caveat: this depends heavily on having good ‘commit
hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
at the end we’ll go over some tips for better commit hygiene
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
git log -i --grep='bug' --grep='fix' --pretty=oneline --abbrev-commit
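Note that `--grep` matches 'bug' or 'fix' anywhere in the message, not only at the start. One way to apply the stricter "starts with" rule is to filter the one-line output in Python. This is a sketch; the function name `bugfix_commits` is hypothetical, and it assumes each line has the form `<hash> <subject>`.

```python
import re

# Case-insensitive prefix match; note it also accepts subjects such as "Fixture ...".
BUGFIX_PREFIX = re.compile(r"^(bug|fix)", re.IGNORECASE)


def bugfix_commits(oneline_log):
    """Return (hash, subject) pairs whose subject starts with 'bug' or 'fix'."""
    fixes = []
    for line in oneline_log.splitlines():
        commit_hash, _, subject = line.partition(" ")
        if commit_hash and BUGFIX_PREFIX.match(subject):
            fixes.append((commit_hash, subject))
    return fixes
```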
● identify the lines which were changed in that commit
using git diff
git diff <commit hash>^ <commit hash> -U0
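With `-U0` the diff has no context lines, so each hunk header of the form `@@ -start,count +start,count @@` names exactly the lines the fix touched. A rough sketch of extracting the pre-fix line ranges per file follows; the function name `changed_old_lines` is hypothetical, and quoted or renamed paths are not handled.

```python
import re

# Only the old-side ("-") start and count are captured; the count defaults to 1.
HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")


def changed_old_lines(diff_text):
    """Map each pre-fix file path to the (start, stop) ranges the fix changed."""
    ranges = {}
    path = None
    in_header = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_header, path = True, None
        elif in_header and line.startswith("--- "):
            old = line[4:]
            if old != "/dev/null":  # /dev/null means the file is new
                path = old[2:] if old.startswith("a/") else old
        else:
            match = HUNK.match(line)
            if match:
                in_header = False
                if path is None:
                    continue
                start = int(match.group(1))
                count = 1 if match.group(2) is None else int(match.group(2))
                if count > 0:  # count 0 is a pure addition: nothing to blame
                    ranges.setdefault(path, []).append((start, start + count - 1))
    return ranges
```

Pure additions are skipped because there are no earlier lines to blame for them.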
● find the last commit to modify those lines using git blame
git blame <file path> <commit hash>^ -L<start line #>,<stop line #>
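The default blame output is meant for people; for parsing, `git blame --porcelain` is easier because every blamed line is preceded by a header that starts with the full commit hash. The sketch below assumes that format; `blamed_commits` and `blame_range` are hypothetical helper names.

```python
import re
import subprocess

# Porcelain header: "<hash> <original line> <final line> [<lines in group>]".
# Source lines themselves are prefixed with a TAB, so they never match.
HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+")


def blamed_commits(porcelain_text):
    """Return the set of commit hashes named in porcelain blame output."""
    hashes = set()
    for line in porcelain_text.splitlines():
        match = HEADER.match(line)
        if match:
            hashes.add(match.group(1))
    return hashes


def blame_range(path, fix_hash, start, stop, repo_path="."):
    """Blame lines start..stop of path as they were just before the fix commit."""
    out = subprocess.check_output(
        ["git", "blame", "--porcelain", f"-L{start},{stop}", f"{fix_hash}^", "--", path],
        cwd=repo_path,
        text=True,
    )
    return blamed_commits(out)
```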
● label that commit as having introduced a bug
with a little python string parsing and some
subprocess.check_output calls we can repeat this
process for each bugfix commit, and so label all commits by
whether or not they introduced a bug.
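The labeling step itself can be sketched as below. The helper names (`git`, `list_commits`, `label_commits`) and the shape of `culprits_by_fix` (a dict from bugfix hash to the set of commits blamed for it) are assumptions for illustration, not the speaker's actual code.

```python
import subprocess


def git(*args, repo_path="."):
    """Run a git command and return its stdout as text."""
    return subprocess.check_output(["git", *args], cwd=repo_path, text=True)


def list_commits(repo_path="."):
    """Return every commit hash reachable from HEAD."""
    return git("rev-list", "HEAD", repo_path=repo_path).split()


def label_commits(all_commits, culprits_by_fix):
    """Label a commit 1 if any bugfix blamed it, otherwise 0."""
    culprits = set()
    for blamed in culprits_by_fix.values():
        culprits.update(blamed)
    return {commit: int(commit in culprits) for commit in all_commits}
```

With the blamed commits collected for each bugfix, `label_commits(list_commits(), culprits_by_fix)` gives one 0/1 label per commit.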
step 2: build commit-level features
if we want to build a model we need features and labels.
we just made some labels, so next let’s make some features.
git log --stat
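The slides do not list the exact features used. As one plausible example, the summary line that `git log --stat` prints for each commit already gives simple size features. The function name and feature names below are hypothetical.

```python
import re

# Matches e.g. " 3 files changed, 10 insertions(+), 2 deletions(-)".
# git omits the insertions or deletions part when its count is zero.
SUMMARY = re.compile(
    r"(\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def stat_features(summary_line):
    """Parse a --stat summary line into size features, or None if it is not one."""
    match = SUMMARY.search(summary_line)
    if match is None:
        return None
    files, insertions, deletions = (int(g) if g else 0 for g in match.groups())
    return {
        "files_changed": files,
        "insertions": insertions,
        "deletions": deletions,
        "lines_changed": insertions + deletions,
    }
```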
step 3: train a binary classifier on the features and labels
AUC = 0.76
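The slide does not name the classifier, so no training code is shown here. For reading the metric: AUC is the probability that a randomly chosen bug-introducing commit gets a higher score than a randomly chosen clean one. A small pure-Python sketch of that definition (the function name `auc` is just for illustration):

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 means the scores rank commits no better than chance, and 1.0 means every buggy commit outranks every clean one.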
again, this really requires good ‘commit hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
step 4: score each new commit with the model using git hooks
git hooks are small scripts in <your repo>/.git/hooks/ that
run automatically based on certain git actions.
some available hooks are:
● pre-commit
● commit-msg
● post-commit
for model scoring we want the post-commit hook:
<repo>/.git/hooks/post-commit
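A rough sketch of what such a post-commit hook could look like in Python. The scoring function is a toy placeholder standing in for the trained model, and `RISK_THRESHOLD`, `score_commit`, `risk_message`, and `head_features` are hypothetical names, not the talk's implementation.

```python
#!/usr/bin/env python3
import subprocess

RISK_THRESHOLD = 0.5  # hypothetical cut-off for flagging a commit


def score_commit(features):
    """Toy placeholder for the trained classifier: bigger commits score higher."""
    return min(1.0, features["lines_changed"] / 500.0)


def risk_message(commit_hash, score, threshold=RISK_THRESHOLD):
    """Format the line printed after each commit."""
    level = "HIGH" if score >= threshold else "low"
    return f"[git risky] {commit_hash[:7]}: bug risk {score:.2f} ({level})"


def head_features():
    """Collect the hash and a simple size feature for the commit just made."""
    commit_hash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    numstat = subprocess.check_output(
        ["git", "show", "--numstat", "--format=", "HEAD"], text=True
    )
    lines_changed = 0
    for line in numstat.splitlines():
        parts = line.split("\t")  # "<added>\t<deleted>\t<path>"; binary files use "-"
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            lines_changed += int(parts[0]) + int(parts[1])
    return commit_hash, {"lines_changed": lines_changed}


def main():
    commit_hash, features = head_features()
    print(risk_message(commit_hash, score_commit(features)))
    return 0

# In the installed hook file, finish with: raise SystemExit(main())
```

The hook file has to be executable (for example via `chmod +x`) or git will not run it.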
tips for better commit hygiene
why care about commit hygiene?
● clean commit histories can help you understand the
evolution of the codebase at a glance.
● it adds ‘documentation’ to every line of code: to
understand some confusing lines read the commit
messages of the commits which added those lines.
● it’s not that hard.
improving git commit messages
use the commit-msg hook to
● reject commit messages that are too short
● enforce standards around message tags
(e.g. BUG, TST, WIP, etc.)
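A minimal sketch of a commit-msg hook in Python. git passes the hook the path of the file holding the proposed message, and a non-zero exit status aborts the commit. The minimum length, the `TAG: subject` layout, and the function names are assumptions; only the tags BUG, TST, and WIP come from the slide.

```python
#!/usr/bin/env python3
import re

MIN_LENGTH = 10  # hypothetical minimum subject length
ALLOWED_TAGS = ("BUG", "TST", "WIP")  # extend with your team's own tags
TAG_PATTERN = re.compile(r"^(%s):\s+\S" % "|".join(ALLOWED_TAGS))


def check_message(message):
    """Return a list of problems with the message; an empty list means accept."""
    lines = [line for line in message.splitlines() if not line.startswith("#")]
    subject = lines[0].strip() if lines else ""
    problems = []
    if len(subject) < MIN_LENGTH:
        problems.append(f"subject is shorter than {MIN_LENGTH} characters")
    if not TAG_PATTERN.match(subject):
        problems.append("subject must start with one of: " + ", ".join(ALLOWED_TAGS))
    return problems


def main(argv):
    # argv[1] is the path to the file containing the commit message
    with open(argv[1], encoding="utf-8") as handle:
        problems = check_message(handle.read())
    for problem in problems:
        print("commit rejected: " + problem)
    return 1 if problems else 0

# In the installed hook file, finish with: raise SystemExit(main(sys.argv))
# (and add `import sys` at the top)
```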
what if I already made lots of edits to one file?
git add --patch <file>
if you’ve made lots of edits to a single file you can split those
edits into separate commits with the --patch option
what if I already have a messy commit history?
git rebase -i <base commit hash>
interactive rebasing lets you rewrite your commit history:
you can combine, split, reorder, and reword commits
with rebase -i it is super easy to make tons of small commits
as you work and then quickly combine them into a clean
commit history.
warning: only rebase code you haven’t shared with anyone
yet. <base commit hash> should be the most recent commit
that other people can see (e.g. the upstream master).
questions?
