Source code abstracts classification using CNN
Vadim Markovtsev, source{d}
goo.gl/sd7wsm
(view this on your device)
Plan
1. Motivation
2. Source code feature engineering
3. The Network
4. Results
5. Other work
Motivation
“Everything is better with clusters.”
Motivation
Customers buy goods, and software developers write code.
Motivation
So to understand developers, we need to understand what they do and how they do it.
Feature origins:
• Social networks
• Version control statistics
  • History
  • Style
• Source code
  • Algorithms
  • Dependency graph
  • Style
Motivation
Let's check how deep we can drill with machine learning on source code style.
Toy task: binary classification between two projects, using only features that
originate from code style.
Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible
Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for language bars
• Supports 400+ languages
Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only about 50% of them overlap with github/linguist)
• ≈90 token types (not all are used for every language)
Feature engineering
Pygments example:
# prints "Hello, World!"
if True:
    print("Hello, World!")
Feature engineering
Token.Comment.Single '# prints "Hello, World!"'
Token.Text '\n'
Token.Keyword 'if'
Token.Text ' '
Token.Name.Builtin.Pseudo 'True'
Token.Punctuation ':'
Token.Text '\n'
Token.Text '    '
Token.Keyword 'print'
Token.Punctuation '('
Token.Literal.String.Double '"'
Token.Literal.String.Double 'Hello, World!'
Token.Literal.String.Double '"'
Token.Punctuation ')'
Token.Text '\n'
Feature engineering
• Split the token stream into lines; each line contains ≤40 tokens
• Merge indents
• One-hot ("one against all") encoding of the token type, with the token value's length as the value
• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects naming style
• About 200 dimensions per token overall
• 8000 features per line (40 tokens × 200 dimensions), most are zeros
• Mean-variance normalization
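The encoding steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline used in the talk: the `Indent` type name, the helper names, and the tiny vocabulary built from the sample itself are hypothetical, "merge indents" is read as merging leading whitespace tokens into one, and the real scheme uses about 200 dimensions per token rather than the 7 that appear here. The input is the token stream from the Pygments example.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (token type, token text)
MAX_TOKENS = 40          # per-line cap taken from the slide
INDENT = "Indent"        # hypothetical type name for merged leading whitespace


def split_lines(stream: List[Token]) -> List[List[Token]]:
    """Cut the stream at newline tokens and merge leading whitespace runs."""
    lines: List[List[Token]] = []
    current: List[Token] = []
    for ttype, text in stream:
        if ttype == "Token.Text" and text == "\n":
            lines.append(current[:MAX_TOKENS])
            current = []
        elif ttype == "Token.Text" and text.isspace() and not current:
            current.append((INDENT, text))  # first whitespace token of a line
        elif ttype == "Token.Text" and text.isspace() and current[-1][0] == INDENT:
            current[-1] = (INDENT, current[-1][1] + text)  # extend the indent
        else:
            current.append((ttype, text))
    if current:
        lines.append(current[:MAX_TOKENS])
    return lines


def build_vocab(lines: List[List[Token]]) -> Dict[str, int]:
    """Map every token type seen in the sample to a dimension index."""
    types = sorted({ttype for line in lines for ttype, _ in line})
    return {ttype: i for i, ttype in enumerate(types)}


def encode_line(line: List[Token], vocab: Dict[str, int]) -> List[float]:
    """One-hot per token position, with the token length as the hot value."""
    dims = len(vocab)
    vec = [0.0] * (MAX_TOKENS * dims)
    for pos, (ttype, text) in enumerate(line):
        vec[pos * dims + vocab[ttype]] = float(len(text))
    return vec


STREAM: List[Token] = [
    ("Token.Comment.Single", '# prints "Hello, World!"'),
    ("Token.Text", "\n"),
    ("Token.Keyword", "if"),
    ("Token.Text", " "),
    ("Token.Name.Builtin.Pseudo", "True"),
    ("Token.Punctuation", ":"),
    ("Token.Text", "\n"),
    ("Token.Text", "    "),
    ("Token.Keyword", "print"),
    ("Token.Punctuation", "("),
    ("Token.Literal.String.Double", '"'),
    ("Token.Literal.String.Double", "Hello, World!"),
    ("Token.Literal.String.Double", '"'),
    ("Token.Punctuation", ")"),
    ("Token.Text", "\n"),
]

lines = split_lines(STREAM)
vocab = build_vocab(lines)
features = [encode_line(line, vocab) for line in lines]
```

Each line becomes a fixed-length vector of `MAX_TOKENS * len(vocab)` values, which is where the "most are zeros" observation comes from: a line with 7 tokens sets only 7 entries.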
Feature engineering
Though extracted, names as words are not used in this scheme.
We've checked out two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)
The Network
layer                      kernel  pooling  filters / units
convolutional              4x1     2x1      250
convolutional              8x2     2x2      200
convolutional              5x6     2x2      150
convolutional              2x10    2x2      100
all2all (fully connected)                   512
all2all (fully connected)                   64
all2all (fully connected)                   output
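To get a feel for the sizes involved, the table can be walked through with a small shape calculator. Everything not in the table is an assumption made for illustration: the input is taken as 50 lines × 40 tokens with about 200 channels, kernels are read as (lines × tokens), convolutions are assumed to use "same" padding so that only pooling shrinks the map, and pooling is assumed to round up on odd sizes. The slides do not state any of these choices, and biases are ignored.

```python
import math
from typing import List, Tuple

# (kernel, pooling, filters) for the four convolutional layers in the table.
CONV_LAYERS = [
    ((4, 1), (2, 1), 250),
    ((8, 2), (2, 2), 200),
    ((5, 6), (2, 2), 150),
    ((2, 10), (2, 2), 100),
]


def trace_shapes(height: int, width: int) -> List[Tuple[int, int, int]]:
    """Return (height, width, channels) after each conv + pooling stage.

    Assumes 'same' convolution padding and pooling that rounds up.
    """
    shapes = []
    for _kernel, (pool_h, pool_w), filters in CONV_LAYERS:
        height = math.ceil(height / pool_h)
        width = math.ceil(width / pool_w)
        shapes.append((height, width, filters))
    return shapes


def conv_weights(in_channels: int) -> List[int]:
    """Weights per convolutional layer (no biases) for a given input depth."""
    counts = []
    for (kernel_h, kernel_w), _pool, filters in CONV_LAYERS:
        counts.append(kernel_h * kernel_w * in_channels * filters)
        in_channels = filters
    return counts


shapes = trace_shapes(50, 40)       # assumed input: 50 LOT x 40 tokens
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
weights = conv_weights(200)         # assumed input depth: ~200 dimensions
```

Under these assumptions the last convolutional stage yields a 4 × 5 × 100 map, so the first all2all layer would see 2000 inputs. Different padding or input layout would change these numbers.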
The Network
Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
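For readers unfamiliar with the optimizer row, here is a minimal sketch of one gradient descent step with momentum and L2 regularization, using the learning rate, momentum, and L2 coefficient from the table. The function name and the toy objective are hypothetical, the L2 term is assumed to contribute `l2 * w` to the gradient, and the "weight decay 0.955" row is deliberately left out because the slides do not say how it is applied.

```python
def momentum_step(w: float, v: float, grad: float,
                  lr: float = 0.002, momentum: float = 0.5,
                  l2: float = 0.0005):
    """One update of gradient descent with momentum and L2 regularization."""
    g = grad + l2 * w             # assumed form: L2 penalty adds l2 * w
    v = momentum * v - lr * g     # velocity accumulates past gradients
    return w + v, v


# Toy objective (w - 3)^2 with gradient 2 * (w - 3).
w, v = 0.0, 0.0
for _ in range(5000):
    w, v = momentum_step(w, v, grad=2.0 * (w - 3.0))
```

With the L2 term included, the minimum of this toy problem shifts slightly from 3 to 6 / 2.0005, which is a small demonstration of how regularization pulls weights toward zero.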
The Network
• Merge all project files together and feed 50 LOT (lines of tokens) as a single sample.
• Training does not converge unless the files are randomly shuffled (sample borders are, of course, fixed).
• Batch size is 50.
• Truncate projects to the size of the smallest one, measured in LOT.
• Fragile to small hyperparameter deviations.
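The sampling procedure above might look roughly like the following sketch. It is one possible reading rather than the author's code: "random shuffling files" is interpreted as shuffling the order of whole files before merging, the final shuffle of labeled samples is an added assumption, and all function names are hypothetical. A file is represented as a list of LOT, with each LOT left opaque.

```python
import random
from typing import Any, List, Tuple

LOT_PER_SAMPLE = 50  # lines of tokens per sample, from the slide
BATCH_SIZE = 50      # from the slide


def make_samples(files: List[List[Any]], rng: random.Random) -> List[List[Any]]:
    """Shuffle whole files, merge them, and cut at fixed 50-LOT borders."""
    files = list(files)
    rng.shuffle(files)
    merged = [lot for f in files for lot in f]
    count = len(merged) // LOT_PER_SAMPLE  # drop the incomplete tail
    return [merged[i * LOT_PER_SAMPLE:(i + 1) * LOT_PER_SAMPLE]
            for i in range(count)]


def balanced_batches(project_a: List[List[Any]], project_b: List[List[Any]],
                     seed: int = 0) -> List[List[Tuple[List[Any], int]]]:
    """Build labeled batches, truncating the larger project to the smaller."""
    rng = random.Random(seed)
    a = make_samples(project_a, rng)
    b = make_samples(project_b, rng)
    n = min(len(a), len(b))
    labeled = [(s, 0) for s in a[:n]] + [(s, 1) for s in b[:n]]
    rng.shuffle(labeled)  # assumption: mix the two classes within batches
    return [labeled[i:i + BATCH_SIZE]
            for i in range(0, len(labeled), BATCH_SIZE)]
```

Truncating to the smaller project keeps the two classes balanced, so a 50% accuracy is the baseline for a classifier that learned nothing.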
The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• Database of features is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
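The last bullet deserves a word of explanation: subtracting the mean turns almost every zero into a non-zero value, so a normalized matrix can no longer be stored sparsely. One way around this, sketched below in plain Python, is to keep the stored rows sparse, compute per-feature statistics once, and normalize each row only when it is fed to the network. The dict-based row format and the function names are assumptions for illustration and are not taken from the talk.

```python
import math
from typing import Dict, List, Tuple

SparseRow = Dict[int, float]  # feature index -> non-zero value


def feature_stats(rows: List[SparseRow], dims: int) -> Tuple[List[float], List[float]]:
    """Per-feature mean and standard deviation, touching only non-zeros."""
    n = len(rows)
    total = [0.0] * dims
    total_sq = [0.0] * dims
    for row in rows:
        for i, x in row.items():
            total[i] += x
            total_sq[i] += x * x
    mean = [t / n for t in total]
    std = [math.sqrt(max(sq / n - m * m, 0.0))
           for sq, m in zip(total_sq, mean)]
    return mean, std


def normalize_row(row: SparseRow, mean: List[float], std: List[float]) -> List[float]:
    """Densify one row and normalize it; constant features map to 0."""
    return [(row.get(i, 0.0) - m) / s if s > 0 else 0.0
            for i, (m, s) in enumerate(zip(mean, std))]


rows = [{0: 2.0}, {0: 4.0, 2: 1.0}]
mean, std = feature_stats(rows, dims=3)
dense = [normalize_row(row, mean, std) for row in rows]
```

Only the rows of the current batch are ever held in dense form, so memory use stays proportional to the batch size rather than to the whole dataset.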
Results
projects              description                                size               accuracy
Django vs Twisted     Web frameworks, Python                     800ktok each       84%
Matplotlib vs Bokeh   Plotting libraries, Python                 1Mtok vs 250ktok   60%
Matplotlib vs Django  Plotting library vs web framework, Python  1Mtok vs 800ktok   76%
Django vs Guava       Python vs Java                             800ktok            >99%
Hibernate vs Guava    Java libraries                             3Mtok vs 800ktok   96%
Results
Conclusion: the network is likely to extract internal similarity in each project
and use it. Just like humans do.
If the languages are different, it is very easy to distinguish the projects (if only
because of the unique token types).
Results
Problem: how to get this for a source code network?
Other work
GitHub has ≈6M active users (≈3M after reasonable filtering). If we are
able to extract various features for each of them, we can cluster them. Vision:
1. Run K-means with K=45000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
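As a reminder of what the clustering step computes, here is a tiny k-means in plain Python. It uses ordinary Lloyd iterations, not the Yinyang variant that kmcuda implements, and it is nowhere near suitable for K=45000 on millions of users. The function names and the toy points are hypothetical, and the t-SNE step is not shown.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist2(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))


def kmeans(points: List[Point], initial: List[Point],
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd iterations starting from the given initial centroids."""
    centroids = [list(c) for c in initial]
    for _ in range(max_iter):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign = kmeans(points, initial=points[:2])
```

Yinyang k-means produces the same result as Lloyd's algorithm for the same initialization but skips many distance computations by maintaining bounds, which is what makes very large K practical.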
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}
Other work
Article.
[Figure: per-language chart covering the languages from ASP to xBase; legend: spaces, tabs, mixed. © source{d}, CC-BY-SA 4.0]
Other work
Article.
Other work
Before:
After:
Thank you
We are hiring!
