"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}

Source code abstracts classification using CNN
Vadim Markovtsev, source{d}

goo.gl/sd7wsm
(view this on your device)
Plan
1. Motivation
2. Source code feature engineering
3. The Network
4. Results
5. Other work

Motivation
"Everything is better with clusters."

Motivation
Customers buy goods, and software developers write code.

Motivation
So to understand the latter, we need to understand what they do and how
they do it. Feature origins:
• Social networks
• Version control statistics
    • History
    • Style
• Source code
    • Algorithms
    • Dependency graph
    • Style

Motivation
Let's check how deep we can drill with ML on source code style.
Toy task: binary classification between 2 projects, using only features that
originate in code style.

Feature engineering
Requirements:
1. Ignore text files, Markdown, etc.
2. Ignore autogenerated files
3. Support many languages with minimal effort
4. Include as much information about the source code as possible

Feature engineering
(1) and (2) are solved by github/linguist and source{d}'s own tool
• Used by GitHub for language bars
• Supports 400+ languages

Feature engineering
(3) and (4) are solved by Pygments
• Highlights source code (tokenizer)
• Supports 400+ languages (though only about 50% of them overlap with github/linguist)
• ≈90 token types (not all are used for every language)

Feature engineering
Pygments example:
    # prints "Hello, World!"
    if True:
        print("Hello, World!")

Feature engineering
Token.Comment.Single '# prints "Hello, World!"'
Token.Text '\n'
Token.Keyword 'if'
Token.Text ' '
Token.Name.Builtin.Pseudo 'True'
Token.Punctuation ':'
Token.Text '\n'
Token.Text '    '
Token.Keyword 'print'
Token.Punctuation '('
Token.Literal.String.Double '"'
Token.Literal.String.Double 'Hello, World!'
Token.Literal.String.Double '"'
Token.Punctuation ')'
Token.Text '\n'
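A token stream like the one above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Pygments package is installed; the helper name tokenize is ours, not part of Pygments. Recent Pygments releases target Python 3 in PythonLexer, so some token types (for example the one assigned to print) may differ from the listing on the slide.

```python
from pygments import lex
from pygments.lexers import PythonLexer

CODE = '# prints "Hello, World!"\nif True:\n    print("Hello, World!")\n'


def tokenize(source, lexer=None):
    """Return a list of (token_type, value) pairs for the source text.

    `tokenize` is a hypothetical helper name; `lex` and `PythonLexer`
    are the actual Pygments entry points.
    """
    lexer = lexer or PythonLexer()
    return list(lex(source, lexer))


tokens = tokenize(CODE)
for token_type, value in tokens:
    # repr() makes whitespace and newline tokens visible
    print(token_type, repr(value))
```

Swapping PythonLexer for another lexer class is what gives language coverage without per-language feature code.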

Feature engineering
• Split the token stream into lines; each line contains ≤40 tokens
• Merge indents
• "One against all" (one-hot) encoding of the token type, with the token value's length as the value
• Some tokens occupy more than 1 dimension, e.g. Token.Name reflects
naming style
• About 200 dimensions per token overall
• 40 tokens × 200 dimensions = 8000 features per line, most are zeros
• Mean-variance normalization
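The encoding described in the bullets above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the TYPE_INDEX table is a made-up three-entry vocabulary, each token type gets exactly one dimension, and indent merging and the multi-dimensional Token.Name encoding are left out. The constants 40 and 200 come from the slide.

```python
MAX_TOKENS = 40  # tokens per line, from the slide
DIMS = 200       # dimensions per token slot ("about 200" on the slide)

# Hypothetical mapping from token type name to a dimension index.
TYPE_INDEX = {"Token.Keyword": 0, "Token.Text": 1, "Token.Punctuation": 2}


def split_lines(tokens):
    """Split a (type, value) stream into lines of at most MAX_TOKENS tokens."""
    lines, current = [], []
    for ttype, value in tokens:
        current.append((ttype, value))
        if value.endswith("\n") or len(current) == MAX_TOKENS:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def encode_line(line):
    """Encode one line as a sparse {feature_index: value} dict.

    Each token slot owns DIMS consecutive features. The slot's single
    non-zero entry sits at the token type's index and holds len(value).
    """
    features = {}
    for slot, (ttype, value) in enumerate(line):
        if ttype in TYPE_INDEX:
            features[slot * DIMS + TYPE_INDEX[ttype]] = float(len(value))
    return features


stream = [
    ("Token.Keyword", "if"), ("Token.Text", " "), ("Token.Keyword", "True"),
    ("Token.Punctuation", ":"), ("Token.Text", "\n"), ("Token.Keyword", "print"),
]
lines = split_lines(stream)
print(encode_line(lines[0]))
```

A full line has MAX_TOKENS * DIMS = 8000 feature positions, and at most 40 of them are non-zero in this simplified form, which is why a sparse representation fits.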

Feature engineering
Though extracted, names as words are not used in this scheme.
We have explored two approaches to using this extra information:
1. LSTM sequence modelling (link to presentation)
2. ARTM topic modelling (article in our blog)

The Network
layer          kernel  pooling  number
convolutional  4x1     2x1      250
all2all                         512
all2all                         64
all2all                         output

The Network
Activation             ReLU
Optimizer              GD with momentum (0.5)
Learning rate          0.002
Weight decay           0.955
Regularization         L2, 0.0005
Weight initialization  σ = 0.1
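For readers unfamiliar with the optimizer settings, here is a minimal sketch of one gradient descent step with momentum and L2 regularization, using the values from the table. It shows the textbook update rule, not necessarily the exact variant used in the talk. Treating σ as the standard deviation of a zero-mean Gaussian initializer is an assumption, and the "weight decay 0.955" entry is left out because the slide does not say how it is applied.

```python
import random

LEARNING_RATE = 0.002
MOMENTUM = 0.5
L2 = 0.0005
INIT_SIGMA = 0.1  # assumed to be the std of a zero-mean Gaussian


def init_weights(n, seed=0):
    """Draw n initial weights from N(0, INIT_SIGMA^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, INIT_SIGMA) for _ in range(n)]


def momentum_step(weights, grads, velocity):
    """One update: v <- m*v - lr*(g + L2*w), then w <- w + v."""
    new_velocity = [
        MOMENTUM * v - LEARNING_RATE * (g + L2 * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


w, v = momentum_step([1.0], [0.5], [0.0])
print(w, v)
```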

The Network
• Merge all project files together; feed 50 LOT (lines of tokens) as a single
sample.
• Does not converge unless the files are randomly shuffled (sample borders are,
of course, fixed).
• Batch size is 50.
• Truncate projects to the smallest LOT count.
• Fragile to small metaparameter deviations.
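The sampling scheme in the bullets above can be sketched as follows. The function names make_samples and balance are hypothetical, and truncating at the sample level (rather than at the line level) is our reading of "truncate by the smallest LOT".

```python
import random

LOT_PER_SAMPLE = 50  # lines of tokens per sample, from the slide


def make_samples(files, rng):
    """Shuffle file order, merge all lines, and cut fixed 50-LOT samples.

    `files` is a list of files, each a list of encoded lines. Sample
    borders are fixed: every sample is the next 50 consecutive lines.
    """
    order = list(files)
    rng.shuffle(order)
    merged = [line for f in order for line in f]
    usable = len(merged) - len(merged) % LOT_PER_SAMPLE  # drop the ragged tail
    return [merged[i:i + LOT_PER_SAMPLE] for i in range(0, usable, LOT_PER_SAMPLE)]


def balance(samples_a, samples_b):
    """Truncate both projects to the sample count of the smaller one."""
    n = min(len(samples_a), len(samples_b))
    return samples_a[:n], samples_b[:n]


rng = random.Random(0)
project_a = make_samples([[0] * 70, [1] * 80], rng)  # 150 lines -> 3 samples
project_b = make_samples([[2] * 120], rng)           # 120 lines -> 2 samples
project_a, project_b = balance(project_a, project_b)
print(len(project_a), len(project_b))
```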

The Network
• Python 3 / TensorFlow / NVIDIA GPU
• Preprocessing is done on Dataproc (Spark)
• Database of features is stored in Cloud Storage
• Sparse matrices ⇒ normalization on the fly
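One plausible reading of the last bullet: subtracting the mean turns every zero into a non-zero value, so a normalized matrix can no longer be stored sparsely, and normalization is therefore applied per batch at feeding time. The sketch below illustrates that idea with plain dicts as sparse rows; the function names are hypothetical and the real pipeline is not shown in the slides.

```python
import math


def feature_stats(rows, n_features):
    """Per-feature mean and standard deviation over sparse rows.

    Each row is a {feature_index: value} dict; missing entries are zeros.
    """
    n = len(rows)
    sums = [0.0] * n_features
    squares = [0.0] * n_features
    for row in rows:
        for j, x in row.items():
            sums[j] += x
            squares[j] += x * x
    means = [s / n for s in sums]
    stds = [math.sqrt(max(q / n - m * m, 0.0)) for q, m in zip(squares, means)]
    return means, stds


def normalize_batch(batch, means, stds):
    """Densify a batch of sparse rows and standardize each feature.

    Features with zero deviation are only centered, to avoid dividing by zero.
    """
    return [
        [(row.get(j, 0.0) - m) / (s if s > 0 else 1.0)
         for j, (m, s) in enumerate(zip(means, stds))]
        for row in batch
    ]


rows = [{0: 2.0}, {0: 4.0, 1: 1.0}]
means, stds = feature_stats(rows, n_features=3)
dense = normalize_batch(rows, means, stds)
print(dense)
```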

Results
projects              description                 size              accuracy
Django vs Twisted     Web frameworks, Python      800ktok each      84%
Matplotlib vs Bokeh   Plotting libraries, Python  1Mtok vs 250ktok  60%
Matplotlib vs Django  Plotting libraries, Python  1Mtok vs 800ktok  76%
Django vs Guava       Python vs Java              800ktok           >99%
Hibernate vs Guava    Java libraries              3Mtok vs 800ktok  96%

Results
Conclusion: the network is likely to extract internal similarity in each project
and use it. Just like humans do.
If the languages are different, it is very easy to distinguish projects (at least
because of unique token types).

Results
Problem: how to get this for a source code network?

Other work
GitHub has ≈6M active users (and 3M after reasonable filtering). If we are
able to extract various features for each, we can cluster them. Vision:
1. Run K-means with K=45000 (using src-d/kmcuda)
2. Run t-SNE to visualize the landscape
BTW, kmcuda implements Yinyang k-means.
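For orientation, the clustering step can be illustrated with plain Lloyd's k-means in pure Python. This is a sketch of the baseline algorithm only: it is neither the Yinyang variant nor the kmcuda API, and it would be far too slow for K=45000 on millions of users.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Yinyang k-means produces the same result as Lloyd's algorithm but skips many distance computations by maintaining bounds, which is what makes very large K practical.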

Other work
Article.
[Figure: per-language chart with the legend "spaces", "tabs", "mixed"; the language axis runs alphabetically from ASP to xBase.]
© source{d} CC-BY-SA 4.0