XConf 2022 - Code As Data: How data insights on legacy codebases can fill the knowledge gap in complex modernization projects.

© 2022 Thoughtworks | Confidential
CaD: (Code As Data)
How data insights
on legacy codebases
can fill the knowledge gap
in complex modernization projects

Why is legacy modernization
always so challenging?
We usually rely on people who built
the software (i.e. DEVs) or the ones
dealing with it (e.g. SMEs, Users)
to collect knowledge about it
(e.g. what does it do, and why?).
But people come and go
and knowledge is scattered or lost…
What if source code could tell
some interesting facts
about itself too?
1998 2016

What People
Think Code Is
CaG
C(ode) a(s) G(ibberish)
What Developers
Think Code Is
CaL
C(ode) a(s) L(iterature)
What Computers
Think Code Is
CaD
C(ode) a(s) D(ata)

#include <stdio.h>
main( ) {
printf("hello, world");
}
Hello, World
Kernighan, Brian W.; Ritchie, Dennis M. (1978). The C Programming Language (1st ed.). Englewood Cliffs, NJ: Prentice Hall. ISBN 0-13-110163-3.
(developer view)

Hello, World
(computer view)
$ clang -Xclang -ast-dump -fsyntax-only hello-world.c
[Figure: hello-world.c laid out on a line/column grid, annotated with the labels: where, who, what, token, syntax, semantics, AST (Abstract Syntax Tree)]

TYPE              ID              PARENT_ID       FILE           LINE  COLUMN  TEXT
FunctionDecl      0x7fedd489f830                  hello-world.c  3             main
CompoundStmt      0x7fedd489fa10  0x7fedd489f830  hello-world.c  3     12
CallExpr          0x7fedd489f9b8  0x7fedd489fa10  hello-world.c  4     3
ImplicitCastExpr  0x7fedd489f9a0  0x7fedd489f9b8  hello-world.c  4     3
DeclRefExpr       0x7fedd489f8d0  0x7fedd489f9a0  hello-world.c  4     3       printf
ImplicitCastExpr  0x7fedd489f9f8  0x7fedd489f9b8  hello-world.c  4     10
ImplicitCastExpr  0x7fedd489f9e0  0x7fedd489f9f8  hello-world.c  4     10
StringLiteral     0x7fedd489f928  0x7fedd489f9e0  hello-world.c  4     10      Hello, World!\n
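The idea of turning a syntax tree into table rows can be sketched in a few lines. This is only an illustration, not the tooling used in the talk: it uses Python's built-in `ast` module as a stand-in for clang's AST dump, sequential integers replace clang's node addresses, and `SOURCE` and `flatten` are hypothetical names.

```python
import ast

# Illustrative input: a Python "hello, world" instead of the C one.
SOURCE = 'print("hello, world")\n'

def flatten(source, filename="hello_world.py"):
    """Return one row per AST node: a flat, table-like view of the tree."""
    tree = ast.parse(source, filename=filename)
    rows = []

    def visit(node, parent_id):
        node_id = len(rows)  # sequential ids stand in for clang's addresses
        if isinstance(node, ast.Name):
            text = node.id
        elif isinstance(node, ast.Constant):
            text = repr(node.value)
        else:
            text = ""
        rows.append({
            "type": type(node).__name__,
            "id": node_id,
            "parent_id": parent_id,
            "file": filename,
            # Some nodes (e.g. Module, Load) carry no position at all.
            "line": getattr(node, "lineno", None),
            # Note: Python columns are 0-based, clang's are 1-based.
            "column": getattr(node, "col_offset", None),
            "text": text,
        })
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(tree, None)
    return rows

rows = flatten(SOURCE)
for row in rows:
    print(row)
```

Once the tree is a list of rows, it can be filtered, joined, and pivoted like any other dataset.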
Hello, World
(data view)
$ clang -Xclang -ast-dump -fsyntax-only hello-world.c
[Figure: the same line/column grid of hello-world.c and AST labels as in the computer view]

Java
class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}
C#
using System;
class Program
{
    public static void Main(string[] args)
    {
        Console.WriteLine("Hello, world!");
    }
}
Python
print("Hello, world!")
Ruby
puts "Hello, world!"
Scala
object HelloWorld extends App {
    println("Hello, world!")
}
ASP.NET
Response.Write("Hello World!");
Lisp
(princ "Hello, world!")
Haskell
main = putStrLn "Hello, world!"
Malbolge
('&%:9]!~}|z2Vxwv-,POqponl$Hjig%eB@@>}=<M:9wv6WsU
2T|nm-,jcL(I&%$#"
`CB]V?Tx<uVtT`Rpo3NlF.Jh++FdbCBA@?]!~|4XzyTT43Qsq
q(Lnmkj"Fhg${z@>
Hello, (all) World
there is a parser for every programming language…
abb, abnf, acme, agc, alef, algol60, alloy, alpaca, angelscript, antlr, apex, apt, argus, arithmetic, asl, asm, asn, aspectj, atl, b, basic, bcl, bcpl,
bdf, bibcode, bnf, brainflak, brainfuck, c, calculator, callable, capnproto, cayenne, clf, clif, clojure, clu, cmake, cobol85, cookie, cool, cpp,
cql, cql3, creole, csharp, css3, csv, ctl, cto, dart2, databank, dcm, dgol, dice, dif, doiurl, dot, edif300, edn, erlang, fasta, fdo91, fen, flatbuffers, flowmatic,
focal, fol, fortran77, fusion-tables, gdscript, gedcom, gff3, gml, golang, graphql, graphstream-dgs, gtin, guido, guitartab, haskell, html, http, hypertalk,
icalendar, icon, idl, inf, informix, infosapient, iri, iso8601, istc, itn, janus, java, javadoc, javascript, joss, jpa, json, json5, karel, kirikiri-tjs,
kotlin, kquery, kuka, lambda, lark, lcc, less, limbo, lisa, logo, lolcode, loop, lpc, lrc, ltl, lua, lucene, matlab, mckeeman-form, mdx, memcached_protocol,
metamath, metric, microc, modelica, modula2pim4, molecule, moo, morsecode, mps, muddb, mumath, mumps, muparser, nanofuck, newick, oberon, objc,
oncrpc, orwell, p, parkingsign, pascal, pcre, pddl, pdn, peoplecode, pgn, php, pii, pike, pl0, plucid, ply, pmmn, postalcode, powerbuilder, powerbuilderdw,
powerquery, prolog, promql, propcalc, properties, protobuf2, protobuf3, prov-n, python, qif, quakemap, r, racket-bsl, racket-isl, rcs, redcode, refal,
rego, restructuredtext, rexx, rfc1035, rfc1960, rfc3080, rfc822, robotwars, romannumerals, rpn, ruby, rust, scala, scotty, scss,
sexpression, sgf, sharc, sici, sickbay, sieve, smalltalk, smiles, smtlibv2, snobol, snowball, solidity, sparql, spass, sql, stacktrace, stellaris, stl,
stringtemplate, suokif, swift-fin, swift, tcpheader, teal, telephone, terraform, thrift, tiny, tinybasic, tinyc, tinymud, tinyos_nesc, tl, tnsnames, tnt, toml, trac,
tsv, ttm, turing, turtle-doc, turtle, unicode, unreal_angelscript, upnp, url, useragent, v, vb6, vba, velocity, verilog, vhdl, vmf, wat, wavefront, webidl, wkt, wln,
wren, xml, xpath, xsd-regex, xyz, z
open-source ANTLR grammars available at https://github.com/antlr/grammars-v4
…or you can use

Compilers use all metadata to translate
code into executable instructions.
What is all this for?
Is there anything that can be leveraged
by Business Analysts, Project
Managers, or IT managers at large*?
Static code analysis tools (e.g. linters)
use AST metadata to identify potential
issues (e.g. programming errors, bugs,
non-idiomatic code, suspicious
constructs) and to compute metrics.
IDEs use filtered metadata (e.g.
variables, functions, classes, methods) to
provide navigation, hints, and code
completion.
DEVs actually use the same metadata
(unconsciously) to read the code!
* and DEVs too…
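To make the linter case concrete, here is a minimal sketch of a static check that works purely on AST metadata. It again uses Python's built-in `ast` module for illustration; the two rules, the `SAMPLE` snippet, and the `lint` function are assumptions chosen for the example, not rules taken from any specific tool.

```python
import ast

# Hypothetical code under inspection (line 1 is the empty line after ''').
SAMPLE = '''
def price(item):
    try:
        return item["price"] * 1.2
    except:
        return None

def check(x):
    if x == None:
        return False
    return True
'''

def lint(source):
    """Return (line, message) pairs for two suspicious constructs."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Rule 1: an except clause with no exception type swallows everything.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append((node.lineno, "bare except"))
        # Rule 2: comparing to None with == or != instead of is / is not.
        elif isinstance(node, ast.Compare):
            for op, right in zip(node.ops, node.comparators):
                if (isinstance(op, (ast.Eq, ast.NotEq))
                        and isinstance(right, ast.Constant)
                        and right.value is None):
                    findings.append((node.lineno, "comparison to None with ==/!="))
    return sorted(findings)

for line, message in lint(SAMPLE):
    print(f"line {line}: {message}")
```

The same node type and position metadata that drives these checks is what the rest of the deck reuses for non-developer audiences.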

Legacy Modernization Challenges… where Agile helps
Known Knowns (KK): Identified Knowledge
Known Unknowns (KU): Identified Risk
Unknown Knowns (UK): Untapped Knowledge
Unknown Unknowns (UU): Unidentified Risk
Proactively: discoveries, inceptions, user stories, acceptance criteria; spikes, RAIDs, modernization patterns
Reactively: cross-functional teams, short iterations, IPMs
Agile tools & techniques help
to proactively address KK & KU
and reactively UK & UU.

Legacy Modernization Challenges… where can CaD help?
CaD can help to proactively mitigate
the risks about UK & UU
(Unknown Knowns: Untapped Knowledge; Unknown Unknowns: Unidentified Risk).

Example n. 1 - Unknown Knowns
project-level support for BAs & tech analysis

Project-level risk mitigation
Use Case: Modernization of a Pricing Engine
We were asked to replace a pricing engine
under development for the past 30 years.
We went through an inception and several workshops
with stakeholders, SMEs, DEVs.
We collected all the available knowledge (KK),
and identified all the grey areas that would require
further investigations (KU).
Are these really all
the business
rules?
We found out that business rules were encoded
as table rows (e.g. exception/inclusion rules)
or field values (e.g. operation rules and values),
referenced and manipulated inside legacy code.
SMEs and DEVs told us that there are only 60 tables
to care about…

Proactively untap knowledge
[Diagram: inception cycle. Steps and roles as labelled on the slide:]
● Inception
● Parse legacy code, explore legacy code, identify referenced fields, identify missing tables/proc (DEVs, BAs)
● Explore grey areas (BAs)
● Trigger SMEs conversations (BAs)
● Consolidate SMEs knowledge (BAs)
● Refine the project scope (PO)
CaD Pipeline & Tools
Goals:
● Tactical: it should not require a huge
investment in time and resources
(i.e. DEVs should not have to become legacy
code experts)
● Pragmatic: just search for possible clues
(e.g. tables, fields, procedures not mentioned
in the workshops)
● Accessible: BAs and SMEs should be able to
use and explore the outcomes
(i.e. use tools they already know)
Use Case: Modernization of a pricing engine

CaD Pipeline Goals
● Tactical: ~300 LoC of Clojure leveraging
an existing open-source ANTLR-based
parser for the legacy language.
● Pragmatic: leveraging semantic
features of the legacy language
to filter tokens (never underestimate
the expressiveness of an old
programming language)
● Accessible: the output was a
spreadsheet that could be easily
filtered by table and column name,
or explored with pivot tables.
All the tokens were connected
(via Excel hyperlinks) to tables
documentation and specific
line/column of the source code
in VS Code (with syntax coloring
thanks to an open-source plugin).
[Diagram: CaD pipeline]
● DB Catalog → Parse → Tables & Columns Metadata (Tables/Fields Names): 1.1k tables, 11k fields
● Table List (SMEs) → Parse → Project Scope: 60 tables, ??? fields
● Source Code → Parse: 4M LoC, 22k LoC, 117k tokens (this stage carries the "Project Scope" label on the slide)
● Filter & Merge → Tokens referencing Tables & Fields: 128 tables, 1k fields, 4.7k tokens
● Outputs: XLSX in Excel (BAs) + Online Docs in the Browser (BAs/SMEs) + VS Code with a 4th gen language VS Code plugin (DEVs/SMEs)
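The "Filter & Merge" stage can be pictured with a small sketch. Everything here is hypothetical: the catalog, the SME scope, the token list, and the naive case-insensitive name matching are assumptions for illustration (the original pipeline was written in Clojure and relied on semantic features of the legacy language to filter tokens).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    file: str
    line: int
    column: int
    text: str

# Hypothetical DB catalog: table name -> field names.
CATALOG = {
    "PRICE_LIST": {"ITEM_ID", "AMOUNT", "CURRENCY"},
    "DISCOUNT_RULE": {"RULE_ID", "PERCENT"},
    "AUDIT_LOG": {"EVENT", "CREATED_AT"},
}
# Hypothetical list of tables the SMEs said were in scope.
SME_SCOPE = {"PRICE_LIST"}
# Hypothetical tokens produced by the source code parser.
TOKENS = [
    Token("calc.src", 10, 5, "price_list"),
    Token("calc.src", 10, 16, "amount"),
    Token("calc.src", 42, 9, "discount_rule"),
    Token("calc.src", 42, 23, "percent"),
    Token("calc.src", 57, 1, "total"),  # matches nothing in the catalog
]

def filter_and_merge(tokens, catalog, scope):
    """Keep only tokens naming a catalog table or field, tagged with scope."""
    owners_by_field = {}
    for table, fields in catalog.items():
        for field in fields:
            owners_by_field.setdefault(field, set()).add(table)
    rows = []
    for token in tokens:
        name = token.text.upper()
        if name in catalog:
            matches = [(name, None)]
        else:
            matches = [(table, name)
                       for table in sorted(owners_by_field.get(name, ()))]
        for table, column in matches:
            rows.append({
                "text": token.text,
                "line": token.line,
                "column": token.column,
                "table name": table,
                "column name": column,
                # A field name owned by several tables cannot be resolved here.
                "ambiguous term": len(matches) > 1,
                "in scope": table in scope,
            })
    return rows

rows = filter_and_merge(TOKENS, CATALOG, SME_SCOPE)
missing_tables = {row["table name"] for row in rows if not row["in scope"]}
print(sorted(missing_tables))
```

Tables that the code references but the SMEs did not list are exactly the clues worth bringing back to the workshops.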

Why Clojure?
Clojure is a fast modern Lisp that runs on top of the JVM (and CLR/V8 too).
Easy interoperability
with Java libraries
Easy access to ANTLR
objects and attributes, and to
XLSX libraries to read/write
large files.
Fast in-memory parallel
data transformations
Clojure transducers and
core.async libraries provide
easy & fast parallel
in-memory transformations
without requiring huge
resources or infra.
REPL-driven
development
The REPL allows an instant
feedback workflow that can
dramatically speed up
exploring Java libraries and
data structures.
because we love parentheses ;)

What it looks like
[Screenshots: an Excel Spreadsheet (BAs) where each Token is hyperlinked to the VS Code editor showing the token context (DEVs/SMEs) and to the table's online docs (BAs/SMEs)]
Dataset (built from Source Code, DB, and SME inputs)
● parse unit path: source file path
● file path: original source file path
(may be different in case of include
file)
● line: token line inside the source file
● column: token starting column
inside the line
● source docs: link to VS Code to
highlight the token inside the source
file
● type: token semantics tag
● text: token actual text
● node id: token id (parse unit context)
● parent id: token AST parent
● level: token AST indentation
● procedure id: procedure uuid
● procedure name: procedure name
● table name: matching table
● column name: matching column
● table docs: link to table’s online docs
● ambiguous term: true/false
● in scope: true/false
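One way the two hyperlink columns could be produced is sketched below. It assumes the `vscode://file/<absolute path>:<line>:<column>` URL form that VS Code registers for opening files, and Excel's `HYPERLINK(url, label)` worksheet function; the documentation URL, file path, and helper names are hypothetical, and the slides do not say how the original links were built.

```python
from urllib.parse import quote

def vscode_link(path, line, column):
    """Build a vscode:// URL for an absolute POSIX path at line:column."""
    return f"vscode://file{quote(path)}:{line}:{column}"

def excel_hyperlink(url, label):
    """Build an Excel HYPERLINK formula; double quotes in the label are doubled."""
    safe_label = label.replace('"', '""')
    return f'=HYPERLINK("{url}","{safe_label}")'

# Hypothetical dataset row, using a subset of the columns listed above.
row = {
    "file path": "/src/pricing/calc.p",
    "line": 120,
    "column": 9,
    "text": "PRICE_LIST",
    "table name": "PRICE_LIST",
    "in scope": True,
}
row["source docs"] = excel_hyperlink(
    vscode_link(row["file path"], row["line"], row["column"]),
    "open in VS Code",
)
# Hypothetical documentation site; replace with the real table docs URL.
row["table docs"] = excel_hyperlink(
    "https://docs.example.invalid/tables/" + quote(row["table name"]),
    row["table name"],
)

print(row["source docs"])
print(row["table docs"])
```

Keeping the links as plain formulas means BAs can filter and pivot the sheet with the tools they already know while still reaching the exact line of code.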

Are these really all
the business
rules?
Use Case: Modernization of a pricing engine
CaD outcomes
● Tables: 40% more tables in scope
(some were edge cases, others seldom used)
● Fields: scope reduced to 36% of fields
(most fields were used for other purposes)
● Business Rules: whenever there was
a computation issue we could go
to the exact point in the source code
to clarify assumptions and behaviors

Example n. 2 - Unknown Unknowns
program-level support for legacy modernization

Program-level risk mitigation
Use Case: legacy ERP modernization
We were asked to replace an existing monolithic
on-prem ERP-like system made of several modules,
and under development for the past 30 years.
We went through a discovery and several workshops
with DEVs, OPSs, DBAs, SMEs, and business
stakeholders.
We defined a target functional & tech architecture
(KK), and identified modernisation patterns & RAIDs
(KU) with tentative mitigations.
how much
is it going to cost?
where should
we start from?
We found out that the system was integrated
with several business processes, exchanging data
with many applications, and everybody was scared
of breaking something…

There should be a
wire somewhere on
your left…
Or maybe it’s on the
right… but don’t cut
the other ones!

We collected an amazing
amount of information
inside stickies
of different colors
and shapes.
What if we could
translate stickies
into data?

Discovery & Workshops
Proactively uncover unknown risks
● Consolidate Findings (BAs)
● Refine the program roadmap (PO)
● Learn from mistakes (BAs)
● Start several Inceptions/deliveries (BAs)
Use Case: legacy ERP modernization
Goals:
● Strategic: it should help define a long-term
plan backed by data and KPIs that can evolve
over time.
● Comprehensive: it should cover the entire
application landscape, not just a single project.
● Flexible: it should quickly provide answers to basic
questions, but also support further investigations.
CaD Pipeline & Tools
● Convert Stickies to Data (DEVs)
● Parse Code & Merge with Stickies Data (DEVs)
● Explore & Compute KPIs (DEVs)
● Collect Projects Metrics (BAs)

from stickies… to data (visualization)
[Diagram: the monolith drawn as a grid of Areas 1 to 4, each containing Modules M1 to M3]
1. We split the monolith into
logical Areas & Modules
(both existing and new ones)

[Diagram: a Tables layer added to the Areas & Modules grid]
2. We map Tables belonging to
just a single Module (if any)

[Diagram: an APIs layer added next to the Tables layer]
3. We map APIs belonging to just
a single Module (if any)

[Diagram: a Source Code layer added next to the APIs and Tables layers]
4. We parse source code
to identify chains of calls
(who calls whom)
and access to tables
(who reads/writes data where)

[Diagram: APIs, Tables, and Source Code layers connected across the Areas & Modules grid]
5. We map APIs
to source code entry points
(e.g. functions)
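Steps 4 and 5 can be pictured as a small graph problem. The sketch below is not the Pharo-based tooling from the talk: the call graph, table accesses, API mapping, and function names are all made up for illustration, and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical "who calls whom" edges extracted from parsed source code.
CALLS = {
    "api_get_price": ["compute_price"],
    "compute_price": ["load_rules", "apply_discount"],
    "load_rules": [],
    "apply_discount": [],
    "legacy_report": ["load_rules"],
}
# Hypothetical "who reads/writes data where" facts.
TABLE_ACCESS = {
    "load_rules": {("DISCOUNT_RULE", "read")},
    "apply_discount": {("PRICE_LIST", "read"), ("AUDIT_LOG", "write")},
    "legacy_report": {("AUDIT_LOG", "read")},
}
# Hypothetical mapping of APIs to source code entry points.
API_ENTRY_POINTS = {"GET /price": "api_get_price"}

def reachable(start, calls):
    """Breadth-first walk of the call graph from one entry point."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

def tables_behind_api(api, entry_points, calls, table_access):
    """Collect every (table, mode) pair touched downstream of an API."""
    touched = set()
    for procedure in reachable(entry_points[api], calls):
        touched |= table_access.get(procedure, set())
    return touched

print(sorted(tables_behind_api("GET /price", API_ENTRY_POINTS, CALLS, TABLE_ACCESS)))
```

Following the trail from an API down to the tables it touches is what lets a sticky note about an API be connected to concrete code and data.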

we can now explore
each area, module, table,
procedure, or API and
follow interactively
all the trails that connect
stickies to source code.

CaD Pipeline Goals
● Strategic: by integrating workshop
outcomes with technical catalogs, we
can intersect target and current state
(e.g. sizing the complexity of target features
by slicing the current implementation).
● Comprehensive: the pipeline can be
easily extended to include more
languages, projects, or artefacts (e.g.
configuration files, parsable
documentation)
● Flexible: leveraging meta-models we
can explore source code in a guided
way or build our own way through it
and identify risks and areas to
deep-dive (e.g. shared dependencies,
domain bleeding, domain complexity).
[Diagram: CaD pipeline]
● Source Code → Parse
● Discovery Stickies (SMEs) → Parse → Tables & APIs Annotations
● DB Catalog, APIs Catalog → Parse
● Build Annotated Graph
● Explore Graph & Build KPIs in Pharo (DEVs/SMEs)
● Merge KPIs & Metrics in Excel (BAs/SMEs)
● Refine Roadmap

Why Pharo?
Pharo is a fast modern Smalltalk focused on simplicity and immediate feedback.
Complex data structures
visualization tools
Pharo integrates the Roassal
library to display complex
interactive graph-oriented
data structures.
Dynamic graphical
inspector
Every object can be explored
with a graphical inspector
and may define custom views
based on Roassal.
Pause & resume
support
At any moment, we can
pause the exploration,
save objects/views to disk,
and restart later from where
we left off.
because we love object soup ;)

Lessons learned so far
Do we have answers?
Not yet, but we started to collect
evidence, not just gut feelings.
Look for what overlaps
(e.g. shared libraries, tables accessed
by several modules)
to anticipate possible issues.
Look for what matches
to collect data about effort
(e.g. LoC).
Look for what doesn’t match
(e.g. tables not accessed by code, code
not invoked by other code or APIs)
to uncover unseen risks.
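The "overlaps" and "doesn't match" checks reduce to a few set operations once the code is data. The sketch below uses invented modules, procedures, and tables; the three function names are hypothetical and the real KPIs in the talk were computed with other tooling.

```python
# Hypothetical facts merged from stickies and parsed source code.
MODULE_OF_PROC = {
    "invoice_total": "Billing",
    "post_entry": "Ledger",
    "old_export": "Billing",
}
ALL_TABLES = {"INVOICE", "LEDGER", "CUSTOMER", "TMP_1999"}
TABLE_ACCESS = {
    "invoice_total": {"INVOICE", "CUSTOMER"},
    "post_entry": {"LEDGER", "CUSTOMER"},
    "old_export": {"INVOICE"},
}
CALLERS = {
    "invoice_total": {"api_invoice"},
    "post_entry": {"invoice_total"},
    "old_export": set(),
}

def shared_tables(table_access, module_of_proc):
    """Overlaps: tables accessed by procedures of more than one module."""
    modules_by_table = {}
    for procedure, tables in table_access.items():
        for table in tables:
            modules_by_table.setdefault(table, set()).add(module_of_proc[procedure])
    return {t: m for t, m in modules_by_table.items() if len(m) > 1}

def orphan_tables(all_tables, table_access):
    """Mismatch: tables in the catalog that no parsed code touches."""
    accessed = set().union(*table_access.values())
    return all_tables - accessed

def uncalled_procedures(callers):
    """Mismatch: procedures not invoked by other code or by an API."""
    return {procedure for procedure, who in callers.items() if not who}

print(shared_tables(TABLE_ACCESS, MODULE_OF_PROC))
print(orphan_tables(ALL_TABLES, TABLE_ACCESS))
print(uncalled_procedures(CALLERS))
```

Each result is a list of candidates to discuss with SMEs, not a verdict: an orphan table may still be used by code outside the parsed scope.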

“In contrast to visual
programming and
diagramming for software
design, software visualization
is not so much concerned with
the construction, but with the
analysis of programs and their
development process.”
S. Diehl, Software Visualization
Springer, 1998, ISBN 9783540465041

“Challenges in data
visualization does not actually
involve visualizing Data. [...] The
challenge is in crafting a
visualization that is easily
reusable, composable, and
extensible.”
A. Bergel, Agile Visualization
Lulu Press, 2016, ISBN 978136531409

$ tail -f questions
Alessandro Confetti
Tech Principal
aconfet@thoughtworks.com

WIP

Hello, World
main( ) {
extern a, b, c;
putchar(a); putchar(b);
putchar(c); putchar('!*n');
}
a 'hell';
b 'o, w';
c 'orld';
Kernighan, Brian W. (1972). A Tutorial Introduction to the Language B. Bell Laboratories (p 4)

What is a
meta-model?
When we need to explore and reason
about complex systems, we need to find
the right kind of representation
(i.e. the right questions we need answers to).
A pragmatic way to find the right balance
between accuracy and outcome.
If we oversimplify it, we may end up with a lot
of underestimated or unmitigated risks.
If we overcomplicate it, we may easily enter
never-ending rabbit holes and struggle to
deliver the overall picture.
From abstract to concrete (and from simplified to complex):
● Meta-[...]-Model: symbols and grammar to represent the structure
and vocabulary of a valid meta-model
(e.g. alphabet, numbers, units, colors, cartographic projection).
It describes the meta-model.
● Meta-Model: structure and vocabulary of a valid model
(e.g. map legend and conventions). It describes the model.
● Model: simplified representation of the problem, driven
by the questions we need answered (e.g. street map).
It represents the subject.
● Subject/Problem: something we want to reason about
(e.g. route between two cities).
See for reference: J. Bezivin and O. Gerbe, Towards a precise definition of the OMG/MDA framework,
Proceedings 16th AICASE (ASE 2001), 2001, pp. 273-280, doi: 10.1109/ASE.2001.989813.


FAME meta-meta-model Description
A family of meta-meta-models for
describing and defining meta-models.
● All meta-models share a series of
common features and basic querying
capabilities.
● Can be used to describe different kinds
of common diagrams (e.g. E/R, UML),
semantics for hierarchical structures
(e.g. XML, JSON), or programming
languages (e.g. procedural, functional,
object-oriented).

FAMIX meta-model Description
A family of meta-models for representing
the structure of software projects.
● Supports both procedural and
object-oriented languages.
● Plugins are available for many languages
(e.g. C/C++, C#, Clojure, Java,
JavaScript, JSX/React, PHP).
● The MSE file format can be used to
export/import models based on
FAMIX.
● All models share a series of common
features and basic querying
capabilities (e.g. dependency trees).
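The relationship between a meta-model and a model can be shown with a toy example. This is deliberately not the FAMIX or FAME API, nor the MSE format: the entity types, relation names, and the `validate` helper below are invented to illustrate how a meta-model defines what a valid model may contain.

```python
# Hypothetical meta-model: allowed entity types and, per relation,
# the (source type, target type) it may connect.
META_MODEL = {
    "entities": {"Module", "Procedure", "Table"},
    "relations": {
        "contains": ("Module", "Procedure"),
        "calls": ("Procedure", "Procedure"),
        "reads": ("Procedure", "Table"),
        "writes": ("Procedure", "Table"),
    },
}

# Hypothetical model: named entities with their type, plus relations.
MODEL = {
    "entities": {
        "Pricing": "Module",
        "compute_price": "Procedure",
        "PRICE_LIST": "Table",
    },
    "relations": [
        ("contains", "Pricing", "compute_price"),
        ("reads", "compute_price", "PRICE_LIST"),
        ("calls", "Pricing", "PRICE_LIST"),  # invalid on purpose
    ],
}

def validate(model, meta_model):
    """Return a list of human-readable violations of the meta-model."""
    errors = []
    for name, kind in model["entities"].items():
        if kind not in meta_model["entities"]:
            errors.append(f"{name}: unknown entity type {kind}")
    for relation, source, target in model["relations"]:
        if relation not in meta_model["relations"]:
            errors.append(f"{relation}: unknown relation")
            continue
        expected = meta_model["relations"][relation]
        actual = (model["entities"].get(source), model["entities"].get(target))
        if actual != expected:
            errors.append(
                f"{relation}({source}, {target}): expected "
                f"{expected[0]} -> {expected[1]}, got {actual[0]} -> {actual[1]}"
            )
    return errors

errors = validate(MODEL, META_MODEL)
for error in errors:
    print(error)
```

Because every model shares the same vocabulary, generic queries (such as dependency trees) can be written once against the meta-model and reused across languages.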

We are not alone…
There is a growing community of researchers and tools
