The Future of
    Automated Malware Generation
                  Stephan Chenette
    Director of Security Research & Development



1
Who Am I?
    • Stephan Chenette @StephanChenette (twitter)
    • Currently Director of Security R&D @ IOActive
      •Building / Breaking / Hacking / Researching


    • R&D @ eEye Digital Security 4+ years
    • Head Security Researcher @ Websense 6+ years
    • (Graduate Student @ UCSD - Network Security)


2
What I hope you learn…
    • An understanding of the current malware landscape
    • Various malware/exploit defense techniques
    • Where I think detection/defense technologies are
      headed
    • How malware authors will most likely react →
      drive the future of automated malware generation




3
Statement
    This particular topic/area is a personal research interest
    of mine –

    I’m hoping to basically motivate you to think offensively
    when building or using defensive technologies…

    For Example: I’m currently helping with an open source
    automated detection technology for the Cuckoo
    sandbox – and am trying to evade/bypass it at the same
    time

4
Agenda
    • Current State of Automated Malware Generation
    • Current State of Malware Defense (Tech.)
    • Malware Trends
    • The Future of Malware Defense
    • The Future of Automated Malware Generation




5
Malware Distribution Networks
              (MDNs)




6
Malware Distribution Networks
    Malware has evolved into a profitable business for
    cyber criminals

    •Complex/Organized/Distributed Network
    •Malware Distribution Network (MDNs)
      •Pay-per-install (PPI) clients (RogueAV, SpamBot, keylogger)
      •PPI Services
      •PPI Affiliates (landing pages, redirection services, etc.)



7
Malware Distribution Networks (MDNs)


    [Figure: malware distribution network, stages 1 through 4]

    Source: Microsoft Security Intelligence Report (http://www.microsoft.com/sir)

8
Malware Distribution Networks (MDNs)

      Single Sample Repository
        A repository that does not update the malicious
        executable for the lifetime of the repository.


      Multiple Sample Repository
        A repository that performs updates to the malicious
        executable over time, but is not generating the
        samples for each request

      Polymorphic/Metamorphic Repository
        A repository that produces a unique malicious
        executable for every download request
9
Example: Blackhole Exploit Kit
     Blackhole contains an integrated AV scanner and will auto-repackage if
     malware is detected




     Figure: Blackhole exploit kit download chain

     Source: Manufacturing Compromise: The Emergence of Exploit-as-a-Service
     (http://cseweb.ucsd.edu/~voelker/pubs/eaas-ccs12.pdf)



10
Exploit Kits and Malware
         Blackhole | Incognito || ZeroAccess | TDSS




     Source: Manufacturing Compromise: The Emergence of Exploit-as-a-Service
     (http://cseweb.ucsd.edu/~voelker/pubs/eaas-ccs12.pdf)


11
Agenda
     • Current State of Automated Malware Generation
     • Current State of Malware Defense (Tech.)
     • Malware Trends
     • The Future of Malware Defense
     • The Future of Automated Malware Generation




12
Current State of Malware
         Defense (Tech.)




13
Current Techniques
     • Hash
     • Signatures
     • Heuristics
     • Semantics-aware detection




14
Current Techniques
              Attacker             Defender
          Easier to bypass   Easier to implement




          Harder to change   Harder to implement

15
Hash-based detection
     • Full file hashing (cryptographic checksum)
       •MD5, SHA1, SHA256


     • Portable Executable (PE)
       •Sectional hashing
       •Custom hashing
       •Fuzzy hashing (ssdeep)

      • Err on the side of caution
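
      A minimal sketch of full-file and per-section hashing in Python, for reference; it assumes
      the standard hashlib module plus the third-party pefile module, and fuzzy hashing would
      additionally need a binding such as ssdeep:

        import hashlib
        import pefile  # third-party; assumed available for sectional hashing

        def full_file_hashes(path):
            data = open(path, "rb").read()
            return {alg: hashlib.new(alg, data).hexdigest()
                    for alg in ("md5", "sha1", "sha256")}

        def section_hashes(path):
            # Hash each PE section body separately, so a change in one section
            # does not invalidate every fingerprint.
            pe = pefile.PE(path)
            return {s.Name.rstrip(b"\x00").decode(errors="replace"):
                        hashlib.sha256(s.get_data()).hexdigest()
                    for s in pe.sections}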

16
Defeating Hash-based detection
      • Create a unique malware sample per user request
        •Randomizing a single byte at an irrelevant file offset
       •Re-packaging binary (FSG, ASPack, Themida)
       •Re-building malware dynamically
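
      To see why per-sample uniqueness defeats hash blacklists, this toy sketch (my own
      illustration) appends a single random byte to a copy of a file: every full-file hash
      changes, while the Windows loader generally ignores such overlay data.

        import hashlib, os, shutil

        def make_unique_copy(src, dst):
            # Copy the file and append one random byte past the end of the image;
            # the hash of the copy no longer matches any blacklist entry.
            shutil.copyfile(src, dst)
            with open(dst, "ab") as f:
                f.write(os.urandom(1))

        def sha256(path):
            return hashlib.sha256(open(path, "rb").read()).hexdigest()

        # After make_unique_copy("sample.exe", "sample_unique.exe"),
        # sha256("sample.exe") != sha256("sample_unique.exe")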




17
Signature-based detection
     • Regular Expression based signatures (PCRE, RE2)
     • Byte-signatures
      rule ASPack
      {
          strings:
              $ = { 60 E8 ?? ?? ?? ?? 5D 81 ED ?? ?? (43 | 44) ?? B8 ?? ?? (43 | 44) ?? 03 C5 }
              $ = { 60 EB ?? 5D EB ?? FF ?? ?? ?? ?? ?? E9 }
              $ = { 60 EB 03 5D FF E5 E8 F8 FF FF FF 81 ED 1B 6A 44 00 BB 10 6A 44 00 03 DD 2B 9D 2A }
              $ = { 60 E8 00 00 00 00 5D ?? ?? ?? ?? ?? ?? BB ?? ?? ?? ?? 03 DD }
              $ = { 60 E8 41 06 00 00 EB 41 }
              $ = { 60 E8 7? 05 00 00 EB (33 | 4C) }

          condition:
              for any of them : ($ at entrypoint)
      }


     • Deeper contextual content scanning with proprietary
       language
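
      A rule like the one above can be applied from Python through the yara-python bindings;
      this is a minimal sketch, and the rule-file and sample names are placeholders of mine:

        import yara  # yara-python bindings, assumed installed

        rules = yara.compile(filepath="aspack.yar")   # hypothetical file holding the ASPack rule
        matches = rules.match(filepath="sample.exe")  # hypothetical sample to scan
        for m in matches:
            print(m.rule)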
18
Defeating Signature-based detection
     • Syntax mutation easily defeats this technique
               •    Garbage Code Insertion e.g. NOP, “MOV ax, ax”, “SUB ax, 0”
              •    Register Renaming
              •    Subroutine Permutation
              •    Code Reordering through Jumps
              •    Equivalent instruction substitution
     Instruction          Equivalent instruction
     MOV EAX, EBX         PUSH EBX, POP EAX

     Call                 Emulated Call                            Misused Call
     CALL <target>        PUSH <PC + sizeof(PUSH) + sizeof(JMP)>   CALL <target>
                          JMP <target>
                                                                   .target
                                                                   POP <register-name>

              • Same behavior but different syntax
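
      To show how cheap syntax mutation is, here is a toy, text-level sketch (purely
      illustrative) that rewrites a register-to-register MOV into the equivalent PUSH/POP pair
      and occasionally inserts a garbage instruction, changing the bytes a signature would match:

        import random
        import re

        GARBAGE = ["NOP", "MOV EAX, EAX", "SUB EBX, 0"]

        def mutate(lines):
            out = []
            for line in lines:
                # Equivalent-instruction substitution: MOV r1, r2 -> PUSH r2 / POP r1
                m = re.match(r"\s*MOV\s+(E\w\w),\s*(E\w\w)\s*$", line, re.IGNORECASE)
                if m:
                    out += ["PUSH %s" % m.group(2), "POP %s" % m.group(1)]
                else:
                    out.append(line)
                # Garbage-code insertion: occasionally add a semantic no-op
                if random.random() < 0.3:
                    out.append(random.choice(GARBAGE))
            return out

        print("\n".join(mutate(["MOV EAX, EBX", "CALL target"])))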
19
Heuristics are introduced…
     AV engines were forced to evolve and use heuristics by
     way of emulation/behavioral analysis due to:
       •Polymorphic engines
         • Encrypt body with randomly generated encryption
           algorithm
         • Private key normally in decoding engine
       •Metamorphic engines
         • Employs obfuscation/substitution techniques instead of encryption
           •   Junk insertion, equivalent instruction substitution, etc.




20
Heuristics-based detection
     General term for the different techniques used to
     detect malware by their behavior
        Emulation, API hooking, sand-boxing, file anomalies and other analysis techniques



                  [Figure: nested behavior rules, Rule A containing Rule B containing Rule C]

                                       IF Rule A then Rule B then Rule C then Poison Ivy




      Source: (http://hooked-on-mnemonics.blogspot.com)
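
      Conceptually, a rule chain like the Poison Ivy example is an ordered match over a sandbox
      trace. A minimal sketch follows; the rules and field names are hypothetical, not taken
      from any product:

        def rules_in_order(trace, rules):
            # True if every rule matches the trace in sequence
            # (Rule A, then Rule B, then Rule C), with gaps allowed in between.
            events = iter(trace)
            return all(any(rule(e) for e in events) for rule in rules)

        # Hypothetical rules over API-call events from a sandbox report
        rule_a = lambda e: e["api"] == "CreateMutexW"
        rule_b = lambda e: e["api"] == "RegSetValueExW" and "\\Run" in e.get("arg", "")
        rule_c = lambda e: e["api"] == "connect" and e.get("port") == 3460

        # if rules_in_order(report["calls"], [rule_a, rule_b, rule_c]): label it "Poison Ivy"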

21
Defeating Heuristics-based detection
     • Detect emulation and execute different code path
     • Break emulation engine
     • Avoid the heuristics

     • Overall solid method
     • Possible false positives




22
Semantics-aware Detection
      • Captured execution trace is transformed into a higher-level
        representation capturing its semantic meaning, i.e., the trace
        is first abstracted before being compared to a malicious
        behavior
       • Building the code flow or extracting a model can be made
         infeasible for real-time AV by using time-lock puzzles


       • Intermediate representation (IR)
         •   Abstract Syntax Trees, Register Transfer Language



23
Semantics-aware detection




      Good idea in theory, but unknown (to me) how widely
      implemented this is in security products


24
Defeating Semantics-aware detection

      Implementation is difficult
      Limited support for equivalent code sequences

               a = b * 2
               a = b << 1
      A left arithmetic shift by n is equivalent to multiplying by 2^n
      (provided the value does not overflow)

      Focus on the same techniques used to defeat signatures
      and heuristics, plus the likelihood of limited support for
      less popular instructions
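
      One way to see why b * 2 and b << 1 frustrate purely syntactic matching is to test code
      fragments for input/output equivalence. The sketch below (my own illustration, using
      random testing rather than a formal proof) treats the two as behaviorally identical even
      though their syntax differs:

        import random

        def probably_equivalent(f, g, trials=1000):
            # Random testing: not a proof, but enough to group
            # syntactically different, semantically equal fragments.
            for _ in range(trials):
                b = random.randrange(0, 1 << 30)   # stay well below overflow
                if f(b) != g(b):
                    return False
            return True

        print(probably_equivalent(lambda b: b * 2, lambda b: b << 1))   # True
        print(probably_equivalent(lambda b: b * 2, lambda b: b << 2))   # False (almost surely)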
25
Recap




26
Agenda
     • Current State of Automated Malware Generation
     • Current State of Malware Defense (Tech.)
     • Malware Trends
     • The Future of Malware Defense
     • The Future of Automated Malware Generation




27
Malware Trends




28
Malware Detection Reality Check
     • How well are current detection techniques working?




                       33%!
29
Malware Samples
     Observation: # of Malware Samples are increasing




      Source: McAfee Global Q1 2012 Threat Report
     (http://mcafee.com/us/resources/reports/rp-quarterly-threat-q1-2012.pdf)


30
Mobile Malware Samples
     Observation: # of Android Malware Samples are
     increasing




      Source: Kaspersky Q1 2012 Threat Report
     (http://www.securelist.com/en/analysis/204792231/IT_Threat_Evolution_Q1_2012)

31
Use of Behavior Sandboxes
     Client binary is malware but isn’t detected.
     Suspicious files are sent back to “home base/cloud”
     lab for analysis
      1. Sent to sandbox system
      2. Metadata report is created for easier export of
         new rules
      a. Hash and blacklist entries are added
      b. Signatures are added
      c. Heuristic detection is added

32
The Overworked Malware Analyst




33
Solving the problem with people
       Malware Analysts   vs.   Malware Samples

                     OVERLOAD!!


34
Agenda
     • Current State of Automated Malware Generation
     • Current State of Malware Defense (Tech.)
     • Malware Trends
     • The Future of Malware Defense
     • The Future of Automated Malware Generation




35
The Future of Malware Defense




     Skynet? …probably not
      But some of the concepts aren’t too far-fetched…



36
The Future of Malware Defense


       Perhaps malware detection should have more
                  science applied to it.




37
The Malware Infinity Problem
     Malware detection
      As the number of malware samples approaches ∞ we can’t
      manually add detection for every file. We must model WHAT
      actions malware takes, HOW it performs those actions
      and WHERE it makes connections.

     Malware Attribution
     As Attack Surface approaches ∞ we can’t defend
     everything from everyone. We must model WHO is
     after WHICH assets and HOW they attack.

38
The Future of Malware Defense
     IF we are going to start modeling we must make
     some assumptions:

     1.Attackers are going to change their code and
     techniques only enough to avoid detection
     2.The majority of malware/exploits code and
     techniques will continue to represent future
     malware/exploits code and techniques


39
The Who is important…
     “Researchers at Symantec traced the group’s work after
     finding a number of similarities between the Google attack
     code and methods and those used against other
     companies and organizations over the last few years.

     The researchers, who describe their findings in a report
     published Friday, say the gang — which they have dubbed
     the “Elderwood gang” based on the name of a
     parameter used in the attack codes — appears to
     have breached more than 1,000 computers in
     companies spread throughout several sectors –
     including defense, shipping, oil and gas, financial,
     technology and ISPs. The group has also targeted non-
     governmental organizations, particularly ones connected
     to human rights activities related to Tibet and China”

     Source: http://www.wired.com/threatlevel/2012/09/google-
     hacker-gang-returns/




40
Statistics
      A discipline that helps you understand data and
      make decisions based on that data
           Data  →  STATISTICS  →  Decisions
41
Train the Machines
            •Classify
            •Cluster




42
Automatic Classification
      Steps:
      1. Extract features
      2. Train models using ML algorithms
      3. Feature selection
      4. Use models as classifiers
      5. Use models to classify unknown files as 0 or 1




      Source: http://eval.symantec.com/mktginfo/enterprise/white_papers/b-dlp_machine_learning.WP_en-us.pdf
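
      These five steps map almost directly onto an off-the-shelf ML library. Below is a minimal
      sketch of the same pipeline in Python; the toy data, feature count and the Random Forest
      choice are my own assumptions for illustration, not the approach from the paper above:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        # 1. Extract features: one numeric vector per file
        #    (random toy data stands in for e.g. size, entropy, #sections, packed?)
        rng = np.random.default_rng(0)
        X = rng.random((1000, 8))                      # 1000 "files", 8 features each
        y = (X[:, 0] + X[:, 3] > 1.0).astype(int)      # toy labels: 1 = malicious, 0 = benign

        # 2./3. Train a model; feature importances can drive feature selection
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
        print(clf.feature_importances_)

        # 4./5. Use the model as a classifier on unknown files (0 or 1 per file)
        print(clf.score(X_test, y_test))
        print(clf.predict(X_test[:5]))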
43
Machine learning
     Where we train computers to make statistical
      decisions on real-time data based on input data

     While machine learning as a concept has been
     around for decades and has been used in everything
     from anti-spam engines to Google™ algorithms for
     translating text, it is only now being applied to web
     filtering, DLP and malware content analysis.


44
Historical Observation
     Historically certain malware has
     •No icon
     •No description or company in resource section
     •Is packed
      •Lives in the Windows directory or user profile


     These are the type of “features” that expert humans
     would feed to machine learning classifiers to train on

45
Expert Humans train Machines
     “You can’t effectively and consistently manage what you can’t
     measure, and you can’t measure what you haven’t defined…”
     SOURCE: http://fairwiki.riskmanagementinsight.com/?page_id=3




     •The job of the human
        •List features

     •The job of the machine
        •Model which features are important, in what grouping and in what order
     •Classify
     •Cluster


46
Machine Learning (ML) Algorithms

      • Naive Bayesian Classifier (assumes each feature is independent of the
        other features)
      • Support Vector Machine (SVM) when you have high dimensionality
        (more than a thousand variables in the model)
      • Random Forest when you want an interpretable model (<
        2000 features)
      • Markov Chains (Natural Language Processing) for when you
        want to assess the sequence probability


47
The Future of Malware Defense

                       Network
                      File System
                    Physical Memory

        [Figure: an inspection point at each layer]

        Every layer provides various degrees of
                  “features” to inspect

48
The Future of Malware Defense




49
Existing Academic work…
     • D. Plonka and P. Barford. Context-Aware Clustering of DNS Query
       Traffic. In Proceedings of the 8th ACM SIGCOMM conference on
       Internet Measurement, October 2008.

     • R. Perdisci, W. Lee, and N. Feamster. Behavioral Clustering of HTTP-
       Based Malware and Signature Generation Using Malicious Network
       Traces. In Proceedings of the 7th USENIX conference on Networked
       Systems Design and Implementation, April 2010.

     • K. Rieck, P. Trinius, C. Willems, T. Holz. Automatic Analysis of
        Malware Behavior using Machine Learning. Journal of Computer
        Security, 2011.


50
Projects using machine learning
       •Razorback™ -
       http://sourceforge.net/projects/razorbacktm/files/
      •Malheur - http://www.mlsec.org/malheur/
      •Malvic - http://www.malvic.org
      •Adobe Open Source Malware Classification Tool
       http://sourceforge.net/projects/malclassifier.adobe/
        • 98.21% accuracy
        • 6.7% false positive rate
        • 7 features = DebugSize, ImageVersion, IatRVA, ExportSize,
          ResourceSize, VirtualSize2, NumberOfSections
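
        Those seven header fields can be pulled with the third-party pefile module roughly as
        shown below; the field-to-structure mapping is my best-guess reconstruction for
        illustration, not code from the Adobe tool:

          import pefile

          def adobe_classifier_features(path):
              pe = pefile.PE(path, fast_load=True)
              dd = pe.OPTIONAL_HEADER.DATA_DIRECTORY
              return {
                  "DebugSize":        dd[6].Size,             # debug directory
                  "ImageVersion":     pe.OPTIONAL_HEADER.MajorImageVersion,
                  "IatRVA":           dd[12].VirtualAddress,  # import address table
                  "ExportSize":       dd[0].Size,             # export directory
                  "ResourceSize":     dd[2].Size,             # resource directory
                  "VirtualSize2":     pe.sections[1].Misc_VirtualSize if len(pe.sections) > 1 else 0,
                  "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
              }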


51
Statistics Based Detection Tools




52
The Future of Malware Defense
      •Using Machine learning for malware detection is only as
       useful as the features you create and the good and bad
       sample sets it’s trained on.
        • Features
        • Good Sample Set
        • Bad Sample Set

        • If you have 1000s of samples but of the same malware or the
          same exploit… not good!




53
PDF Example Features
     • Compressed JavaScript
      • PDF header location, e.g. %PDF within the first 1024 bytes
     • Does it contain an embedded file (e.g. flash, sound file)
     • Signed by a trusted certificate
      • Encoded/Encrypted streams, e.g. FlateDecode
     • Names hex escaped
     • Bogus xref table

     Reference: http://blog.fireeye.com/files/27c3_julia_wolf_omg-wtf-pdf.pdf
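
      Several of these features need nothing more than byte-level scanning. A rough sketch
      (the token list and the way each check is scored are my own choices) that turns a PDF
      into such a feature vector:

        import re

        def pdf_features(path):
            data = open(path, "rb").read()
            return {
                "header_in_first_1024": 0 <= data.find(b"%PDF") < 1024,
                "has_javascript":       b"/JavaScript" in data or b"/JS" in data,
                "has_embedded_file":    b"/EmbeddedFile" in data,
                "flate_streams":        len(re.findall(rb"/FlateDecode", data)),
                "hex_escaped_names":    len(re.findall(rb"/[A-Za-z]*#[0-9A-Fa-f]{2}", data)),
                "xref_count":           data.count(b"xref"),  # 0 or >1 may hint at a bogus table
            }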



54
Detecting shellcode
                                   • Markov chains
                                     To determine the probability of
                                     instruction sequences

                                   • Technique clustering

        [Figure: instruction-transition graph with example edge probabilities 0.3, 0.7, 0.4, 0.6]


          XOR   ECX, ECX               ;   ECX = 0
          MOV   ESI, [FS:ECX + 0x30]   ;   ESI = &(PEB) ([FS:0x30])
          MOV   ESI, [ESI + 0x0C]      ;   ESI = PEB->Ldr
           MOV   ESI, [ESI + 0x1C]      ;   ESI = PEB->Ldr.InInitOrder
          next_module:
           MOV   EBP, [ESI + 0x08]      ;   EBP = InInitOrder[X].base_address
           MOV   EDI, [ESI + 0x20]      ;   EDI = InInitOrder[X].module_name (unicode)
          MOV   ESI, [ESI]             ;   ESI = InInitOrder[X].flink (next module)
          CMP   [EDI + 12*2], CL       ;   modulename[12] == 0 ?
          JNE   next_module            ;   No: try next module.
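
      The Markov-chain idea boils down to learning transition probabilities between successive
      mnemonics from known shellcode, then scoring new instruction sequences against them. A toy
      sketch, with an invented training sequence and probability floor:

        from collections import Counter, defaultdict

        def train_transitions(sequences):
            counts = defaultdict(Counter)
            for seq in sequences:
                for a, b in zip(seq, seq[1:]):
                    counts[a][b] += 1
            return {a: {b: n / sum(c.values()) for b, n in c.items()}
                    for a, c in counts.items()}

        def sequence_probability(seq, model, floor=1e-4):
            p = 1.0
            for a, b in zip(seq, seq[1:]):
                p *= model.get(a, {}).get(b, floor)  # unseen transitions get a small floor
            return p

        # Toy training data: mnemonic sequence from the PEB-walking snippet above
        model = train_transitions([["XOR", "MOV", "MOV", "MOV", "MOV", "MOV", "MOV", "CMP", "JNE"]])
        print(sequence_probability(["XOR", "MOV", "MOV", "CMP", "JNE"], model))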


55
Shellcode detection


     Decoder routine clustering
      Detect entropy of bytes to indicate an encoded
      payload

     ...features =]
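
      Byte entropy is an easy feature to compute. A minimal sketch (window size and threshold
      are arbitrary choices of mine) that flags high-entropy regions which may indicate an
      encoded payload:

        import math
        from collections import Counter

        def shannon_entropy(buf):
            counts = Counter(buf)
            total = len(buf)
            return -sum((n / total) * math.log2(n / total) for n in counts.values())

        def high_entropy_windows(data, window=256, threshold=7.0):
            # Close to 8 bits/byte suggests encrypted or otherwise well-encoded content.
            return [off for off in range(0, len(data) - window, window)
                    if shannon_entropy(data[off:off + window]) > threshold]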


56
Malware features in action …
     • Features:
       •Static:
          • Packed
          • File size
          • Origin
       •Dynamic (Network)
          • Makes a connection
           • Number of DNS requests
          • Encrypted Communication
          • Burst/length of communication
       •Dynamic (File)
           • Registry keys
          • File level modifications
57
The Future of Malware Defense
     • Choose features that are harder for the attacker to
       change.
       •E.g. bot network communication protocol
        (if not encrypted)




58
Agenda
     • Current State of Automated Malware Generation
     • Current State of Malware Defense (Tech.)
     • Malware Trends
     • The Future of Malware Defense
     • The Future of Automated Malware Generation




59
The Future of Automated
       Malware Generation




60
The Future of Malware Offense
      The Attacker has a few things in their favor:
      1. Prone to False Positives
           Machine learning can be prone to false positives and false negatives
           if feature and sample sets aren’t extensive enough
      2. Avoid Feature Indicators
           Detection via machine learning can be defeated if an attacker can
           find out where the features are and avoid them
      3. New Features Come Out…
           You can’t protect yourself from a new weapon if you don’t know it
           exists


61
Prone to false positives
      If the defense side creates models based on a small sample
      set, or a sample set that isn’t diverse enough, then the
      model will be too restrictive – false negatives


     If the defense creates models based only on malicious files
     and not enough good files there will be tons of false positives


      An attacker can always try to poison the sample sets if they have
      enough manipulation power and resources (VirusTotal)
62
Avoid feature indicators
     • Attackers can always do the same research and model generic
       malware and avoid features that are being used by most
       malware
      • …and instead use features that are more popular in benign
        software
     • This will also avoid being placed in known clusters




63
New features come out…
     • If format changes, or gets updated:

       •A new file/protocol parser must be created/updated to
        understand and extract features

       •The model must be retrained and shipped out




64
…OR Just keep it simple
     Encrypt binaries with a user-specific key so that AV
     can’t decrypt it

     •Targeted binary like Gauss
       •Encrypted DLL with user key


     •Zeus
       •Encrypted the downloaded binary with user key




65
Conclusion
     • Complex/Organized Network
     • Malware distribution network (MDNs)
       •Pay-per-install (PPI) clients
       •Malware crypt services will include
         • Feature verification
             •   anti-clustering technology → the future?
             •   anti-classification technology → the future?

     Will this be the future of automated
     malware generation? Or will it just be more
                              of the same?
66
Conclusion

      Today, what I hope you learned is that if
      you want to truly understand your defensive
      technology you have to understand its
      limitations and look at things from an
      attacker/offensive viewpoint.




67
Conclusion

      Proper security is all about a defense-in-depth
      strategy. Create multiple layers of defense,
      with every layer presenting a different set of
      challenges, requiring different skill sets and
      technology.
      Every layer then increases the time and effort
      needed to compromise your environment and
      exfiltrate data.
68
Conclusion

     External reconnaissance
     Penetration
     Internal reconnaissance + stage persistent state
     Exfiltration

      If your security strategy is successful, the attack is
      stopped by your layered defenses before exfiltration
      of data can happen.
69
Questions?

     questions.py:
     while len(questions) > 0:
       if time <= 0:
           break
        print(answers[questions.pop()])

70
Thanks Pacsec!
            Stephan Chenette | @StephanChenette
            Director of Research and Development


               IOActive, Inc. http://ioactive.com




71
