Authors Grigory Ponomarenko Petr Klyucharev
License CC-BY-4.0
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

JavaScript Programs Obfuscation Detection Method that Uses Artificial Neural Network with Attention Mechanism

Grigory Ponomarenko, Petr Klyucharev
Information Security Department, Bauman Moscow State Technical University, Moscow, Russia
gs.ponomarenko@yandex.ru; pk.iu8@yandex.ru

Abstract—In this paper, we consider a JavaScript code obfuscation detection method that uses an artificial neural network with an attention mechanism as the classifier. Obfuscation is widely used by malware writers who want to obscure malicious intentions, e.g. in exploit kits, and it is also a common component of intellectual property protection systems. Non-obfuscated JavaScript code samples were obtained from the software repository service Github.com; obfuscated samples were created with obfuscators found on the same service. Before being fed to the network, each JavaScript program is converted to the general path-based representation, i.e. each program is described by a set of paths in its abstract syntax tree. The model proposed in this paper is a feedforward artificial neural network with an attention mechanism. We aimed to build a model that relies on the structure of AST paths instead of statistical features. According to the experimental results, the evaluated model, with some improvements, can potentially be used in malicious code detection systems, browser or mobile device fingerprint collection systems, etc.

Keywords—obfuscation classification, obfuscated code, obfuscation recognition, JavaScript obfuscation, general path-based representation, ECMAScript obfuscation, AST-based pattern recognition

I. INTRODUCTION

According to Varnovsky et al. [1], obfuscation was first implicitly mentioned in 1976 in the famous Diffie and Hellman paper [2] that introduced the concept of asymmetric cryptography. Diffie and Hellman suggested inserting a secret key into an encryption program and then transforming the key-initialized program in such an intricate way that extracting the secret key would become a very difficult task. The concept of obfuscation was explicitly introduced in 1997 in the paper by Collberg, Thomborson and Low [3].

Han Liu et al. define obfuscation as a special program transformation whose purpose is to obscure source code or binary code in order to prevent the implemented algorithms and data structures from being recovered [4]. An obfuscated program is obtained from the original one by applying obfuscation; the original program is therefore called non-obfuscated [5, 6].

Schrittwieser et al. remark that at the beginning of the computer era obfuscation was commonly used, in particular, to surprise users by displaying unexpected messages, whereas today obfuscation is mostly used to protect intellectual property or to obscure malicious intentions [7]. Boaz Barak notes that obfuscation does not make the protected program invincible: an obfuscated program should resist reverse engineering in the same sense that an encryption system should not be breakable with any sensible amount of time and computational resources [8].

II. PROBLEM DEFINITION

The problem of distinguishing obfuscated from non-obfuscated programs is closely linked to source code property prediction and to other kinds of program classification. To formalize the problem, we use the definitions introduced by Silvio Cesare and Yang Xiang in the first chapter of their book "Software Similarity and Classification" [9].

Let r be a property of a program p if r is true for all possible execution flows. A program q is called an obfuscated copy of a program p if q is the result of transformations that preserve the semantics (meaning) of the algorithms and data structures. Programs p and q are similar if they are based on the same program.

Let P be the set of program source codes, and let f1, ..., fk be functions that extract features from a program, i.e. fi: P → Di, where Di is the i-th feature set. Let {p1, ..., pn} ⊂ P be the training sample and Y = {0, 1} the class labels (1 is assigned to obfuscated programs, 0 to non-obfuscated ones). It is necessary to find, using the training sample {p1, ..., pn}, a map s: D1 × ... × Dk → {0, 1} that classifies all elements of P with the smallest error function value.
III. DATASET PREPARATION

To create a dataset with obfuscated and non-obfuscated JavaScript code samples, we used the software repository service github.com, one of the largest platforms for software project hosting and collaborative development. The 100 most popular JavaScript projects were downloaded. To get the list of the most popular projects, we used the search API provided by the service (referred to as the Github Search API). All projects from the resulting list were then cloned to a local machine. Downloading was done on March 22, 2019, and all downloaded projects took up 7.3 GB of disk space. 49612 files with the ".js" extension (excluding files with the ".min.js" extension) were retrieved from the obtained data. In order to simplify the further creation of obfuscated code samples, it was decided to extract individual functions from the scripts. An example of a simple JavaScript function is shown in fig. 1.

A Node.js script that retrieves functions from a JavaScript program was written using the Esprima library. With this library an abstract syntax tree (AST) can be built for any JavaScript program that complies with the ECMAScript 2016 standard. An abstract syntax tree for a script is formed according to the syntactic rules of the programming language, and the inverse transformation can be applied to generate correct program code from the tree. Unlike plain source code, ASTs do not include punctuation, delimiters, comments and some other details, but they describe the syntactic structure of the script along with lexical information [10]. The abstract syntax tree of the simple JavaScript program from fig. 1 is shown in fig. 2.

ASTs were built for all previously downloaded ".js" scripts (49612 samples) using the parseModule method provided by the Esprima API. During tree traversal, the program code of each "FunctionDeclaration" element was saved into a separate file. Thereby 126276 files were produced, each containing a single JavaScript function; all files took up 527 MB of disk space.

To generate obfuscated code samples, we used special programs that implement JavaScript code obfuscation. On the above-mentioned github.com repository hosting we found 6 obfuscators that fit our needs:

• javascript-obfuscator/javascript-obfuscator
• zswang/jfogs
• anseki/gnirts
• mishoo/UglifyJS2
• alexhorn/defendjs
• wearefractal/node-obf

Obfuscators can work in different ways. Some of them, e.g. jfogs and UglifyJS2, do not significantly change the syntactic structure of the program and mostly rename identifiers and shuffle independent parts. Other obfuscators, such as gnirts or defendjs, completely change the syntactic structure of the scripts.

The PigeonJS library was used to extract features from the source code of the JavaScript programs [11]. It provides an API to get the list of paths of a given length over the AST. A path over the AST formed by PigeonJS has the following structure, called the general path-based representation [11]:

    n1 s1 n2 s2 ... s(k-1) nk,    (1)

where n1, ..., nk are AST vertices. The vertices are separated by "v" and "^" depending on whether the left vertex is higher or lower on the tree in comparison with the right one. One of the paths retrieved from the basic JavaScript program example (fig. 1) is shown in fig. 3.

Fig. 1. Basic JavaScript function example
Fig. 2. AST of the basic JavaScript function
Fig. 3. One of the paths extracted from the basic JavaScript function AST
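The path extraction sketched above can be illustrated without any parser dependency. The snippet below is an approximation in the spirit of the general path-based representation, not the PigeonJS implementation: it walks a hand-built fragment of an Esprima-style AST (the `children` field and the leaf labeling are simplifications introduced here) and joins two leaves through their lowest common ancestor, emitting "^" while climbing and "v" while descending.

```javascript
// Hand-built AST fragment (simplified) for: function f(a) { return a; }
const ast = {
  type: 'FunctionDeclaration',
  children: [
    { type: 'Identifier', name: 'f', children: [] },
    { type: 'Identifier', name: 'a', children: [] },
    {
      type: 'BlockStatement',
      children: [
        {
          type: 'ReturnStatement',
          children: [{ type: 'Identifier', name: 'a', children: [] }],
        },
      ],
    },
  ],
};

// Collect the chain of nodes from the root down to every leaf.
function leafChains(node, trail = []) {
  const here = [...trail, node];
  if (node.children.length === 0) return [here];
  return node.children.flatMap((c) => leafChains(c, here));
}

// Join two root-to-leaf chains at their lowest common ancestor:
// '^' while ascending from the first leaf, 'v' while descending to the second.
function pathBetween(chainA, chainB) {
  let i = 0;
  while (i < chainA.length && i < chainB.length && chainA[i] === chainB[i]) i++;
  const up = chainA.slice(i - 1).reverse(); // [leafA, ..., LCA]
  const down = chainB.slice(i);             // [below LCA, ..., leafB]
  const label = (n) => n.name || n.type;    // leaves by token, inner nodes by type
  let s = label(up[0]);
  for (let k = 1; k < up.length; k++) s += ' ^ ' + label(up[k]);
  for (const n of down) s += ' v ' + label(n);
  return s;
}

const chains = leafChains(ast);
// Path between the function name 'f' and the returned identifier 'a':
console.log(pathBetween(chains[0], chains[2]));
// -> f ^ FunctionDeclaration v BlockStatement v ReturnStatement v a
```

In the real representation the extractor also limits path length and width; here both leaves and the separator convention follow the description around equation (1).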
Not all scripts were obfuscated with every obfuscator listed above, and contexts were retrieved not from all programs. The main reason is that some of the downloaded JavaScript files contained programs with nonstandard features and extensions. Besides, some obfuscated scripts took up to 1 GB of storage space although the original scripts were only 200-300 KB in size. We decided to remove such samples from the dataset.

IV. NEURAL NETWORK ARCHITECTURE

The architecture of the artificial neural network used in this study is based on the network proposed by Uri Alon et al. in their paper "code2vec: Learning Distributed Representations of Code" [12]. The researchers built an artificial neural network that predicts method names for programs written in Java and obtained excellent results: at the time of publication, they had the best percentage of correctly named methods among all known studies, about 60%. So we decided to adapt that network to the JavaScript code obfuscation recognition problem.

Fig. 4. Neural network architecture scheme

The main objects the network works with are script contexts. A context ci = (xs, p, xt) is a tuple of three elements: the start vertex, the path, and the terminal vertex. The start vertex xs and the terminal vertex xt are elements of the set T of start and terminal vertices; the path p is an element of the path set P. Every JavaScript program (no matter whether it is obfuscated or not) is described by a set of contexts:

    {c1, ..., cn}, ci = (xs, p, xt).    (2)

Each element x of the vertex set T has its own vector representation vx in VectT (a 128-dimensional real vector). Similarly, each element p of the path set P has its own vector representation vp in VectP (a 128-dimensional real vector). In that way each context has a 384-dimensional vector representation formed by concatenation:

    vi = (vxs, vp, vxt).    (3)

Then each script is described by a tuple of context vector representations:

    (v1, v2, ..., v200).    (4)

The maximum number of contexts per script was 200. If a script had fewer than 200 contexts, the tuple was padded with zero-filled contexts:

    (v1, ..., vn, 0, ..., 0), n < 200.    (5)

The artificial neural network architecture used in this research is shown in fig. 4. First, there is a fully connected layer to which the Dropout regularization method is applied: 75% of randomly chosen neurons are ignored (not considered during the forward pass) on each epoch. This helps to prevent overfitting and improves model performance on unseen samples.

The fully connected layer has the tanh activation function:

    di = tanh(W · vi),    (6)

where vi is a context vector representation, di is a combined context vector representation and W is the fully connected layer weight matrix.

Based on the combined contexts {d1, ..., d200} and the attention vector α, the attention weight αi is calculated for each di. The vector α is initialized with random values and updated during training:

    αi = exp(diT · α) / Σj exp(djT · α).    (7)

Obviously, the sum of all αi equals 1. After that, a code vector v is calculated using the attention weights αi as follows:

    v = Σi αi · di.    (8)

Since all attention weights αi are nonnegative and sum to 1, the calculation of the code vector can be seen as a weighted average over all combined contexts di.

The idea behind the attention mechanism can be described as choosing the most interesting part of the resulting set. The softmax transformation (7) is a key component of several statistical learning models, but recently it has also been used to design attention mechanisms in neural networks [13]. Attention mechanisms are used to solve various applied problems with artificial neural networks, e.g. multilingual translation [14], sentiment classification [15], time-series classification [16], vehicle image classification [17] or speech recognition [18].

At the last step, the final decision on whether an obfuscated or a non-obfuscated script was passed to the network input is calculated using two 128-dimensional real vectors yobf and ynotobf, which are initialized randomly and updated during model training. The JavaScript program obfuscation probability q(v) is calculated based on the code vector v (8):

    q(v) = exp(vT · yobf) / (exp(vT · yobf) + exp(vT · ynotobf)).    (9)

If q(v) > 0.5, the script is considered obfuscated; the probability that the script is non-obfuscated is estimated as 1 − q(v), respectively. For one script, the loss function (cross-entropy) is computed as follows:

    L = −(p(v) · log q(v) + (1 − p(v)) · log(1 − q(v))),    (10)

where p(v) = 1 for obfuscated scripts and p(v) = 0 for non-obfuscated scripts. To minimize the loss function, adaptive moment estimation (Adam) was used as the optimization algorithm.
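The forward pass of this architecture (combined contexts, softmax attention, weighted-average code vector, two-vector output, cross-entropy loss) can be sketched numerically. The snippet below is a toy illustration, not the trained model: the dimensions (3 contexts, 4-dimensional context vectors, 2-dimensional combined vectors) and all weight values are made up.

```javascript
// Toy forward pass for the attention classifier; all weights are invented.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const matVec = (M, x) => M.map((row) => dot(row, x));

// Combined contexts: d_i = tanh(W * v_i); here W maps R^4 -> R^2.
const W = [
  [0.1, -0.2, 0.3, 0.0],
  [0.05, 0.1, -0.1, 0.2],
];
const contexts = [
  [1, 0, 0, 1],
  [0, 1, 1, 0],
  [1, 1, 0, 0],
]; // three context vectors v_i
const combined = contexts.map((v) => matVec(W, v).map(Math.tanh));

// Attention weights: softmax over d_i^T * alpha.
const alpha = [0.7, -0.3];
const scores = combined.map((d) => Math.exp(dot(d, alpha)));
const total = scores.reduce((s, x) => s + x, 0);
const weights = scores.map((x) => x / total); // nonnegative, sum to 1

// Code vector: weighted average of the combined contexts.
const codeVector = combined[0].map((_, j) =>
  combined.reduce((s, d, i) => s + weights[i] * d[j], 0)
);

// Obfuscation probability: softmax over the scores against y_obf / y_notobf.
const yObf = [0.9, -0.4];
const yNotObf = [-0.5, 0.6];
const eObf = Math.exp(dot(codeVector, yObf));
const eNot = Math.exp(dot(codeVector, yNotObf));
const q = eObf / (eObf + eNot);

// Cross-entropy loss for a sample labeled obfuscated (p = 1).
const loss = -Math.log(q);
console.log(weights, q, loss);
```

Training amounts to adjusting W, alpha, the embedding tables and the two output vectors by gradient descent on this loss, which is what Adam does in the actual model.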
V. MODEL TRAINING AND EVALUATION

Model training and evaluation were performed on a workstation with the following equipment: an Intel Core i7-7700 processor (3.6 GHz, 8 cores), 16 GB of RAM and an NVIDIA GeForce GTX 1080 GPU.

The training dataset was formed as follows: 115504 context samples describing non-obfuscated functions and 117990 context samples describing obfuscated functions, among them 36000 randomly chosen from all samples obfuscated with "javascript-obfuscator", 36000 randomly chosen from all samples obfuscated with "jfogs", 36000 randomly chosen from all samples obfuscated with "UglifyJS2" and 9990 from samples obfuscated with "defendjs". There were 233494 context samples in total.

A set Tp (|Tp| = 776830) of the most popular names of start and terminal context vertices and a set Pp (|Pp| = 1008102) of the most popular paths were obtained from the training sample, so that for each script s at least one context c contains two elements from T and one element from P.

The testing dataset contained 8444 contexts describing obfuscated functions (7655 samples obfuscated with "gnirts" and 789 samples obfuscated with "jfogs") and 8444 contexts describing non-obfuscated functions.

We used precision (11), recall (12) and F1-score (13) as the metrics explaining model performance:

    Precision = TP / (TP + FP),    (11)

    Recall = TP / (TP + FN),    (12)

    F1 = 2 · Precision · Recall / (Precision + Recall),    (13)

where TP, FP and FN are the numbers of true positive, false positive and false negative predictions respectively.

Model training took 9 epochs. The precision, recall and F1-score obtained after training are shown in Table I.

TABLE I. MODEL SCORES

    Metric     Value
    Precision  84.9%
    Recall     85.1%
    F1         85.0%

Our model performed less well than the model proposed by Tellenbach et al. Their model used features reflecting the frequencies of JavaScript keywords and other statistical calculations and had the following evaluation results: precision 95%, recall 90%, F1-score 92% [19].

At the same time, the presented model has sufficient improvement potential, which gives rise to further research on obfuscation detection models that do not rely on pre-calculated statistical features. First of all, adding a second fully connected layer and replacing the activation function with a different one could positively impact the model quality scores.

Beyond that, the code vector v (8) can be passed to the input of an additional classifier, e.g. an SVM-based or Random Forest based one. Ndichu et al. proposed a similar approach to solve the JavaScript malware detection problem using a feedforward neural network [20]. They divided the model training process into two stages: in the first, they trained a neural network classifier based on Doc2vec, and in the second, they passed the fully connected layer output to an SVM. As a result, the SVM was trained on the code embeddings [20]. Their model combining Doc2vec and SVM had the following evaluation results on the obfuscated samples: precision 94%, recall 92% and F1-score 93%.

VI. CONCLUSION

In this paper we explored a JavaScript (ECMAScript 2016) code obfuscation detection method that uses an artificial neural network with an attention mechanism as the classifier. First, a set of samples of obfuscated and non-obfuscated code was obtained using projects and repositories hosted on github.com. Second, an artificial neural network model with an attention mechanism was adapted to solve the problem of classifying scripts by obfuscation. It should also be noted that the non-obfuscated dataset could be checked for the presence of obfuscated samples that had been uploaded to github.com, downloaded during the dataset preparation stage and erroneously labeled as non-obfuscated.

The characteristics of the obtained model show that the considered method, with some improvements, can potentially be used in malicious code detection systems, browser or mobile device fingerprint collection systems and other software that uses obfuscation recognition.

REFERENCES

[1] N.P. Varnovsky, V.A. Zakharov, N.N. Kuzurin, V.A. Shokurov. The current state of art in program obfuscations: definitions of obfuscation security. Proceedings of the Institute for System Programming, vol. 26, issue 3, 2014, pp. 167-198. DOI: 10.15514/ISPRAS-2014-26(3)-9.
[2] Diffie W., Hellman M. New directions in cryptography. IEEE Transactions on Information Theory, IT-22(6), 1976, pp. 644-654.
[3] Collberg C., Thomborson C., Low D. A Taxonomy of Obfuscating Transformations. Technical Report N 148, Univ. of Auckland, 1997.
[4] Liu H., Sun C., Su Z., Jiang Y., Gu M., Sun J. Stochastic optimization of program obfuscation. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 221-231. IEEE, 2017.
[5] Kozachok A., Bochkov M., Tuan L.M. Indistinguishable Obfuscation Security Theoretical Proof. Voprosy kiberbezopasnosti [Cybersecurity issues], 2016, N 1 (14), pp. 36-46.
[6] Markin D., Makeev S. Protection System of Terminal Programs Against Analysis Based on Code Virtualization. Voprosy kiberbezopasnosti [Cybersecurity issues], 2020, N 1 (35), pp. 29-41. DOI: 10.21681/2311-3456-2020-01-29-41.
[7] Schrittwieser S., Katzenbeisser S., Kinder J., Merzdovnik G., Weippl E. Protecting Software through Obfuscation. ACM Computing Surveys, 49(1), 2016, pp. 1-37.
[8] Barak B. Hopes, fears, and software obfuscation. Communications of the ACM, 59(3), 2016, pp. 88-96.
[9] Cesare S., Xiang Y. Software Similarity and Classification. Springer-Verlag, 2012.
[10] Zhang J., Wang X., Zhang H., Sun H., Wang K., Liu X. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering (ICSE), pp. 783-794. IEEE Press, 2019.
[11] Alon U., Zilberstein M., Levy O., Yahav E. A general path-based representation for predicting program properties. ACM SIGPLAN Notices, vol. 53, no. 4, pp. 404-419. ACM, 2018.
[12] Alon U., Zilberstein M., Levy O., Yahav E. code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3, POPL, Article 40, 2019, pp. 1-29.
[13] Martins A., Astudillo R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614-1623, 2016.
[14] Firat O., Cho K., Bengio Y. Multi-way, multilingual neural machine translation with a shared attention mechanism. In 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), pp. 866-875. Association for Computational Linguistics, 2016.
[15] Wang Y., Huang M., Zhao L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606-615, 2016.
[16] Du Q., Gu W., Zhang L., Huang S.-L. Attention-based LSTM-CNNs for Time-series Classification. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pp. 410-411. ACM, 2018.
[17] Zhao D., Chen Y., Lv L. Deep Reinforcement Learning With Visual Attention for Vehicle Classification. IEEE Transactions on Cognitive and Developmental Systems, 9(4), 2017, pp. 356-367.
[18] Kim S., Hori T., Watanabe S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835-4839. IEEE, 2017.
[19] Tellenbach B., Paganoni S., Rennhard M. Detecting obfuscated JavaScripts from known and unknown obfuscators using machine learning. International Journal on Advances in Security, 9(3/4), 2016, pp. 196-206.
[20] Ndichu S., Kim S., Ozawa S., Misu T., Makishima K. A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors. Applied Soft Computing, 84, 2019, p. 105721.