   Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

  JavaScript Programs Obfuscation Detection Method
   that Uses Artificial Neural Network with Attention
                      Mechanism

                                              Grigory Ponomarenko, Petr Klyucharev
                                                  Information Security Department
                                             Bauman Moscow State Technical University
                                                          Moscow, Russia

    Abstract—In this paper, we consider JavaScript code obfuscation detection using an artificial neural network with attention mechanism as the classifier algorithm. Obfuscation is widely used by malware writers who want to obscure malicious intentions, e.g. in exploit kits, and it is also a common component of intellectual property protection systems. Non-obfuscated JavaScript code samples were obtained from the software repository service GitHub. Obfuscated JavaScript code samples were created by obfuscators found on the same service. Before being fed to the network, each JavaScript program is converted to the general path-based representation, i.e. each program is described by a set of paths in its abstract syntax tree. The model proposed in this paper is a feedforward artificial neural network with attention mechanism. We aimed to build a model that relies on AST path structures instead of statistical features. According to the results of experiments, the evaluated model can potentially be implemented, with some improvements, in malicious code detection systems, browser or mobile device fingerprint collection systems, etc.

    Keywords—obfuscation classification, obfuscated code, obfuscation recognition, JavaScript obfuscation, general path-based representation, ECMAScript obfuscation, AST-based pattern recognition

                        I.    INTRODUCTION
    According to Varnovsky et al. [1], obfuscation was first implicitly mentioned in 1976 in the famous Diffie and Hellman paper [2], in which they introduced the concept of asymmetric cryptography. Diffie and Hellman suggested inserting a secret key into the encryption program and then transforming the resulting secret-key-initialized encryption program so that extracting the secret key would be a very difficult task. The concept of obfuscation was explicitly introduced in 1997 in the paper by Collberg, Thomborson and Low [3].
    Han Liu et al. define obfuscation as a special program transformation whose purpose is to obscure source code or binary code in order to hide the implemented algorithms and data structures from being recovered [4]. An obfuscated program is obtained from the original one by applying obfuscation, and therefore the original program is called non-obfuscated [5, 6].
    Schrittwieser et al. remark that at the beginning of the computer era obfuscation was commonly used, in particular, to surprise users by displaying unexpected messages, but today obfuscation is mostly used to protect intellectual property or to obscure malicious intentions [7]. Boaz Barak notes that obfuscation does not make the protected program invincible: an obfuscated program should resist reverse engineering in the same sense that an encryption system should not be breakable using any sensible amount of time and computational resources [8].

                        II.    PROBLEM DEFINITION
    The problem of distinguishing obfuscated programs from non-obfuscated ones is closely linked to source code property prediction and to various types of program classification. To formalize the problem, we use the definitions introduced by Silvio Cesare and Yang Xiang in the first chapter of their book "Software Similarity and Classification" [9].
    Let r be a property of a program p if r is true for all possible execution flows. A program q is called an obfuscated copy of a program p if q is the result of transformations that preserve the semantics (meaning) of the algorithms and data structures. Programs p and q are similar if they are based on the same program.
    Let P be the set of source codes of programs, and let f_1, ..., f_k be functions that extract features from a program, i.e. f_i: P \to D_i, where D_i is the i-th feature set. Let \{p_1, \dots, p_n\} \subset P be the training sample and Y = \{0, 1\} the set of class labels (1 is assigned to obfuscated programs, 0 to non-obfuscated ones). It is necessary to find, using the training sample \{p_1, \dots, p_n\}, a map s: D_1 \times \dots \times D_k \to \{0, 1\} that classifies all elements of P with the smallest error function value.

                        III.    DATASET PREPARATION
    To create a dataset with obfuscated and non-obfuscated JavaScript code samples we used the software repository service GitHub. GitHub is one of the largest service platforms that features software project hosting and collaborative


development. We downloaded the 100 most popular JavaScript projects. To get a list of the most popular projects, we used a special search API (the GitHub Search API) provided by the service. All projects from the resulting list were then cloned to the local machine. Downloading was done on March 22, 2019, and all downloaded projects took up 7.3 GB of disk space. 49612 files with the ".js" extension (excluding files with the ".min.js" extension) were retrieved from the obtained data. In order to simplify the further creation of obfuscated code samples, it was decided to extract functions from the scripts. An example of a simple JavaScript function is shown in fig. 1.
    A Node.js script that extracts functions from a JavaScript program was written using the Esprima library. With this library an abstract syntax tree (AST) can be built for any JavaScript program that complies with the ECMAScript 2016 standard. An abstract syntax tree for a script is formed according to the syntactic rules of the programming language. One can apply the inverse transformation and generate correct program code from the tree. Unlike plain source code, ASTs do not include punctuation, delimiters, comments and some other details, but they can be used to describe the syntactic structure of the script along with lexical information [10]. An abstract syntax tree for the simple JavaScript program (fig. 1) is shown in fig. 2.
    ASTs were built for all previously downloaded scripts with the ".js" extension (49612 samples) using the parseModule method provided by the Esprima API. The program code of each "FunctionDeclaration" element was saved into a separate file during the tree traversal. Hereby 126276 files were produced, each containing a JavaScript function; all files took up 527 MB of disk space.
    To generate obfuscated code samples, we used special programs that implement JavaScript code obfuscation. On the software repository hosting mentioned above we found 6 obfuscators that fit our needs. They are listed below:
    • javascript-obfuscator/javascript-obfuscator
    • zswang/jfogs
    • anseki/gnirts
    • mishoo/UglifyJS2
    • alexhorn/defendjs
    • wearefractal/node-obf
Obfuscators can work in different ways. Some obfuscators do not significantly change the syntactic structure of the program, e.g. jfogs and UglifyJS2, and mostly rename some identifiers and shuffle independent parts. Other obfuscators, such as gnirts or defendjs, completely change the syntactic structure of the scripts.
    The PigeonJS library was used to extract features from the source codes of the JavaScript programs [11]. It provides an API to get a list of paths of a given length on the AST. A path on the AST formed by PigeonJS has the structure called the general path-based representation [11]. The vertices in a path are separated by the symbols v and ^ depending on whether the left vertex is higher or lower in the tree in comparison with the right one. One of the paths retrieved from the basic JavaScript program example (fig. 1) is shown in fig. 3.

Fig. 1. Basic JavaScript function example

Fig. 2. AST of the basic JavaScript function

Fig. 3. One of the paths extracted from the basic JavaScript function AST
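The function-extraction step described above can be sketched as a depth-first walk over ESTree-shaped nodes (the node format Esprima produces). The miniature AST literal below is hand-written for illustration, so the example runs without the Esprima dependency; in the real pipeline it would come from esprima.parseModule:

```javascript
// The AST literal imitates (in simplified form) what Esprima's parseModule
// would return for the script: function add(a, b) { return a + b; } x;
const ast = {
  type: 'Program',
  body: [
    {
      type: 'FunctionDeclaration',
      id: { type: 'Identifier', name: 'add' },
      params: [
        { type: 'Identifier', name: 'a' },
        { type: 'Identifier', name: 'b' },
      ],
      body: { type: 'BlockStatement', body: [] },
    },
    { type: 'ExpressionStatement', expression: { type: 'Identifier', name: 'x' } },
  ],
};

// Depth-first walk over any ESTree-shaped object graph, collecting the
// names of FunctionDeclaration nodes (the elements the paper saves to files).
function collectFunctionNames(node, out = []) {
  if (node === null || typeof node !== 'object') return out;
  if (Array.isArray(node)) {
    node.forEach((child) => collectFunctionNames(child, out));
    return out;
  }
  if (node.type === 'FunctionDeclaration') out.push(node.id.name);
  Object.values(node).forEach((child) => collectFunctionNames(child, out));
  return out;
}

console.log(collectFunctionNames(ast)); // [ 'add' ]
```

In the paper's pipeline the same traversal is applied to real Esprima output, and the source text of each collected declaration is written to its own file.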


     Not all scripts were obfuscated with all of the obfuscators listed above, and contexts were retrieved not from all programs. The main reason was that some of the downloaded JavaScript files contained programs with nonstandard features and extensions. Besides, some obfuscated scripts took up to 1 GB of storage space although the original scripts had sizes of only 200-300 KB. We decided to exclude such samples from the dataset.
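The size-based cleanup described above can be sketched as a simple ratio check. Note that the 100x growth threshold below is an illustrative assumption, not the paper's exact rule:

```javascript
// Keep an obfuscated sample only if its size did not blow up
// disproportionately compared to the original script.
// maxRatio = 100 is a hypothetical cut-off chosen for the example.
const keepSample = (originalBytes, obfuscatedBytes, maxRatio = 100) =>
  obfuscatedBytes <= originalBytes * maxRatio;

console.log(keepSample(250 * 1024, 900 * 1024)); // true: modest growth
console.log(keepSample(250 * 1024, 1024 ** 3));  // false: ~1 GB blow-up of a 250 KB script
```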

                        IV.    MODEL ARCHITECTURE
    The architecture of the artificial neural network used in this study is based on the network proposed by Uri Alon et al. in their paper "code2vec: Learning Distributed Representations of Code" [12]. The researchers attempted to create an artificial neural network that predicts method names for programs written in Java. They got excellent results: at the time of the article's publication, they had the best percentage of correctly named methods among all known studies, about 60%. So we decided to adapt that network to solving the JavaScript code obfuscation recognition problem.
    The main objects the network works with are script contexts. A context s_i = (x_s, p, x_t) is a tuple containing three elements: the start vertex, the path, and the terminal vertex. The start vertex x_s and the terminal vertex x_t are elements of the vertex set T. The path p is an element of the path set P. Every JavaScript program (no matter whether it is obfuscated or non-obfuscated) is described by a set of contexts:

        S = \{ s_1, s_2, \dots, s_m \}                                            (2)

    Each element x of the vertex set T has its own vector representation v_x \in Vect_T (a 128-dimensional rational vector). Similarly, each element p of the path set P has its own vector representation v_p \in Vect_P (a 128-dimensional rational vector). In that way each context has a 384-dimensional vector representation that looks like this:

        c_i = [\, v_{x_s};\ v_p;\ v_{x_t} \,] \in \mathbb{R}^{384}                (3)

    Then each script is described by a tuple of context vector representations:

        (c_1, c_2, \dots, c_m)                                                    (4)

    The maximum number of contexts per script was 200. If there were fewer than 200 contexts for some script, the context tuple was padded with zero-filled contexts:

        (c_1, \dots, c_m, 0, \dots, 0), with 200 components in total              (5)

    The artificial neural network architecture used in this research is shown in fig. 4. First of all, there is a fully connected layer to which the Dropout regularization method is applied. Thanks to this, 75% of randomly chosen neurons are ignored (not considered during the forward pass) on each epoch. This helps to prevent over-fitting to the training data and increases model performance on unobserved samples.

Fig. 4. Neural network architecture scheme

    The fully connected layer has a tanh activation function:

        d_i = \tanh(W \cdot c_i)                                                  (6)

where c_i \in \mathbb{R}^{384} is a context vector representation, d_i \in \mathbb{R}^{128} is a combined context vector representation, and W \in \mathbb{R}^{128 \times 384} is the fully connected layer weights matrix.
    Based on the combined contexts \{d_1, \dots, d_{200}\} and the attention vector \alpha, the attention weights \alpha_i are calculated for each d_i. The vector \alpha is initialized with random values and updated during the training:

        \alpha_i = \exp(d_i^T \alpha) \,/\, \sum_{j=1}^{200} \exp(d_j^T \alpha)   (7)

    Obviously, the sum of all \alpha_i equals 1. After that a code vector v is calculated using the attention weights \alpha_i as follows:

        v = \sum_{i=1}^{200} \alpha_i d_i                                         (8)

    Since all attention weights \alpha_i are nonnegative and their sum equals 1, we can consider the calculation of the code vector as the calculation of a weighted average over all combined contexts d_i.
    The idea behind the attention mechanism can be described as choosing the most interesting part of the resulting set. The softmax transformation (7) is a key component of several statistical learning models, but recently it has also been used to design attention mechanisms in neural networks [13]. Attention mechanisms are used to solve various applied problems with the help of artificial neural networks, e.g. multilingual translation [14], sentiment classification [15], time-series classification [16], vehicle image classification [17] or speech recognition [18].
    At the last step the final decision, i.e. whether an obfuscated or a non-obfuscated script was passed to the network input, is calculated using the 128-dimensional real vectors y_obf and y_notobf. These vectors are initialized randomly and updated during model training. The JavaScript program obfuscation probability q(v) is calculated based on the code vector (8):

        q(v) = \exp(v^T y_{obf}) \,/\, \big( \exp(v^T y_{obf}) + \exp(v^T y_{notobf}) \big)    (9)

    If q(v) > 0.5, then the script is considered obfuscated. The script non-obfuscation probability is estimated as 1 - q(v), respectively. For one script, the loss function (the cross-entropy function) is computed as follows:

        L = -\big( p(v) \log q(v) + (1 - p(v)) \log(1 - q(v)) \big)                            (10)

where p(v) = 1 for obfuscated scripts and p(v) = 0 for non-obfuscated scripts. To minimize the loss function, the adaptive moment estimation method (Adam) was used as the optimization algorithm.

                        V.    MODEL TRAINING AND EVALUATION
    Model training and evaluation were performed on a workstation with the following equipment: Intel Core i7-7700 processor (3.6 GHz) with 8 cores, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 GPU. The training dataset was formed as follows: 115504 context samples describing non-obfuscated functions and 117990 context samples describing obfuscated functions, among them 36000 randomly chosen from all samples obfuscated with "javascript-obfuscator", 36000 randomly chosen from all samples obfuscated with "jfogs", 36000 randomly chosen from all samples obfuscated with "UglifyJS2" and 9990 from samples obfuscated with "defendjs". There were 233494 context samples in total.
    A set Tp (|Tp| = 776830) of the most popular names of start and terminal context vertices and a set Pp (|Pp| = 1008102) of the most popular paths were obtained from the training sample, so that for each script s at least one context c contains two elements from Tp and one element from Pp.
    The testing dataset contained 8444 contexts describing obfuscated functions (7655 samples obfuscated with "gnirts" and 789 samples obfuscated with "jfogs") and 8444 contexts describing non-obfuscated functions.
    We decided to use precision (11), recall (12) and F1-score (13) as the model evaluation metrics explaining model performance:

        Precision = TP / (TP + FP)                                                             (11)

        Recall = TP / (TP + FN)                                                                (12)

        F_1 = 2 \cdot Precision \cdot Recall / (Precision + Recall)                            (13)

    Model training was 9 epochs long. The precision, recall and F1-score obtained after training completion are shown in Table I.

                        TABLE I.    MODEL SCORES

        Metric        Value
        Precision     84.9%
        Recall        85.1%
        F1            85.0%

    Our model showed worse performance than the model proposed by Tellenbach et al. Their model used features reflecting the frequencies of JavaScript keywords and other statistical calculations and had the following evaluation results: precision 95%, recall 90%, F1-score 92% [19].
    At the same time, the presented model has sufficient improvement potential, which gives rise to further research on obfuscation detection models that do not rely on pre-calculated statistical features. First of all, adding a second fully connected layer and replacing the activation function with a different one could positively impact the model quality scores.
    Beyond that, the code vector v (8) can be passed to the input of an additional classifier, e.g. an SVM-based or Random Forest based one. A similar approach was proposed by Ndichu et al. to solve the JavaScript malware detection problem using a feedforward neural network [20]. They divided the model training process into two stages: in the first, they trained a neural network classifier based on Doc2vec, and in the second, they passed the fully connected layer output to an SVM. As a result, the SVM was trained on the code embeddings [20]. Their model combining Doc2vec and SVM had the following evaluation results: precision 94%, recall 92% and F1-score 93% on the obfuscated samples.

                        VI.    CONCLUSION
    In this paper we explored a JavaScript (ECMAScript 2016) code obfuscation detection method that uses an artificial neural network with attention mechanism as the classifier algorithm.
    First of all, a set of samples of obfuscated and non-obfuscated code was obtained using projects and repositories hosted on GitHub. Secondly, an artificial neural network model with an attention mechanism was adapted to solve the problem of classifying scripts on the basis of obfuscation. Thirdly, the non-obfuscated part of the dataset could additionally be checked for the presence of already obfuscated samples that had been uploaded to GitHub repositories, downloaded during the dataset preparation stage and erroneously labeled as non-obfuscated.
    The characteristics of the obtained model show that the considered method can potentially be implemented, with some improvements, in malicious code detection systems, browser or mobile device fingerprint collection systems or other software that uses obfuscation recognition.
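As a quick sanity check on Table I: the F1-score is the harmonic mean of precision and recall, so the reported 84.9% precision and 85.1% recall should indeed give about 85.0%:

```javascript
// F1 as the harmonic mean of precision and recall.
const f1 = (precision, recall) =>
  (2 * precision * recall) / (precision + recall);

console.log(f1(0.849, 0.851).toFixed(3)); // "0.850", matching Table I
```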


                              REFERENCES

[1]  N.P. Varnovsky, V.A. Zakharov, N.N. Kuzurin, V.A. Shokurov. The current state of art in program obfuscations: definitions of obfuscation security. Proceedings of the Institute for System Programming, vol. 26, issue 3, 2014, pp. 167-198. DOI: 10.15514/ISPRAS-2014-26(3)-9.
[2]  Diffie W., Hellman M. New directions in cryptography // IEEE Transactions on Information Theory, IT-22(6), 1976, pp. 644-654.
[3]  Collberg C., Thomborson C., Low D. A Taxonomy of Obfuscating Transformations // Technical Report, N 148, Univ. of Auckland, 1997.
[4]  Liu, H., Sun, C., Su, Z., Jiang, Y., Gu, M. and Sun, J., 2017, May. Stochastic optimization of program obfuscation. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 221-231. IEEE.
[5]  Kozachok A., Bochkov M., Tuan L.M. Indistinguishable Obfuscation Security Theoretical Proof. Voprosy kiberbezopasnosti [Cybersecurity issues], 2016, N 1 (14), pp. 36-46.
[6]  Markin D., Makeev S. Protection System of Terminal Programs Against Analysis Based on Code Virtualization. Voprosy kiberbezopasnosti [Cybersecurity issues], 2020, N 1 (35), pp. 29-41. DOI: 10.21681/2311-3456-2020-01-29-41.
[7]  Schrittwieser, S., Katzenbeisser, S., Kinder, J., Merzdovnik, G., & Weippl, E. (2016). Protecting Software through Obfuscation. ACM Computing Surveys, 49(1), 1-37.
[8]  Barak, B. (2016). Hopes, fears, and software obfuscation. Commun. ACM, 59(3), 88-96.
[9]  Silvio Cesare, Yang Xiang. Software Similarity and Classification. Springer-Verlag, 2012.
[10] Zhang, Jian, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. "A novel neural source code representation based on abstract syntax tree." In Proceedings of the 41st International Conference on Software Engineering, pp. 783-794. IEEE Press, 2019.
[11] Alon, Uri, Meital Zilberstein, Omer Levy, Eran Yahav. A general path-based representation for predicting program properties. ACM SIGPLAN Notices, vol. 53, no. 4, pp. 404-419. ACM, 2018.
[12] Alon, Uri, Meital Zilberstein, Omer Levy, Eran Yahav. code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3, POPL, Article 40, 2019, pp. 1-29.
[13] Martins, Andre, and Ramon Astudillo. "From softmax to sparsemax: A sparse model of attention and multi-label classification." In International Conference on Machine Learning, pp. 1614-1623. 2016.
[14] Firat, Orhan, Kyunghyun Cho, and Yoshua Bengio. "Multi-way, multilingual neural machine translation with a shared attention mechanism." In 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, pp. 866-875. Association for Computational Linguistics (ACL), 2016.
[15] Wang, Yequan, Minlie Huang, and Li Zhao. "Attention-based LSTM for aspect-level sentiment classification." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606-615. 2016.
[16] Du, Qianjin, Weixi Gu, Lin Zhang, and Shao-Lun Huang. "Attention-based LSTM-CNNs For Time-series Classification." In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pp. 410-411. ACM, 2018.
[17] Zhao, D., Chen, Y., & Lv, L. (2017). Deep Reinforcement Learning With Visual Attention for Vehicle Classification. IEEE Transactions on Cognitive and Developmental Systems, 9(4), 356-367.
[18] Kim, Suyoun, Takaaki Hori, and Shinji Watanabe. "Joint CTC-attention based end-to-end speech recognition using multi-task learning." In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835-4839. IEEE, 2017.
[19] Tellenbach B., Paganoni S., Rennhard M. Detecting obfuscated JavaScripts from known and unknown obfuscators using machine learning. International Journal on Advances in Security, 2016, 9(3/4), pp. 196-206.
[20] Ndichu, S., Kim, S., Ozawa, S., Misu, T. and Makishima, K., 2019. A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors. Applied Soft Computing, 84, p. 105721.