Authors Grigory Ponomarenko Petr Klyucharev
License CC-BY-4.0
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

JavaScript Programs Obfuscation Detection Method that Uses Artificial Neural Network with Attention Mechanism

Grigory Ponomarenko, Petr Klyucharev
Information Security Department, Bauman Moscow State Technical University, Moscow, Russia
gs.ponomarenko@yandex.ru; pk.iu8@yandex.ru

Abstract—In this paper, we consider a JavaScript code obfuscation detection method that uses an artificial neural network with an attention mechanism as the classifier. Obfuscation is widely used by malware writers who want to obscure malicious intentions, e.g. in exploit kits, and it is also a common component of intellectual property protection systems. Non-obfuscated JavaScript code samples were obtained from the software repository service Github.com; obfuscated samples were created with obfuscators found on the same service. Before being fed to the network, each JavaScript program is converted to the general path-based representation, i.e. each program is described by a set of paths in its abstract syntax tree. The model proposed in this paper is a feedforward artificial neural network with an attention mechanism. We aimed to build a model that relies on the structure of AST paths instead of statistical features. According to the experimental results, the evaluated model, with some improvements, can potentially be used in malicious code detection systems, browser or mobile device fingerprint collection systems, etc.

Keywords—obfuscation classification, obfuscated code, obfuscation recognition, JavaScript obfuscation, general path-based representation, ECMAScript obfuscation, AST-based pattern recognition

I. INTRODUCTION

According to Varnovsky et al. [1], obfuscation was first implicitly mentioned in 1976 in the famous Diffie and Hellman paper [2] that introduced the concept of asymmetric cryptography. Diffie and Hellman suggested inserting a secret key into an encryption program and then transforming the key-initialized program in such an intricate way that extracting the secret key would become a very difficult task. The concept of obfuscation was explicitly introduced in 1997 in the paper by Collberg, Thomborson and Low [3].

Han Liu et al. define obfuscation as a special program transformation whose purpose is to obscure source code or binary code in order to prevent the implemented algorithms and data structures from being recovered [4]. An obfuscated program is obtained from the original one by applying obfuscation; the original program is therefore called non-obfuscated [5, 6].

Schrittwieser et al. remark that at the beginning of the computer era obfuscation was commonly used, in particular, to surprise users by displaying unexpected messages, whereas today obfuscation is mostly used to protect intellectual property or to obscure malicious intentions [7]. Boaz Barak notes that obfuscation does not make the protected program invincible: an obfuscated program should resist reverse engineering in the same sense that an encryption system should not be breakable with any sensible amount of time and computational resources [8].

II. PROBLEM DEFINITION

The problem of distinguishing obfuscated from non-obfuscated programs is closely linked to source code property prediction and to other kinds of program classification. To formalize the problem, we use the definitions introduced by Silvio Cesare and Yang Xiang in the first chapter of their book "Software Similarity and Classification" [9].

Let r be a property of a program p if r is true for all possible execution flows. A program q is called an obfuscated copy of a program p if q is the result of transformations that preserve the semantics (meaning) of the algorithms and data structures. Programs p and q are similar if they are based on the same program.

Let P be the set of program source codes, and let f1, ..., fk be functions that extract features from a program, i.e. fi: P → Di, where Di is the i-th feature set. Let {p1, ..., pn} ⊂ P be the training sample and Y = {0, 1} the class labels (1 is assigned to obfuscated programs, 0 to non-obfuscated ones). It is necessary to find, using the training sample {p1, ..., pn}, a map s: D1 × ... × Dk → {0, 1} that classifies all elements of P with the smallest error function value.
III. DATASET PREPARATION

To create a dataset with obfuscated and non-obfuscated JavaScript code samples, we used the software repository service github.com, one of the largest platforms for software project hosting and collaborative development. The 100 most popular JavaScript projects were downloaded. To get the list of the most popular projects, we used the search API provided by the service (referred to as the Github Search API). All projects from the resulting list were then cloned to a local machine. Downloading was done on March 22, 2019, and all downloaded projects took up 7.3 GB of disk space. 49612 files with the ".js" extension (excluding files with the ".min.js" extension) were retrieved from the obtained data. In order to simplify the further creation of obfuscated code samples, it was decided to extract individual functions from the scripts. An example of a simple JavaScript function is shown in fig. 1.

A Node.js script that retrieves functions from a JavaScript program was written using the Esprima library. With this library an abstract syntax tree (AST) can be built for any JavaScript program that complies with the ECMAScript 2016 standard. An abstract syntax tree for a script is formed according to the syntactic rules of the programming language, and the inverse transformation can be applied to generate correct program code from the tree. Unlike plain source code, ASTs do not include punctuation, delimiters, comments and some other details, but they describe the syntactic structure of the script along with lexical information [10]. The abstract syntax tree of the simple JavaScript program from fig. 1 is shown in fig. 2.

ASTs were built for all previously downloaded ".js" scripts (49612 samples) using the parseModule method provided by the Esprima API. During tree traversal, the program code of each "FunctionDeclaration" element was saved into a separate file. Thereby 126276 files were produced, each containing a single JavaScript function; all files took up 527 MB of disk space.

To generate obfuscated code samples, we used special programs that implement JavaScript code obfuscation. On the above-mentioned github.com repository hosting we found 6 obfuscators that fit our needs:

• javascript-obfuscator/javascript-obfuscator
• zswang/jfogs
• anseki/gnirts
• mishoo/UglifyJS2
• alexhorn/defendjs
• wearefractal/node-obf

Obfuscators can work in different ways. Some of them, e.g. jfogs and UglifyJS2, do not significantly change the syntactic structure of the program and mostly rename identifiers and shuffle independent parts. Other obfuscators, such as gnirts or defendjs, completely change the syntactic structure of the scripts.

The PigeonJS library was used to extract features from the source code of the JavaScript programs [11]. It provides an API to get the list of paths of a given length over the AST. A path over the AST formed by PigeonJS has the following structure, called the general path-based representation [11]:

    n1 s1 n2 s2 ... s(k-1) nk,    (1)

where n1, ..., nk are AST vertices. The vertices are separated by "v" and "^" depending on whether the left vertex is higher or lower on the tree in comparison with the right one. One of the paths retrieved from the basic JavaScript program example (fig. 1) is shown in fig. 3.

Fig. 1. Basic JavaScript function example
Fig. 2. AST of the basic JavaScript function
Fig. 3. One of the paths extracted from the basic JavaScript function AST
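The path extraction sketched above can be illustrated without any parser dependency. The snippet below is an approximation in the spirit of the general path-based representation, not the PigeonJS implementation: it walks a hand-built fragment of an Esprima-style AST (the `children` field and the leaf labeling are simplifications introduced here) and joins two leaves through their lowest common ancestor, emitting "^" while climbing and "v" while descending.

```javascript
// Hand-built AST fragment (simplified) for: function f(a) { return a; }
const ast = {
  type: 'FunctionDeclaration',
  children: [
    { type: 'Identifier', name: 'f', children: [] },
    { type: 'Identifier', name: 'a', children: [] },
    {
      type: 'BlockStatement',
      children: [
        {
          type: 'ReturnStatement',
          children: [{ type: 'Identifier', name: 'a', children: [] }],
        },
      ],
    },
  ],
};

// Collect the chain of nodes from the root down to every leaf.
function leafChains(node, trail = []) {
  const here = [...trail, node];
  if (node.children.length === 0) return [here];
  return node.children.flatMap((c) => leafChains(c, here));
}

// Join two root-to-leaf chains at their lowest common ancestor:
// '^' while ascending from the first leaf, 'v' while descending to the second.
function pathBetween(chainA, chainB) {
  let i = 0;
  while (i < chainA.length && i < chainB.length && chainA[i] === chainB[i]) i++;
  const up = chainA.slice(i - 1).reverse(); // [leafA, ..., LCA]
  const down = chainB.slice(i);             // [below LCA, ..., leafB]
  const label = (n) => n.name || n.type;    // leaves by token, inner nodes by type
  let s = label(up[0]);
  for (let k = 1; k < up.length; k++) s += ' ^ ' + label(up[k]);
  for (const n of down) s += ' v ' + label(n);
  return s;
}

const chains = leafChains(ast);
// Path between the function name 'f' and the returned identifier 'a':
console.log(pathBetween(chains[0], chains[2]));
// -> f ^ FunctionDeclaration v BlockStatement v ReturnStatement v a
```

In the real representation the extractor also limits path length and width; here both leaves and the separator convention follow the description around equation (1).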
Not all scripts were obfuscated with every obfuscator listed above, and contexts were retrieved not from all programs. The main reason is that some of the downloaded JavaScript files contained programs with nonstandard features and extensions. Besides, some obfuscated scripts took up to 1 GB of storage space although the original scripts were only 200-300 KB in size. We decided to remove such samples from the dataset.

IV. NEURAL NETWORK ARCHITECTURE

The architecture of the artificial neural network used in this study is based on the network proposed by Uri Alon et al. in their paper "code2vec: Learning Distributed Representations of Code" [12]. The researchers built an artificial neural network that predicts method names for programs written in Java and obtained excellent results: at the time of publication, they had the best percentage of correctly named methods among all known studies, about 60%. So we decided to adapt that network to the JavaScript code obfuscation recognition problem.

Fig. 4. Neural network architecture scheme

The main objects the network works with are script contexts. A context ci = (xs, p, xt) is a tuple of three elements: the start vertex, the path, and the terminal vertex. The start vertex xs and the terminal vertex xt are elements of the set T of start and terminal vertices; the path p is an element of the path set P. Every JavaScript program (no matter whether it is obfuscated or not) is described by a set of contexts:

    {c1, ..., cn}, ci = (xs, p, xt).    (2)

Each element x of the vertex set T has its own vector representation vx in VectT (a 128-dimensional real vector). Similarly, each element p of the path set P has its own vector representation vp in VectP (a 128-dimensional real vector). In that way each context has a 384-dimensional vector representation formed by concatenation:

    vi = (vxs, vp, vxt).    (3)

Then each script is described by a tuple of context vector representations:

    (v1, v2, ..., v200).    (4)

The maximum number of contexts per script was 200. If a script had fewer than 200 contexts, the tuple was padded with zero-filled contexts:

    (v1, ..., vn, 0, ..., 0), n < 200.    (5)

The artificial neural network architecture used in this research is shown in fig. 4. First, there is a fully connected layer to which the Dropout regularization method is applied: 75% of randomly chosen neurons are ignored (not considered during the forward pass) on each epoch. This helps to prevent overfitting and improves model performance on unseen samples.

The fully connected layer has the tanh activation function:

    di = tanh(W · vi),    (6)

where vi is a context vector representation, di is a combined context vector representation and W is the fully connected layer weight matrix.

Based on the combined contexts {d1, ..., d200} and the attention vector α, the attention weight αi is calculated for each di. The vector α is initialized with random values and updated during training:

    αi = exp(diT · α) / Σj exp(djT · α).    (7)

Obviously, the sum of all αi equals 1. After that, a code vector v is calculated using the attention weights αi as follows:

    v = Σi αi · di.    (8)

Since all attention weights αi are nonnegative and sum to 1, the calculation of the code vector can be seen as a weighted average over all combined contexts di.

The idea behind the attention mechanism can be described as choosing the most interesting part of the resulting set. The softmax transformation (7) is a key component of several statistical learning models, but recently it has also been used to design attention mechanisms in neural networks [13]. Attention mechanisms are used to solve various applied problems with artificial neural networks, e.g. multilingual translation [14], sentiment classification [15], time-series classification [16], vehicle image classification [17] or speech recognition [18].

At the last step, the final decision on whether an obfuscated or a non-obfuscated script was passed to the network input is calculated using two 128-dimensional real vectors yobf and ynotobf, which are initialized randomly and updated during model training. The JavaScript program obfuscation probability q(v) is calculated based on the code vector v (8):

    q(v) = exp(vT · yobf) / (exp(vT · yobf) + exp(vT · ynotobf)).    (9)

If q(v) > 0.5, the script is considered obfuscated; the probability that the script is non-obfuscated is estimated as 1 − q(v), respectively. For one script, the loss function (cross-entropy) is computed as follows:

    L = −(p(v) · log q(v) + (1 − p(v)) · log(1 − q(v))),    (10)

where p(v) = 1 for obfuscated scripts and p(v) = 0 for non-obfuscated scripts. To minimize the loss function, adaptive moment estimation (Adam) was used as the optimization algorithm.
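The forward pass of this architecture (combined contexts, softmax attention, weighted-average code vector, two-vector output, cross-entropy loss) can be sketched numerically. The snippet below is a toy illustration, not the trained model: the dimensions (3 contexts, 4-dimensional context vectors, 2-dimensional combined vectors) and all weight values are made up.

```javascript
// Toy forward pass for the attention classifier; all weights are invented.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
const matVec = (M, x) => M.map((row) => dot(row, x));

// Combined contexts: d_i = tanh(W * v_i); here W maps R^4 -> R^2.
const W = [
  [0.1, -0.2, 0.3, 0.0],
  [0.05, 0.1, -0.1, 0.2],
];
const contexts = [
  [1, 0, 0, 1],
  [0, 1, 1, 0],
  [1, 1, 0, 0],
]; // three context vectors v_i
const combined = contexts.map((v) => matVec(W, v).map(Math.tanh));

// Attention weights: softmax over d_i^T * alpha.
const alpha = [0.7, -0.3];
const scores = combined.map((d) => Math.exp(dot(d, alpha)));
const total = scores.reduce((s, x) => s + x, 0);
const weights = scores.map((x) => x / total); // nonnegative, sum to 1

// Code vector: weighted average of the combined contexts.
const codeVector = combined[0].map((_, j) =>
  combined.reduce((s, d, i) => s + weights[i] * d[j], 0)
);

// Obfuscation probability: softmax over the scores against y_obf / y_notobf.
const yObf = [0.9, -0.4];
const yNotObf = [-0.5, 0.6];
const eObf = Math.exp(dot(codeVector, yObf));
const eNot = Math.exp(dot(codeVector, yNotObf));
const q = eObf / (eObf + eNot);

// Cross-entropy loss for a sample labeled obfuscated (p = 1).
const loss = -Math.log(q);
console.log(weights, q, loss);
```

Training amounts to adjusting W, alpha, the embedding tables and the two output vectors by gradient descent on this loss, which is what Adam does in the actual model.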
V. MODEL TRAINING AND EVALUATION

Model training and evaluation were performed on a workstation with the following equipment: an Intel Core i7-7700 processor (3.6 GHz, 8 cores), 16 GB of RAM and an NVIDIA GeForce GTX 1080 GPU.

The training dataset was formed as follows: 115504 context samples describing non-obfuscated functions and 117990 context samples describing obfuscated functions, among them 36000 randomly chosen from all samples obfuscated with "javascript-obfuscator", 36000 randomly chosen from all samples obfuscated with "jfogs", 36000 randomly chosen from all samples obfuscated with "UglifyJS2" and 9990 from samples obfuscated with "defendjs". There were 233494 context samples in total.

A set Tp (|Tp| = 776830) of the most popular names of start and terminal context vertices and a set Pp (|Pp| = 1008102) of the most popular paths were obtained from the training sample, so that for each script s at least one context c contains two elements from T and one element from P.

The testing dataset contained 8444 contexts describing obfuscated functions (7655 samples obfuscated with "gnirts" and 789 samples obfuscated with "jfogs") and 8444 contexts describing non-obfuscated functions.

We used precision (11), recall (12) and F1-score (13) as the metrics explaining model performance:

    Precision = TP / (TP + FP),    (11)

    Recall = TP / (TP + FN),    (12)

    F1 = 2 · Precision · Recall / (Precision + Recall),    (13)

where TP, FP and FN are the numbers of true positive, false positive and false negative predictions respectively.

Model training took 9 epochs. The precision, recall and F1-score obtained after training are shown in Table I.

TABLE I. MODEL SCORES

    Metric     Value
    Precision  84.9%
    Recall     85.1%
    F1         85.0%

Our model performed less well than the model proposed by Tellenbach et al. Their model used features reflecting the frequencies of JavaScript keywords and other statistical calculations and had the following evaluation results: precision 95%, recall 90%, F1-score 92% [19].

At the same time, the presented model has sufficient improvement potential, which gives rise to further research on obfuscation detection models that do not rely on pre-calculated statistical features. First of all, adding a second fully connected layer and replacing the activation function with a different one could positively impact the model quality scores.

Beyond that, the code vector v (8) can be passed to the input of an additional classifier, e.g. an SVM-based or Random Forest based one. Ndichu et al. proposed a similar approach to solve the JavaScript malware detection problem using a feedforward neural network [20]. They divided the model training process into two stages: in the first, they trained a neural network classifier based on Doc2vec, and in the second, they passed the fully connected layer output to an SVM. As a result, the SVM was trained on the code embeddings [20]. Their model combining Doc2vec and SVM had the following evaluation results on the obfuscated samples: precision 94%, recall 92% and F1-score 93%.

VI. CONCLUSION

In this paper we explored a JavaScript (ECMAScript 2016) code obfuscation detection method that uses an artificial neural network with an attention mechanism as the classifier. First, a set of samples of obfuscated and non-obfuscated code was obtained using projects and repositories hosted on github.com. Second, an artificial neural network model with an attention mechanism was adapted to solve the problem of classifying scripts by obfuscation. It should also be noted that the non-obfuscated dataset could be checked for the presence of obfuscated samples that had been uploaded to github.com, downloaded during the dataset preparation stage and erroneously labeled as non-obfuscated.

The characteristics of the obtained model show that the considered method, with some improvements, can potentially be used in malicious code detection systems, browser or mobile device fingerprint collection systems and other software that uses obfuscation recognition.

REFERENCES

[1] N.P. Varnovsky, V.A. Zakharov, N.N. Kuzurin, V.A. Shokurov. The current state of art in program obfuscations: definitions of obfuscation security. Proceedings of the Institute for System Programming, vol. 26, issue 3, 2014, pp. 167-198. DOI: 10.15514/ISPRAS-2014-26(3)-9.
[2] Diffie W., Hellman M. New directions in cryptography. IEEE Transactions on Information Theory, IT-22(6), 1976, pp. 644-654.
[3] Collberg C., Thomborson C., Low D. A Taxonomy of Obfuscating Transformations. Technical Report N 148, Univ. of Auckland, 1997.
[4] Liu H., Sun C., Su Z., Jiang Y., Gu M., Sun J. Stochastic optimization of program obfuscation. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 221-231. IEEE, 2017.
[5] Kozachok A., Bochkov M., Tuan L.M. Indistinguishable Obfuscation Security Theoretical Proof. Voprosy kiberbezopasnosti [Cybersecurity issues], 2016, N 1 (14), pp. 36-46.
[6] Markin D., Makeev S. Protection System of Terminal Programs Against Analysis Based on Code Virtualization. Voprosy kiberbezopasnosti [Cybersecurity issues], 2020, N 1 (35), pp. 29-41. DOI: 10.21681/2311-3456-2020-01-29-41.
[7] Schrittwieser S., Katzenbeisser S., Kinder J., Merzdovnik G., Weippl E. Protecting Software through Obfuscation. ACM Computing Surveys, 49(1), 2016, pp. 1-37.
[8] Barak B. Hopes, fears, and software obfuscation. Communications of the ACM, 59(3), 2016, pp. 88-96.
[9] Cesare S., Xiang Y. Software Similarity and Classification. Springer-Verlag, 2012.
[10] Zhang J., Wang X., Zhang H., Sun H., Wang K., Liu X. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering (ICSE), pp. 783-794. IEEE Press, 2019.
[11] Alon U., Zilberstein M., Levy O., Yahav E. A general path-based representation for predicting program properties. ACM SIGPLAN Notices, vol. 53, no. 4, pp. 404-419. ACM, 2018.
[12] Alon U., Zilberstein M., Levy O., Yahav E. code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3, POPL, Article 40, 2019, pp. 1-29.
[13] Martins A., Astudillo R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614-1623, 2016.
[14] Firat O., Cho K., Bengio Y. Multi-way, multilingual neural machine translation with a shared attention mechanism. In 15th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), pp. 866-875. Association for Computational Linguistics, 2016.
[15] Wang Y., Huang M., Zhao L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606-615, 2016.
[16] Du Q., Gu W., Zhang L., Huang S.-L. Attention-based LSTM-CNNs for Time-series Classification. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pp. 410-411. ACM, 2018.
[17] Zhao D., Chen Y., Lv L. Deep Reinforcement Learning With Visual Attention for Vehicle Classification. IEEE Transactions on Cognitive and Developmental Systems, 9(4), 2017, pp. 356-367.
[18] Kim S., Hori T., Watanabe S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835-4839. IEEE, 2017.
[19] Tellenbach B., Paganoni S., Rennhard M. Detecting obfuscated JavaScripts from known and unknown obfuscators using machine learning. International Journal on Advances in Security, 9(3/4), 2016, pp. 196-206.
[20] Ndichu S., Kim S., Ozawa S., Misu T., Makishima K. A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors. Applied Soft Computing, 84, 2019, p. 105721.