The paper assesses the presence of vulnerable JS functions in active, real-world JavaScript projects. More than 9 million JS functions were tested against a dataset of vulnerable JS functions built from VulnCode-DB and Snyk. The authors focus on prototype pollution and ReDoS (Regular Expression Denial of Service), because vulnerable implementation patterns can be detected automatically for these issues. The core idea is that a real project that includes a vulnerable library is not necessarily vulnerable itself: it is vulnerable only if the vulnerable function is (i) actually used in the project and (ii) reachable from user input. Pattern-based and textual-similarity-based approaches, together with a multi-file static taint analysis, were used to assess the prevalence of prototype pollution and ReDoS.
Context
The study of vulnerabilities in JS dependencies has led to many projects being flagged as vulnerable because they include vulnerable dependencies. In reality, many such projects (73%) are not actually vulnerable, because they never use the vulnerable functions.
There are too few reliable datasets of vulnerable code, which makes it difficult to train models or to test other mitigation solutions properly.
The objective of our study is to assess the presence of vulnerable JS functions in active JavaScript projects in the real world. We gather JS from three sources:
- NPM packages
- Chrome web extensions
- top popular websites
We then test more than 9 million JS functions against a dataset of vulnerable JS functions that we build
Approach
Creation of dataset of vulnerable functions
- we automatically collect vulnerable JavaScript functions from Snyk and VulnCodeDB (vulnerability databases) to compose an up-to-date dataset. Only entries that include a link to the source code were taken into consideration
- Test files, empty functions, and cases where the vulnerable and fixed functions were identical were ruled out
- Almost 5000 functions were found (895 entries), but only ReDoS (121 entries) and prototype pollution (101 entries) were considered
- 150 entries were manually verified (using a web application developed specifically to simplify this process) and studied to identify patterns
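The exclusion criteria above can be sketched as a simple filter. This is only an illustration: the entry shape (`file`, `vulnerable`, `fixed`), the helper names, the path heuristics for test files, and the whitespace normalization before comparing the two versions are all assumptions, not the authors' implementation.

```javascript
// Hypothetical entry shape: { file, vulnerable, fixed }, where `vulnerable`
// and `fixed` hold the function source before and after the patch.

// Heuristic (assumption): a file is a test file if it sits in a test/spec
// directory or is named *.test.js / *.spec.js.
function isTestFile(path) {
  return /(^|\/)(tests?|__tests__|spec)\//i.test(path) ||
    /\.(test|spec)\.js$/i.test(path);
}

// Heuristic (assumption): a function is empty if nothing but whitespace
// lies between its first "{" and its last "}".
function isEmptyFunction(source) {
  const open = source.indexOf("{");
  const close = source.lastIndexOf("}");
  if (open === -1 || close < open) return source.trim() === "";
  return source.slice(open + 1, close).trim() === "";
}

// Collapse whitespace so that formatting-only changes count as identical.
function normalize(source) {
  return source.replace(/\s+/g, " ").trim();
}

function keepEntry(entry) {
  if (isTestFile(entry.file)) return false;
  if (isEmptyFunction(entry.vulnerable)) return false;
  if (normalize(entry.vulnerable) === normalize(entry.fixed)) return false;
  return true;
}

// Made-up entries for illustration only.
const entries = [
  { file: "lib/merge.js",
    vulnerable: "function merge(a, b, k) { a[k] = b[k]; }",
    fixed: "function merge(a, b, k) { if (k !== '__proto__') a[k] = b[k]; }" },
  { file: "test/merge.test.js",
    vulnerable: "function t() { run(); }",
    fixed: "function t() { runFixed(); }" },
  { file: "lib/noop.js",
    vulnerable: "function noop() {}",
    fixed: "function noop() { return 1; }" },
  { file: "lib/same.js",
    vulnerable: "function f() { return 1; }",
    fixed: "function f() {  return 1; }" },
];

const keptFiles = entries.filter(keepEntry).map((e) => e.file);
```

With these sample entries, only the first one passes all three checks.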
Identification and formalization of vulnerability patterns in ReDoS and prototype pollution (PP)
- new rules to detect ReDoS and prototype pollution were created with Semgrep through an iterative process: a rule was drafted, the Semgrep script was run on some entries, some functions were flagged, and the rule was then refined to flag more functions, …
- all rules are available at github.com/Marynk/JavaScript-vulnerability-detection/tree/main/semgrep (e.g., object[key] = value for prototype pollution)
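To make the two vulnerability classes concrete, here is a minimal sketch of what an unguarded object[key] = value assignment and a backtracking-prone regular expression can look like. The functions `unsafeMerge`, `safeMerge`, and `demoPollution`, the key blocklist, and the two regular expressions are illustrative assumptions; they are not taken from the dataset or from the Semgrep rules in the repository.

```javascript
// Vulnerable sketch: a recursive merge that assigns target[key] = value
// for any key, including "__proto__".
function unsafeMerge(target, source) {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value !== null && typeof value === "object") {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      unsafeMerge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Guarded sketch: the same merge, skipping keys that lead to prototypes.
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);
function safeMerge(target, source) {
  for (const key of Object.keys(source)) {
    if (BLOCKED_KEYS.has(key)) continue;
    const value = source[key];
    if (value !== null && typeof value === "object") {
      if (typeof target[key] !== "object" || target[key] === null) {
        target[key] = {};
      }
      safeMerge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Runs a merge on attacker-style JSON and reports whether a fresh, unrelated
// object picked up the injected property. Cleans up afterwards.
function demoPollution(merge) {
  const payload = JSON.parse('{"__proto__": {"polluted": true}}');
  merge({}, payload);
  const polluted = ({}).polluted === true;
  delete Object.prototype.polluted;
  return polluted;
}

// ReDoS sketch: the nested quantifier forces exponential backtracking on
// inputs that almost match; the second pattern accepts the same strings.
const backtrackingRegex = /^(a+)+$/;
const linearRegex = /^a+$/;
const nearMiss = "a".repeat(20) + "!"; // kept short so the demo stays fast
```

Calling `demoPollution(unsafeMerge)` reports pollution while `demoPollution(safeMerge)` does not. Both regular expressions reject `nearMiss`, but the first one needs a number of steps that roughly doubles with each additional "a".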
Finding new vulnerabilities in the wild
- We gather a large dataset of 9,205,654 JavaScript functions from active real-world projects from three different application types (NPM packages, Chrome extensions, and top websites). This collection process is also fully automated.
- A combination of pattern-based and textual-similarity-based approaches was used to identify matches between real-world functions and the previously created dataset of vulnerable functions. A static taint analysis (STA) with a novel representation of file dependency graphs was then used to check whether the flagged real-world functions are exploitable from malicious user input
- For details on the matching techniques, see section 4.2: content-sensitive hash comparison was used to transform the functions being compared into fixed-size strings. A similarity threshold can be set to evaluate the match
- We detect 124,934 vulnerable functions in this real-world dataset. The estimated average precision is 94.5%, based on manual verification of a small subset
- With our taint analysis, we identify 301 cases from 134 NPM packages (5.7% of all findings in NPM packages), which are exploitable in the project context. Manual verification of 100 cases detected no false positives produced by the taint analysis mechanism
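The idea of reducing functions to fixed-size strings and comparing them against a threshold can be sketched as follows. This is a simplified stand-in, not the hashing scheme from the paper: the token-trigram signature, the FNV-1a hash, the 64-character size, the Jaccard-style score, and the 0.8 default threshold are all assumptions chosen for illustration.

```javascript
// Split source into identifiers, numbers, and single punctuation characters,
// ignoring whitespace so that formatting does not affect the result.
function tokenize(source) {
  return source.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Fixed-size signature: a string of `size` characters ("0" or "1"), with a
// "1" at every position hit by the hash of some token trigram.
function signature(source, size = 64) {
  const bits = new Array(size).fill(0);
  const tokens = tokenize(source);
  for (let i = 0; i + 3 <= tokens.length; i++) {
    const gram = tokens.slice(i, i + 3).join(" ");
    bits[fnv1a(gram) % size] = 1;
  }
  return bits.join("");
}

// Share of set positions that the two signatures have in common (0 to 1).
function similarity(sigA, sigB) {
  let both = 0;
  let either = 0;
  for (let i = 0; i < sigA.length; i++) {
    const a = sigA[i] === "1";
    const b = sigB[i] === "1";
    if (a && b) both++;
    if (a || b) either++;
  }
  return either === 0 ? 0 : both / either;
}

// A candidate matches when its score reaches the (assumed) threshold.
function isMatch(candidate, vulnerable, threshold = 0.8) {
  return similarity(signature(candidate), signature(vulnerable)) >= threshold;
}

const knownVulnerable = "function set(obj, key, value) { obj[key] = value; }";
const reformattedCopy = "function set(obj,key,value){\n  obj[key] = value;\n}";
const copyMatches = isMatch(reformattedCopy, knownVulnerable);
```

Raising the threshold trades recall for precision: fewer near-duplicates are flagged, but those that are flagged resemble the known vulnerable function more closely.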
Semi-automated reporting approach
- to deal with a large number of disclosure notices, we develop a semi-automated technique to report our findings. We first search for duplicates of our findings in the CVE database (and identify 19 cases) and then automatically compose readable vulnerability reports for the remaining 290 findings and send them to 112 responsible project developers
- 25 new public CVEs and 169 reserved CVEs were obtained
All the framework code and datasets are available at: https://github.com/Marynk/JavaScript-vulnerability-detection.