Deep representation learning for software vulnerability detection
Author(s)
Md Hanif, Mohamad Hazim
Type
Thesis
Abstract
Software vulnerabilities pose a significant threat to the security of software systems, particularly in the context of the growing number of open-source software projects. Detecting vulnerabilities in C/C++ code presents unique challenges due to the complex and low-level nature of these programming languages. Traditional approaches relying on conventional methods and manual analysis are time-consuming, inefficient, and prone to misclassification.
This thesis addresses critical issues related to vulnerability detection, including training data size, detection performance, model size, and robustness of detection models. To establish a systematic understanding of the topic, we conducted a comprehensive review of the existing literature on software vulnerability detection and developed a taxonomy that encompasses relevant research interests and machine learning approaches. Building upon this foundation, we curated datasets comprising raw C/C++ source code extracted from diverse GitHub repositories, as well as a vulnerability dataset derived by merging information from the National Vulnerability Database and GitHub commits. Leveraging these datasets, we evaluated emerging deep learning approaches such as Abstract Syntax Tree Neural Networks (ASTNNs), Graph Neural Networks (GNNs), and Transformers to determine their suitability for vulnerability detection.
Furthermore, we proposed VulBERTa, a simplified pre-training model that leverages limited pre-training data to acquire syntactical and contextual information from raw source code. Notably, VulBERTa achieved state-of-the-art detection performance and demonstrated remarkable effectiveness across various datasets and vulnerability detection benchmarks, including D2A and CodeXGLUE. To enhance the performance of deep learning-based vulnerability detection models, we explored techniques that supplement these models with additional information, such as source code metrics, diverse code representations, more pre-training data, and alternative tasks. Intriguingly, our experimental results indicated that our initially proposed model already captures relevant information without relying on these supplementary techniques.
Additionally, we introduced VulBench, a comprehensive framework designed to assess the robustness of black-box detection models by subjecting them to a series of vulnerability-related tasks. This framework facilitates the ranking of models and techniques through a leaderboard based on their scores across different tasks. To gain further insights into black-box models, we conducted in-depth case studies, manually investigating their decision boundaries in selected Common Vulnerabilities and Exposures (CVEs).
We hope that this thesis serves as a foundational study and a valuable reference for future research in software vulnerability detection, particularly in the domain of deep representation learning. It contributes novel approaches, such as VulBERTa and VulBench, to address the challenges associated with vulnerability detection, and establishes new benchmarks for evaluating the performance and robustness of detection models.
Version
Open Access
Date Issued
2023-05
Date Awarded
2023-08
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Maffeis, Sergio
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)