Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits

Keshav Ram, Achyudh Ram

dc.contributor.author	Keshav Ram, Achyudh Ram
dc.date.accessioned	2020-07-15 20:27:54 (GMT)
dc.date.available	2020-07-15 20:27:54 (GMT)
dc.date.issued	2020-07-15
dc.date.submitted	2020-07-06
dc.identifier.uri	http://hdl.handle.net/10012/16061
dc.description.abstract	Public vulnerability databases such as CVE and NVD account for only 60% of security vulnerabilities present in open-source projects and are known to suffer from inconsistent quality. Over the last two years, there has been considerable growth in the number of known vulnerabilities across projects available in various repositories such as NPM and Maven Central. However, public vulnerability management databases such as NVD suffer from poor coverage and are too slow to add new vulnerabilities. Such an increasing risk calls for a mechanism to promptly infer the presence of security threats in open-source projects. In this thesis, we seek to address this problem by treating the identification of security-relevant commits as a classification task. Since existing literature on neural networks for commit classification is sparse, we first turn to document classification for inspiration. Extensive research in this domain, on the other hand, has resulted in increasingly complex neural models, with a number of researchers questioning the necessity of such architectures. We conduct a large-scale reproducibility study of several recent neural network models, and show that well-executed, simpler models are quite effective for document classification. We find that a simple bi-directional LSTM with regularization yields competitive accuracy and F1 on four benchmark document classification datasets. Based on trends in document classification and the domain-specific peculiarities of commit classification, we build a family of hierarchical neural network models for the identification of security-relevant commits. We evaluate five different input representations and show that models that learn on tokens extracted from the commit diff are simpler and more effective than models that learn from path-contexts extracted from the AST. We also show that providing the models with contextual information through features extracted from the source code improves accuracy and F1 further, and discuss why path-based models might not capture any additional information compared to token-based models for this task. Finally, we make a case for reporting standard deviation of test scores across multiple runs in order to avoid erroneous conclusions and establish robust baselines.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	security vulnerabilities	en
dc.subject	security-relevant commits	en
dc.subject	neural networks	en
dc.subject	regularization	en
dc.subject	path-based representations	en
dc.subject	open source software	en
dc.title	Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits	en
dc.type	Master Thesis	en
dc.pending	false
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Master of Mathematics	en
uws.contributor.advisor	Nagappan, Meiyappan
uws.contributor.advisor	Lin, Jimmy
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: KeshavRam_AchyudhRam.pdf
Size: 1.729 MB
Format: PDF
