IARPA is seeking information about binary and source code datasets, annotated with authorship and other relevant information, that could be obtained and made available to research partners in potential future programs. Datasets of interest may cover benign tools, malicious tools, or penetration testing tools.
A key challenge in exploring the authorship of malicious binaries and source code is the limited availability of sizeable annotated datasets. The challenge arises because the data must carry annotations indicating authorship. Datasets containing additional metadata, beyond authorship, may also be useful in bridging other scientific gaps in source code or binary authorship research. One of the largest binary datasets in the academic literature is the APT Class dataset. At least one research paper references datasets obtained from commercial companies, but few details are provided about these datasets.
To explore malicious source code authorship, surrogate data may be necessary, as available malicious source code may be limited or non-existent. Google Code Jam (GCJ) datasets appear to be among the more common datasets used for exploring source code authorship; however, they generally consist of small snippets of code. Determining the authorship of both source code and binaries is an interesting challenge, but few datasets pair source code with the corresponding compiled binaries. One known example is Caliskan-Islam et al., who compiled the C/C++ code in the Google Code Jam datasets.
For more information and submission instructions, please visit SAM.gov.
Published Date: Sep 23, 2022
Response Date: Oct 10, 2022 05:00 pm EDT