CVE Awareness | Ben Dixon

A colourful background from https://www.color4bg.com/.

In a nutshell:

Accurate understanding of models' offensive cybersecurity capability is important
Existing cybersecurity benchmarks may struggle to distinguish between zero-day and one-day capability due to possible dataset contamination
This research intends to analyse this possible effect

Why I’m doing this

I’ve been looking for ways to contribute to AI safety and security research, such as working on implementations of benchmarks such as The Agent Company. Working on multi-container set ups in Inspect led me to seeing cybersecurity benchmarks. With the release of Claude Mythos and lots of public discussions of how capable LLMs are at finding and exploiting vulnerabilities, I wanted to see how these benchmarks worked. I’ve been looking into CVE Bench and making contributions.

As I worked on it, I started to wonder whether these benchmarks might suffer from dataset contamination. After conversations with a few different people (who I will check if they’re ok to be named here), I came up with the following research proposal, which I aim to look into in the next few weeks. I believe that this research would be useful in understanding cybersecurity capabilities, for reasons I give below.

On LLM usage - I asked Claude for some feedback on some early bullet points, and to suggest relevant papers. I did the rest of the thinking and writing myself.

My research proposal

Motivation

There is a pressing need to understand the capability of AI models to carry out offensive cybersecurity tasks such as gaining access to privileged data, or disabling a system. But current benchmarks may suffer from dataset contamination as models’ training data is often not stated publicly. Memorisation of previous exploits may lead to inflated performance. This research aims to investigate how well models recall specific CVEs and whether this leads to improved ability to find exploits.

Existing research

There are many existing benchmarks which aim to simulate target systems in varying levels of complexity. For example, Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, CVE-Bench tests agents’ ability to exploit real-world CVEs, and the Catastrophic Cyber Capabilities Benchmark (3CB) provides 15 original challenges. However, many of these benchmarks are becoming saturated, and researchers are creating more realistic and multi-stage environments. Folkerts et al 2026 test two realistic environments: 32-step corporate network attack and a 7-step industrial control system.

Some of these benchmarks, like Cybench, are reported on by model providers, such as Anthropic in the Opus 4.5 system card. These benchmarks are used to infer how capable models are, and then used to make decisions over deployment such as who gets access to models, and the level of monitoring and oversight from model providers. They also give an indication of how the frontier of open weight models is progressing. These benchmarks therefore need to give a realistic indication of model performance to help make these decisions which can have major safety and security implications.

However, many of these benchmarks could suffer from dataset contamination. Models’ training data is often not public, and published knowledge cut-off-dates may be only a short time before models are released. This may be a particularly important problem for cybersecurity tasks because solutions are often specific to a particular application version, and so recall could significantly improve task success.

For example, CVE-Bench has a “zero-day” setting, e.g. here, but if models are recalling public CVEs from their training data, then perhaps it is better characterised as a one-day evaluation. It is still important to evaluate models’ ability to recall and exploit one-day vulnerabilities, but this requires a different set of skills to developing zero day exploits.

In other contexts, researchers have argued that there is some evidence that model performance is due to memorisation rather than generalisable reasoning capacity. For example, Liang et al, 2025 uses the ability of models to recall arbitrary details from SWE Bench tasks to indicate that models are likely to have been trained on those tasks, and that as a result this would inflate performance. Prathifkumar et al, 2025 compare recall of arbitrary details from SWE Bench and performance, compared to their own unseen dataset of similar tasks from other Github issues.

On a different set of tasks, Wu et al, 2024 use ‘counterfactual’ tasks to test whether models can still perform a set of 11 tasks when typical assumptions are varied, such as using a different arithmetic base. Roberts et al, 2024 perform a natural experiment, finding that models perform much better on tasks published before the model’s release date.

One challenge with this literature is establishing clear causality between a model’s recall of a task, and whether this leads to improved success rate. It may be that models do memorise, but still are able to reason more generally.

Method

This research aims to examine to what extent models are familiar with details of CVEs, and whether this leads to improved success. I propose two different approaches. The first is using a longitudinal natural experiment with CVEs published before and after a model’s cut-off date. This would involve:

Testing recall of CVEs and details from the code, e.g. line, commit, function name. Based on Liang et al 2025, Prathifukumar et al, 2025
Assessing the models ability to identify and exploit CVEs, before and after the cut-off date
- Either setting up the environment for each CVE in CVE-Bench
- Or at least describing the environment and seeing if they can spot the exploit

The second approach is using counterfactual perturbations of CVEs. This involves:

Describing similar but fictitious applications with embedded vulnerabilities and seeing whether models spot the same weaknesses in code when the names are perturbed, or the functions are laid out differently. Based on Wu et al, 2024, and also Lau et al 2026

The first approach would only give suggestive evidence but seems easier to test experimentally. The second approach using counterfactual perturbations may give clearer evidence, since it would be a more direct comparison, but may be harder to test experimentally.

Hypotheses

There are several hypotheses which will jointly give information:

There is a statistically significant difference between the recall of information related to CVEs before compared to after a model’s knowledge cut-off date
There is a statistically significant difference between the ability of models to find and exploit CVEs before compared to after a model’s knowledge cut-off date
There is a statistically significant difference between model’s ability to exploit published CVEs compared to similar synthetic counterfactual CVEs

To make this valid we will need to account for varying levels of difficulty of CVEs, how well-known the codebase is, and to check we are comparing similar CVE types.

Dataset

The full research will use 20-30 CVEs. To start, I propose a small pilot using only three CVEs, which are chosen so they’re before and after a model’s knowledge date. This sample size is not enough to draw conclusions, but will help to set up the experimentation infrastructure.

CVE-2025-3248 Langflow 2025-04-07 (within)
CVE-2025-32434 PyTorch 2025-04-18 (within)
CVE-2025-55161 Stirling-PDF 2025-08-11 (after)

Again the full research will use several models. For a research pilot, I propose just one model, since this model has a less recent training date.

Claude-sonnet-4-5-20250929, training cutoff Jul 2025

A note on bias

This research aims to take an open mind as to whether models do exhibit general performance on cybersecurity tasks. A researcher thinking that model performance may be overstated, as providers have an incentive to game metrics improve perception and sales, may go in seeking to find that models rely on memory. On the other hand, a researcher concerned about the rapid pace of improvements, particularly on security tasks, may want to show that these results are genuine. To this researcher, both opposition motivations seem reasonable! This experiment intends to be as transparent and truth-seeking as possible. Also, it may be the case that some models, perhaps from some providers, display a greater memory of CVEs than others.

Why I’m doing this

My research proposal

A note on bias

Enjoy Reading This Article?