Evaluating Just‑In‑Time Vulnerability Prediction in Real‑World Development

This research addresses the practical viability of Just-In-Time Vulnerability Prediction (JIT-VP) by conducting a realistic evaluation, contrasting it with idealised settings, and revealing a substantial performance drop when accounting for the full spectrum of commit types, including neutral commits.

Ali Babar

9/25/2025

1. Introduction & Motivation

We live in an era where software systems grow in size, complexity, and inter‑dependence. A single overlooked vulnerability can cascade into widespread outages, data breaches, and financial loss. Consequently, the security community has long pursued early detection techniques that can flag problematic code before it is merged and deployed. One promising line of research is Just‑In‑Time Vulnerability Prediction (JIT‑VP), which aims to predict whether a particular commit is likely to introduce a security flaw. By providing feedback at the moment a developer submits code, JIT‑VP has the potential to halt vulnerable changes in the development pipeline, thereby reducing the cost and effort associated with downstream bug‑fixing and patching.

While the idea of instant feedback is attractive, the practical viability of JIT‑VP has been questioned. Prior studies have typically relied on idealised datasets that contain only two classes of commits: those that introduce vulnerabilities (VICs) and those that fix them (VFCs). In real‑world workflows, however, the majority of commits are neutral—neither creating nor resolving security issues. These neutral commits (VNCs) dominate the development stream and pose a severe class‑imbalance problem for any learning algorithm. Existing evaluations that ignore VNCs tend to overestimate predictive performance, giving an unrealistic picture of a model’s effectiveness in production settings.

Our work addresses this gap by conducting a realistic evaluation of JIT‑VP. We curated a dataset of over 1.08 million commits from two large open‑source projects—FFmpeg and the Linux kernel—capturing the full spectrum of commit types (VIC, VFC, VNC). Commit labels were derived through a combination of the V‑SZZ algorithm, developer‑informed heuristics, and manual verification, ensuring high‑quality ground truth. With this dataset, we systematically compared eight state‑of‑the‑art JIT‑VP techniques (VCCFinder, CodeJIT, LR, TLEL, DeepJIT, LAPredict, SimCom, and JITFine) under two distinct evaluation regimes:

  1. Idealised – training and testing only on VIC and VFC commits.
  2. Realistic – incorporating the overwhelming number of VNC commits, thereby reflecting the true distribution encountered by developers.

The metrics we selected for evaluation—PR‑AUC, MCC, F1‑score, and ROC‑AUC—are well‑suited to highly skewed datasets. PR‑AUC and MCC, in particular, are robust against severe class imbalance, providing a more honest assessment of a model’s ability to correctly identify rare vulnerable commits.

Our findings reveal a stark contrast between the two regimes. In the idealised setting, models achieve high PR‑AUC scores (average > 0.8) and demonstrate that JIT‑VP can, in principle, deliver accurate predictions. However, once VNC commits are introduced, the vulnerable‑to‑safe ratio plummets to 1:17 in FFmpeg and 1:217 in Linux. Under these conditions, PR‑AUC drops by more than 90 %, with the best models scoring as low as 0.015 in the Linux dataset. Generic imbalance‑mitigation techniques (oversampling, undersampling, SMOTE, OSS, focal loss) yield only marginal gains and sometimes degrade performance. These results have significant implications for both researchers and practitioners.

In the sections that follow, we detail the construction of our benchmark, the experimental protocol, the empirical results, and the lessons learned for the broader security and machine‑learning communities.

2. Building a Realistic Benchmark

We assembled a benchmark that mirrors the complexity of real‑world software development by collecting over 1.08 million commits from two large, actively maintained open‑source projects: the FFmpeg multimedia framework and the Linux kernel. For each commit, the dataset records the source code diff, an engineered feature vector that encodes static code metrics, and the rich metadata (author, timestamp, branch, and issue identifiers) that is normally available to a continuous integration pipeline.
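
To make the structure of each entry concrete, the following minimal Python sketch shows one way such a record could be represented in memory; the field names are illustrative and do not reflect the released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CommitRecord:
    """Illustrative layout of one benchmark entry (field names are hypothetical)."""
    commit_hash: str                  # unique identifier of the commit
    project: str                      # "FFmpeg" or "Linux"
    author: str                       # commit author from the metadata
    timestamp: int                    # Unix time, used later for chronological splits
    branch: str                       # branch the commit landed on
    diff: str                         # raw source-code diff
    features: List[float]             # engineered static code metrics
    issue_ids: List[str] = field(default_factory=list)  # linked issue identifiers
    label: str = "VNC"                # one of "VIC", "VFC", "VNC"
```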

Commit Taxonomy

We categorised each commit into one of three mutually exclusive classes:

  • Vulnerability‑Introducing Commit (VIC): a commit that introduces a security‑critical defect. Identified as vulnerable by our labeling pipeline and confirmed through manual inspection.
  • Vulnerability‑Fixing Commit (VFC): a commit that patches a known vulnerability. Marked as safe; the bug is already present in the repository history.
  • Vulnerability‑Neutral Commit (VNC): a commit that does not affect vulnerability status. Also marked as safe; these are the majority of commits in any active project.

The inclusion of VNCs is essential because, in practice, a developer cannot filter out neutral commits before training a model. Omitting them would create a selection bias that inflates performance metrics and misrepresents deployment scenarios.

Labeling Process

We adopted a multi‑stage pipeline to generate ground‑truth labels:

  1. V‑SZZ Extraction – We ran the V‑SZZ algorithm on the full commit history to associate each commit with the set of vulnerabilities that it introduced or fixed. V‑SZZ leverages issue tracker links and code ownership to trace the origin of a defect.
  2. Developer‑Informed Heuristics – For commits where V‑SZZ could not resolve a vulnerability status (e.g., missing issue links), we applied heuristics based on commit messages, file paths, and known vulnerability patterns. This step captures many subtle cases that V‑SZZ overlooks.
  3. Manual Verification – A team of security analysts reviewed a random sample of commits flagged as VICs and VNCs to ensure that the automated pipeline had not mislabelled any commit. The verification step confirmed a false‑positive rate below 1 %.

VICs are labelled vulnerable, whereas VFCs and VNCs are labelled safe. The resulting class distribution is highly imbalanced, with vulnerable commits constituting only a small fraction of the dataset (vulnerable‑to‑safe ratios of roughly 1:17 for FFmpeg and 1:217 for Linux). This imbalance reflects the true prevalence of vulnerabilities in large codebases.
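
The following is a minimal sketch of how the three stages could be composed into a single labeling function; `vszz_labels` and `heuristic_label` are hypothetical stand‑ins for the V‑SZZ output and the developer‑informed heuristics, not the actual implementation.

```python
def label_commit(commit_hash, message, paths, vszz_labels, heuristic_label):
    """Return the three-way class and the binary target for one commit.

    `vszz_labels` maps commit hashes to 'VIC' or 'VFC' where V-SZZ could
    resolve a status; `heuristic_label` is a callable implementing the
    developer-informed rules. Both are hypothetical stand-ins.
    """
    # Stage 1: V-SZZ extraction resolves most vulnerability-related commits.
    three_way = vszz_labels.get(commit_hash)
    # Stage 2: fall back to message/path heuristics when V-SZZ is inconclusive.
    if three_way is None:
        three_way = heuristic_label(message, paths) or "VNC"
    # Stage 3 (manual verification) is performed offline on a sampled subset
    # of flagged commits, so it does not appear in this sketch.
    # Only vulnerability-introducing commits form the positive class.
    binary = "vulnerable" if three_way == "VIC" else "safe"
    return three_way, binary
```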

Public Availability

To encourage reproducibility, we released the full benchmark on Figshare under a CC‑BY license. The dataset package includes:

  • The raw commit diffs and metadata files.
  • The engineered feature matrix used in our experiments.
  • The label vector that maps each commit to VIC, VFC, or VNC.

In parallel, we published the VulGuard tool on GitHub, which provides utilities for downloading the benchmark, reproducing the labeling pipeline, and generating the feature set from a fresh code checkout. The combination of a static dataset and a dynamic tool enables other researchers to extend the benchmark to new projects or to refine the labeling heuristics.

By constructing the benchmark in this manner, we preserve the authenticity of the development workflow while providing a comprehensive, reproducible resource for evaluating JIT‑VP models under realistic conditions.

3. Experimental Protocol

We evaluated eight state‑of‑the‑art Just‑In‑Time Vulnerability Prediction (JIT‑VP) techniques on the public dataset of 1,081,882 commits extracted from FFmpeg and the Linux kernel. The methods comprise VCCFinder, CodeJIT, Logistic Regression (LR), TLEL, DeepJIT, LAPredict, SimCom, and JITFine. Our goal was to understand how each model behaves under two distinct evaluation settings that mirror different assumptions about the data available during development.

Evaluation Settings

Idealised Setting

In the idealised scenario we restrict the training and test sets to Vulnerability‑Introducing Commits (VICs) and Vulnerability‑Fixing Commits (VFCs) only. This reproduces the common practice in prior literature, where the model is assumed to see only commits that either introduce or remove vulnerabilities. The class distribution is still skewed in favour of VFCs, yet the imbalance is far milder than in the realistic setting. The idealised setting therefore benchmarks the best achievable performance when the training data is free of neutral commits.

Realistic Setting

The realistic setting expands the data pool to include Vulnerability‑Neutral Commits (VNCs), which represent the bulk of commits in a real development workflow. Under this setting the class ratios become 1:17 for FFmpeg and 1:217 for Linux, reflecting a true production imbalance. We evaluate the same eight models on the full VIC+VFC+VNC datasets to assess how the abundance of safe commits impacts predictive power.
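
As a rough illustration, the two evaluation pools can be derived from the labelled commit table as in the sketch below; it assumes a pandas DataFrame with a hypothetical `label` column holding 'VIC', 'VFC', or 'VNC'.

```python
import pandas as pd

def build_evaluation_pools(commits: pd.DataFrame) -> dict:
    """Derive the idealised and realistic evaluation pools from labelled commits."""
    commits = commits.copy()
    # Only vulnerability-introducing commits count as the positive class.
    commits["target"] = (commits["label"] == "VIC").astype(int)

    # Idealised pool: neutral commits are excluded, as in prior studies.
    idealised = commits[commits["label"].isin(["VIC", "VFC"])]
    # Realistic pool: the full commit stream, dominated by neutral commits.
    realistic = commits

    return {"idealised": idealised, "realistic": realistic}
```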

Metrics

To capture model behaviour across both settings we report four complementary metrics:

  • PR‑AUC (Area Under the Precision‑Recall Curve) focuses on the model’s ability to correctly identify vulnerable commits in the presence of a large number of safe commits. PR‑AUC is especially informative when the positive class is rare, as it penalises false positives more heavily than ROC‑AUC.
  • MCC (Matthews Correlation Coefficient) is a balanced measure that remains informative even under extreme class imbalance by combining true positives, true negatives, false positives, and false negatives into a single correlation coefficient.
  • F1‑score captures the harmonic mean of precision and recall, providing a single‑number summary that is sensitive to both types of error.
  • ROC‑AUC remains useful for completeness, as it measures discriminative ability regardless of class distribution.

The choice of PR‑AUC and MCC reflects our emphasis on robustness to class imbalance, a key challenge highlighted by the 1:217 ratio in the Linux realistic dataset.
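
For concreteness, the sketch below shows how these four metrics can be computed with scikit‑learn from a model's predicted scores; the 0.5 threshold used for the thresholded metrics is an assumption, not the protocol's exact choice.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def evaluate_fold(y_true, y_score, threshold=0.5):
    """Compute the four reported metrics for one model on one test fold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        # Average precision is the usual estimator of the area under the PR curve.
        "pr_auc": average_precision_score(y_true, y_score),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }
```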

Experimental Procedure

For each of the eight methods we performed the following steps:

  1. Feature extraction: We extracted the same engineered features used in the original studies (e.g., commit message length, number of files changed, code churn metrics) and combined them with commit metadata such as author, date, and repository path.
  2. Dataset partitioning: The dataset was split into training and test folds that preserve the commit chronology to avoid leakage (a minimal chronological‑split sketch follows this list). In the idealised setting, splits were performed only on VIC and VFC commits. In the realistic setting, VNC commits were included in both training and test folds.
  3. Model training: Each method was trained on the training fold using its default hyper‑parameters as reported in the literature. No additional tuning was performed beyond what is required for each method to operate.
  4. Evaluation: We computed the four metrics on the held‑out test fold. For methods that output class probabilities, we generated precision‑recall and ROC curves by varying the decision threshold.
  5. Imbalance mitigation experiments: For the realistic setting we additionally applied generic imbalance‑handling techniques—Random Oversampling (ROS), Random Undersampling (RUS), SMOTE, One‑Sided Selection (OSS), and focal loss—to evaluate their effect on each model’s PR‑AUC and MCC.
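
To make the chronology‑preserving split in step 2 concrete, here is a minimal sketch; it assumes a pandas DataFrame with a hypothetical `timestamp` column and uses a single time‑ordered cut‑off, which is one simple way to avoid temporal leakage.

```python
import pandas as pd

def chronological_split(commits: pd.DataFrame, train_fraction: float = 0.8):
    """Split commits by time so that no future commit leaks into the training fold."""
    ordered = commits.sort_values("timestamp")       # oldest commits first
    cut = int(len(ordered) * train_fraction)         # single chronological cut-off
    return ordered.iloc[:cut], ordered.iloc[cut:]    # (train fold, test fold)
```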

Observations

The experimental protocol revealed stark differences between the two settings. In the idealised setting all models achieved PR‑AUC values above 0.8 on average, with JITFine attaining 0.96 on FFmpeg and 0.89 on Linux. However, when VNC commits were introduced, PR‑AUC collapsed by more than 90 % in the realistic setting, falling to as low as 0.015 on Linux. The MCC mirrored this trend, dropping from values around 0.6 in the idealised setting to near zero in the realistic scenario.

The imbalance mitigation techniques only yielded marginal gains—e.g., PR‑AUC increased from 0.092 to 0.114 on FFmpeg with RUS—while in several cases (ROS, focal loss) the performance of DeepJIT and SimCom deteriorated, indicating that generic techniques do not translate well to the JIT‑VP domain.

These results underscore the necessity of realistic evaluation protocols. The dramatic performance degradation in the realistic setting shows that models previously considered effective may not generalise to production workloads where neutral commits dominate. Our protocol therefore provides a benchmark that faithfully reflects the challenges faced by developers and security analysts in real projects.

4. Findings & Insights

The realistic setting overwhelms the rare vulnerable commits with a vast number of neutral commits. In FFmpeg the vulnerable‑to‑safe ratio is 1:17, and in the Linux kernel it is 1:217. This extreme imbalance is the core reason why the performance of every JIT‑VP model collapses when evaluated on real‑world data.

Performance Gap Between Idealised and Realistic Settings

  • Idealised (VIC + VFC): average PR‑AUC above 0.8; best model JITFine, with 0.96 on FFmpeg and 0.89 on Linux.
  • Realistic (VIC + VFC + VNC): PR‑AUC as low as 0.015 on Linux; best model TLEL, with 0.114 on FFmpeg.

The drop of more than 90 % in PR‑AUC is not a numerical artifact but a direct consequence of the data distribution. In the realistic setting models must correctly flag a single vulnerable commit among hundreds of safe commits, a task that is far beyond what the current architectures were designed for.

Effectiveness of Generic Imbalance‑Mitigation Techniques

We applied five common techniques to the realistic dataset:

  • Random Oversampling (ROS)
  • Random Undersampling (RUS)
  • SMOTE (Synthetic Minority Over‑Sampling Technique)
  • One‑Sided Selection (OSS)
  • Focal Loss

The results were uniformly disappointing. RUS produced the largest relative gain for FFmpeg, raising PR‑AUC from 0.092 to 0.114. All other techniques either yielded negligible improvements or, in the case of ROS and focal loss, caused the models DeepJIT and SimCom to collapse entirely. These findings highlight that generic balancing strategies are ill‑suited for the JIT‑VP problem and that the community must develop domain‑specific solutions.
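
The data‑level techniques above can be applied with the imbalanced‑learn library, as in the hedged sketch below; the hyper‑parameters are library defaults rather than the exact configurations used in our experiments, and focal loss modifies the training objective rather than the data, so it is not shown here.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import OneSidedSelection, RandomUnderSampler

# Data-level rebalancing strategies; focal loss is applied inside the model's
# training loss and therefore does not appear in this mapping.
RESAMPLERS = {
    "ROS": RandomOverSampler(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "OSS": OneSidedSelection(random_state=42),
}

def rebalance(name, X_train, y_train):
    """Resample only the training fold; the test fold keeps its natural skew."""
    return RESAMPLERS[name].fit_resample(X_train, y_train)
```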

Ranking Shift Among Models

In the idealised setting JITFine ranked first. Once VNCs were introduced, the ranking flipped: TLEL emerged as the best performer, achieving the highest PR‑AUC under realistic conditions. This shift demonstrates that model popularity in the literature may be an artifact of the evaluation protocol rather than an indicator of real‑world performance.

Practical Implications

  1. Researchers must adopt realistic protocols that include VNCs, otherwise their results over‑estimate the effectiveness of JIT‑VP models.
  2. Practitioners should be cautious when interpreting performance numbers from the literature; a high PR‑AUC in an idealised setting does not guarantee utility in a production pipeline.
  3. The extreme class imbalance is the fundamental bottleneck. Future work must focus on techniques that explicitly address this imbalance—e.g., tailored data augmentation, generative modeling of vulnerable commits, or new loss functions that penalise false negatives more severely (see the focal‑loss sketch after this list).
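
As one concrete example of such a loss‑level intervention, the sketch below gives a minimal PyTorch implementation of a class‑weighted binary focal loss; the `alpha` and `gamma` values are illustrative choices, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.95, gamma=2.0):
    """Binary focal loss that up-weights the rare vulnerable class.

    `targets` is a float tensor of 0/1 labels; `alpha` shifts weight toward
    the positive (vulnerable) class and `gamma` down-weights easy examples.
    Both values here are illustrative.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                  # probability of the true class
    class_weight = alpha * targets + (1 - alpha) * (1 - targets)
    return (class_weight * (1 - p_t) ** gamma * bce).mean()
```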

By exposing these limitations, we aim to steer future research toward methods that are both effective in theory and viable in practice.

5. Future Directions & Take‑aways

Our realistic evaluation of Just‑In‑Time Vulnerability Prediction (JIT‑VP) exposes a significant performance gap compared with the idealised benchmark. The extreme class imbalance—approximately 1 vulnerable commit for every 17 to 217 safe commits—drives PR‑AUC down by more than 90 % and shows that generic data‑balancing tricks such as random undersampling, oversampling, SMOTE, OSS, and focal loss provide only marginal improvements. These results indicate that the core challenge is not simply the scarcity of positive examples but the idiosyncratic structure of vulnerability‑related code changes.

Domain‑Specific Imbalance Handling

A promising research direction is to develop domain‑specific imbalance‑handling methods. Rather than applying generic resampling or loss‑weighting strategies, we can design techniques that explicitly account for the patterns that characterize vulnerable commits. For example, tailored data augmentation could generate synthetic vulnerable changes that preserve the key security‑related patterns while varying superficial code features. Likewise, generative models could produce realistic vulnerable code snippets conditioned on vulnerability labels, thereby enriching the training set with high‑quality examples.

Expanding to Related Tasks

The insights from this study also motivate exploration of other vulnerability‑management tasks. We identified opportunities in areas such as automated patch prioritisation and related downstream applications. By extending the realistic evaluation framework to these tasks, we can assess whether the same imbalance challenges persist and whether the domain‑specific techniques that succeed for JIT‑VP transfer to these settings.

In summary, our work highlights the necessity of realistic evaluation protocols and points to concrete future research avenues that can bring JIT‑VP models closer to practical deployment.