VulGuard: A Unified Tool for Evaluating Just‑In‑Time Vulnerability Prediction Models

This research introduces VulGuard, a fully automated, end‑to‑end framework designed to overcome the fragmented tooling and inconsistent pipelines that have limited just‑in‑time vulnerability prediction, streamlining the entire process from data mining to model deployment.

Ali Babar

9/25/2025

1. Introduction

Modern software systems evolve through millions of commits, often leaving a trail of subtle security flaws. Just‑in‑time vulnerability prediction (JIT‑VP) seeks to flag potentially vulnerable code changes at the moment they are committed, enabling developers to prioritize remediation before the flaw propagates into production. Despite its promise, JIT‑VP research has been limited by fragmented tooling, inconsistent data pipelines, and a lack of reproducible benchmarks. The VulGuard framework addresses these gaps by providing a fully automated, end‑to‑end solution for mining, labeling, training, evaluating, and deploying JIT‑VP models.

2. Architecture & Pipeline

VulGuard implements a fully automated, end‑to‑end JIT‑VP pipeline composed of five sequential stages that operate on a Git repository and produce a set of trained models ready for inference.

  1. Data Mining – The mining stage clones the target repository and extracts commit metadata. Commit messages, diffs, and blame information are collected, and a refined V‑SZZ algorithm labels vulnerability‑introducing commits (VICs) and vulnerability‑fixing commits (VFCs). The mining tool writes the mined data into a structured JSONL file that serves as the foundation for downstream processing (an illustrative mining sketch follows this list).
  2. Feature Engineering – The extracted commits are transformed into a rich feature set that spans code quality metrics, developer activity, temporal attributes, and optional property graphs via Joern for graph‑based models.
  3. Data Splitting – To prevent information leakage, the data is partitioned chronologically: 75 % for training, 5 % for validation, and 20 % for testing (a split sketch follows this list).
  4. Model Training & Evaluation – VulGuard supports eight JIT‑VP models. For each model, the training component reads the engineered features, trains the predictor, and tunes hyper‑parameters on the validation set. After training, the evaluation module computes standard binary‑classification metrics and effort‑recall metrics.
  5. Inference – The inference stage loads a trained model and applies it to new commits in a streaming fashion, returning a probability that a given commit will introduce a vulnerability.
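
The mining and splitting stages can be approximated with a few lines of ordinary Python. The first sketch below uses PyDriller, a general‑purpose repository‑mining library, rather than VulGuard's own miner, and the JSONL field names it writes are illustrative rather than VulGuard's actual schema.

```python
# Illustrative mining sketch using PyDriller (NOT VulGuard's own miner).
# It walks a Git repository and writes per-commit metadata to a JSONL file,
# the kind of raw input the later pipeline stages consume.
import json
from pydriller import Repository

def mine_commits(repo, out_path="commits.jsonl"):
    with open(out_path, "w") as out:
        for commit in Repository(repo).traverse_commits():
            record = {
                "commit_id": commit.hash,
                "message": commit.msg,
                "author": commit.author.name,
                "commit_time": commit.committer_date.isoformat(),
                "files_changed": commit.files,
                "lines_added": commit.insertions,
                "lines_deleted": commit.deletions,
            }
            out.write(json.dumps(record) + "\n")  # field names are illustrative

# Example: mine_commits("https://github.com/apache/commons-lang")
```

The chronological 75/5/20 split of stage 3 can then be reproduced as follows. This is a minimal sketch that assumes each mined record carries a commit_time field (an assumption carried over from the sketch above); it sorts commits oldest‑first and cuts the stream at the 75 % and 80 % boundaries.

```python
# Chronological split sketch: older commits train the model, newer commits
# test it, so the model never sees information from "future" commits.
import json

def chronological_split(jsonl_path, train_frac=0.75, valid_frac=0.05):
    with open(jsonl_path) as f:
        commits = [json.loads(line) for line in f]

    commits.sort(key=lambda c: c["commit_time"])    # oldest first

    n = len(commits)
    train_end = int(n * train_frac)                 # first 75 % -> training
    valid_end = int(n * (train_frac + valid_frac))  # next 5 %  -> validation

    return commits[:train_end], commits[train_end:valid_end], commits[valid_end:]

# Example: train_set, valid_set, test_set = chronological_split("commits.jsonl")
```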

3. Evaluation Results

VulGuard’s evaluation framework is designed to reflect two contrasting settings that highlight the gap between research‑grade and production‑grade performance of JIT‑VP models. The ideal setting uses curated, noise‑free datasets where every commit is correctly labeled. The realistic setting mirrors noisy commit streams found in large open‑source projects. In the ideal setting, all eight supported models achieve high precision. The most accurate model, JITFine, attains a PR‑AUC of 0.959 and an MCC of 0.864. In the realistic setting, PR‑AUC values drop by over 90 %. Even the best model, JITFine, falls to 0.111 PR‑AUC. These results underscore that current models are fragile and that additional robustness is required before deployment.
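
The reported metrics can be reproduced from raw predictions with standard scikit‑learn calls. The snippet below is a minimal sketch on toy data, not VulGuard's evaluation module; y_true and y_prob stand in for ground‑truth labels and predicted vulnerability probabilities, and effort‑aware metrics are omitted for brevity.

```python
# Minimal metric sketch on toy data (not VulGuard's evaluation module).
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])   # ground-truth labels
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.1, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)                 # threshold probabilities at 0.5

print("PR-AUC :", average_precision_score(y_true, y_prob))  # area under the P-R curve
print("ROC-AUC:", roc_auc_score(y_true, y_prob))
print("MCC    :", matthews_corrcoef(y_true, y_pred))
print("F1     :", f1_score(y_true, y_pred))
```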

4. Usage & Integration

VulGuard is distributed as a Python package that can be installed with pip install vulguard. The package exposes a command‑line interface (CLI) that covers the full JIT‑VP lifecycle: mining, training, evaluating, and inference. In addition, the library can be imported as a normal Python module, allowing developers to embed the workflow directly into CI/CD pipelines or custom tooling.

Command‑Line Workflow

  • Mining – Clone a repo, extract commit data, and label commits with V‑SZZ.
  • Training – Train the selected model with engineered features and hyper‑parameter tuning.
  • Evaluation – Compute PR‑AUC, MCC, F1, ROC‑AUC, and effort‑recall metrics.
  • Inference – Score new commits and return vulnerability probability.

The CLI can be invoked in CI jobs, for example in GitHub Actions, to perform continuous vulnerability assessment.
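
One way to wire this into a CI job is a small gate script that scores incoming commits and fails the build above a risk threshold. The sketch below is assumption‑heavy: the vulguard subcommand, flags, and JSON output format shown are placeholders, not the tool's documented interface, so adapt them to the actual CLI.

```python
# CI gate sketch. The "vulguard" invocation and its output format are ASSUMED
# placeholders; consult the real CLI documentation for the actual interface.
import json
import subprocess
import sys

THRESHOLD = 0.8  # maximum acceptable vulnerability probability

def score_commits(repo_path, model="jitfine", out_file="scores.json"):
    # Hypothetical subcommand and flags -- replace with the real ones.
    subprocess.run(
        ["vulguard", "inference", "--repo", repo_path,
         "--model", model, "--output", out_file],
        check=True,
    )
    with open(out_file) as f:
        # Assumed output shape: [{"commit_id": ..., "probability": ...}, ...]
        return json.load(f)

def main():
    risky = [s for s in score_commits(".") if s["probability"] >= THRESHOLD]
    for s in risky:
        print(f"High-risk commit {s['commit_id']}: p={s['probability']:.2f}")
    sys.exit(1 if risky else 0)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main()
```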

5. Future Directions

The VulGuard framework is designed to be extensible. Future work focuses on:

  1. Ensemble and Hybrid Models – Combine complementary predictors to reduce variance caused by noisy commits.
  2. Large Language Model (LLM) Integration – Fine‑tune LLMs on commit diffs and metadata for function‑level vulnerability prediction.
  3. Expanded Datasets and Vulnerability Types – Incorporate additional open‑source projects and domain‑specific vulnerabilities.
  4. Continuous Learning – Explore online learning strategies for models to adapt incrementally.
  5. Community‑Driven Benchmarking – Build a platform for researchers to submit models, datasets, and evaluation scripts.

6. Conclusion

VulGuard bridges the gap between academic JIT‑VP research and practical software security workflows. By providing an end‑to‑end, reproducible pipeline, our tool allows researchers to benchmark models on large, real‑world codebases and developers to integrate continuous vulnerability assessment into their CI/CD pipelines. The evaluation results reveal a stark contrast between ideal and realistic scenarios, highlighting the need for more robust algorithms. Moving forward, VulGuard will evolve to accommodate ensemble learning, LLM inference, and broader datasets, ultimately turning JIT‑VP into a dependable security practice at the velocity of modern DevOps.