ChatGPT in Software Security: What Practitioners Think vs. What It Can Really Do

In this post we explore our recent study on using ChatGPT as a conversational assistant for software security tasks. We combine a perception analysis of Twitter discussions with a hands‑on vulnerability detection experiment, revealing that while developers love the speed of GPT‑style tools, the model’s outputs are often generic and correct in only 61.4 % of cases. We explain the research design, highlight key results, and discuss what this means for the future of domain‑specific security LLMs.

Ali Babar

9/25/2025

1. Why Study ChatGPT for Security?

The rapid proliferation of large language models (LLMs) has made conversational assistants like ChatGPT a familiar presence in many domains, yet their practical value for software security remains unclear. In our work we set out to answer two central questions: first, how do security practitioners actually view ChatGPT’s potential, and second, how well does the model perform when tasked with concrete vulnerability detection? These questions are motivated by the mismatch that often exists between hype and hard data in the security field. Practitioners frequently cite rapid code review and information retrieval as strengths, but concerns about reliability and the need for human verification persist.

Our study is grounded in a twofold methodology that blends perception analysis with empirical evaluation. We mined 7,716 English tweets containing the keyword “ChatGPT” and security‑related terms, then manually coded 700 of those posts for demographics, discussion type, topic, and sentiment. This dataset reflects the real‑world discourse that shapes expectations about LLMs in security workflows. In parallel, we curated a set of 70 real‑world CVEs published after ChatGPT’s knowledge cutoff (September 2021). Each CVE was linked to a single‑function GitHub commit, providing a focused code context for the model to analyze. By feeding these snippets into GPT‑4 with a carefully engineered prompt—“Role: You are a software security expert. Instruction: Please analyze the following code snippet for potential security vulnerabilities. Provide a detailed explanation of the issues you find.”—we obtained 70 outputs that could be examined for correctness and usefulness.

The research questions that guided this design are explicitly stated in the paper: RQ1 – Perception: How do software security professionals perceive the use of ChatGPT for security tasks? and RQ2 – Practicality: How accurate and useful are ChatGPT’s outputs when tasked with vulnerability detection? RQ1 is answered through the Twitter analysis, which reveals that 54 % of the overall sentiment is positive, with practitioners enthusiastic about rapid code review and information retrieval. However, the same dataset shows that speculative tweets skew negative (47 %) and that applied tweets carry mixed sentiment, indicating a cautious stance toward production‑level usage. RQ2 is quantified by measuring the model’s detection accuracy—43 of 70 vulnerabilities (61.4 %) were correctly identified—and by evaluating output quality. Seventy percent of responses contained generic security information, and in half of the cases the model asked for additional context, underscoring its tentative nature.

The motivation for this research is threefold. First, the security community needs evidence about whether a general‑purpose LLM can reliably surface vulnerabilities or whether it merely reproduces generic best‑practice advice. Second, the findings illuminate the gap between user enthusiasm and empirical performance, informing both practitioners and tool developers about realistic expectations. Third, the results point toward the necessity of domain‑specific, fine‑tuned LLMs and rigorous human‑in‑the‑loop validation pipelines. The study therefore not only answers the posed research questions but also contributes actionable guidance for future research and industry adoption.

By aligning practitioner perceptions with measured accuracy, this work lays the foundation for a more nuanced understanding of conversational AI’s role in software security and highlights the imperative for specialized models that can move beyond generic prompts to deliver concrete, actionable insights in real‑world codebases.

2. How We Measured Perception and Practicality

The study was structured around two complementary data streams that together provided a holistic view of ChatGPT’s role in software security. First, a perception analysis was extracted from a curated set of Twitter posts, giving an empirical sense of how practitioners talk about the model. Second, a controlled vulnerability‑detection experiment quantified the model’s practical performance on real‑world code. Both streams were designed to minimize bias and maximize reproducibility.

Data Collection

Twitter Corpus

The perception arm drew from 7,716 English tweets that mentioned ChatGPT and security‑related keywords between December 2022 and February 2023. After an initial automated filtering pass removed off‑topic content, 700 tweets were selected for manual coding. Each tweet was annotated for author demographics, topic category, sentiment, and discussion type. A Cohen’s κ of 0.813 indicated high inter‑rater reliability, showing that the coding scheme was robust and the annotations trustworthy.
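
For readers less familiar with the agreement statistic, the short sketch below shows how Cohen’s κ is typically computed for two annotators’ sentiment labels. It uses scikit‑learn on illustrative placeholder labels, not the study’s actual annotations.

# Minimal sketch: Cohen's kappa for two annotators' sentiment labels.
# The label lists are illustrative placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "negative", "neutral", "positive", "positive", "negative", "neutral", "positive"]
annotator_b = ["positive", "negative", "positive", "positive", "neutral", "negative", "neutral", "positive"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values above ~0.8 are commonly read as strong agreement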

Vulnerability Dataset

For the practicality arm, 70 Common Vulnerabilities and Exposures (CVEs) were pulled from the National Vulnerability Database. All CVEs were published after September 2021, thereby lying beyond GPT‑4’s knowledge cutoff. Each CVE was mapped to a single GitHub commit that contained a focused function fixing the issue. This mapping ensured that the code snippets fed to the model were minimal yet representative of real‑world vulnerability contexts.
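
The collection scripts themselves are not part of the paper, but to make the selection criterion concrete, here is a rough sketch of how post‑cutoff CVEs could be pulled in Python. It assumes the public NVD 2.0 REST API; the endpoint, parameters, and response fields should be verified against the current documentation, and because the API caps each publication‑date window, a full collection run would loop over several windows.

# Sketch only: query the NVD for CVEs published after the September 2021 cutoff.
# Endpoint, parameter names, and response fields assume the NVD 2.0 REST API
# and should be checked against the current documentation before use.
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

params = {
    "pubStartDate": "2021-10-01T00:00:00.000",  # just past the knowledge cutoff
    "pubEndDate": "2022-01-28T00:00:00.000",    # the API limits each window; iterate over windows in practice
    "resultsPerPage": 200,
}

response = requests.get(NVD_URL, params=params, timeout=30)
response.raise_for_status()

candidates = [
    {"id": item["cve"]["id"], "published": item["cve"]["published"]}
    for item in response.json().get("vulnerabilities", [])
]
print(f"Fetched {len(candidates)} post-cutoff CVE candidates")
# In the study, each selected CVE was then mapped by hand to a single-function GitHub fix commit.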

Prompt Design and Execution

All queries were issued to GPT‑4 using a consistent prompt template:

Role: You are a software security expert.
Instruction: Please analyze the following code snippet for potential security vulnerabilities. Provide a detailed explanation of the issues you find.

The template explicitly framed the model as an expert, encouraging the generation of detailed and actionable responses. Every query was run in a brand‑new chat session to prevent any cross‑session context leakage that could bias the results. Outputs were captured verbatim and stored alongside the corresponding CVE and code snippet for subsequent qualitative analysis.
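
The study issued its queries through fresh ChatGPT sessions. For readers who want to reproduce the protocol programmatically, the following sketch drives the same prompt template through the OpenAI Python SDK; the client calls follow the v1 SDK, and the gpt‑4 model identifier is an assumption rather than a detail reported in the paper.

# Sketch: one fresh request per code snippet, mirroring the study's
# use of a brand-new chat session for every CVE.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Role: You are a software security expert."
INSTRUCTION = (
    "Instruction: Please analyze the following code snippet for potential "
    "security vulnerabilities. Provide a detailed explanation of the issues you find."
)

def analyze_snippet(snippet: str) -> str:
    # A fresh message list per call means no earlier answer can leak into the next one.
    response = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption; the study used the ChatGPT interface
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{INSTRUCTION}\n\n{snippet}"},
        ],
    )
    return response.choices[0].message.content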

Experiment Setup

The experiment involved 70 code snippets, each paired with its associated CVE. For each snippet, the model produced a vulnerability report. Two authors independently examined the 70 outputs, marking whether the model had correctly identified the vulnerability. The process yielded a 61.4 % detection rate: 43 of the 70 vulnerabilities were correctly flagged.
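
The bookkeeping behind that figure is simple arithmetic; the toy sketch below reproduces it with placeholder verdicts, since the per‑CVE judgments and any reconciliation rule between the two reviewers are not reported.

# Toy sketch of the scoring step with placeholder verdicts (True = correctly identified).
verdicts = {f"CVE-EXAMPLE-{i:04d}": i < 43 for i in range(70)}  # illustrative data only

correct = sum(verdicts.values())
print(f"Detected {correct} of {len(verdicts)} vulnerabilities "
      f"({correct / len(verdicts):.1%})")  # 43 of 70 -> 61.4%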

Beyond accuracy, the content of the responses was evaluated for quality. Seventy percent of the outputs contained generic security information—such as high‑level guidelines or theoretical implications—rather than concrete, actionable findings. Half of the responses requested additional context, signaling uncertainty or insufficient information. The language used by the model also reflected a cautious tone; higher‑severity issues prompted more tentative phrasing, and assertive statements were rare.

These metrics together paint a detailed picture of both perception and practical capability. The Twitter survey shows enthusiastic, albeit cautious, user sentiment, while the experimental results reveal that ChatGPT can locate vulnerabilities in many cases but frequently lacks the depth needed for production‑grade security work.

3. What We Found

In our perception study, 700 tweets were coded for demographics, topic, and sentiment. Overall, 54 % of the coded sentiment was positive, with practitioners expressing enthusiasm for rapid code review and information retrieval. However, speculative tweets skewed negative (47 %) and applied tweets carried mixed sentiment, indicating a cautious stance toward production‑level usage. The Twitter analysis also revealed that 39 % of the users were security practitioners, 13 % represented security companies, 9 % were software practitioners, 23 % were blogs, and 16 % fell into other categories. The topics discussed most frequently were vulnerability detection (29 %), vulnerability exploits (27 %), information retrieval (15 %), code analysis (12 %), and other (17 %).

For practicality, we evaluated GPT‑4 on 70 real‑world CVEs published after September 2021. The model correctly identified 43 of the 70 vulnerabilities, yielding a 61.4 % detection rate. The outputs were also analyzed qualitatively for usefulness. Seventy percent of responses contained generic security information such as high‑level guidelines or theoretical implications, and in 50 % of the cases the model asked for additional context, indicating uncertainty. The model’s language was cautious: higher‑severity issues elicited more tentative phrasing, and assertive statements were rare. These findings illustrate that while ChatGPT can locate vulnerabilities in many cases, the accompanying explanations are often broad and lack actionable detail.

Overall, the study shows a mismatch between user enthusiasm and empirical performance: practitioners see ChatGPT as a helpful assistant, yet the tool’s outputs are frequently generic and require human verification before they can be trusted for production use.

4. Implications for Practitioners and Researchers

The results highlight a clear trade‑off. On the one hand, the 61.4 % detection accuracy and the rapid response time of ChatGPT make it a useful conversational assistant for drafting vulnerability reports, teaching secure coding, or conducting initial code reviews. On the other hand, the high proportion of generic or tentative responses, the frequent need for additional context, and the lack of actionable detail mean that the model cannot yet replace a human security analyst for production‑grade fault localization.

These findings motivate several research and practice directions. First, the data suggest that fine‑tuning a large language model on curated security datasets could improve accuracy and reduce generic advice. Second, incorporating contextual awareness—such as code structure, project metadata, or user intent—could help the model generate more precise findings. Third, a robust human‑in‑the‑loop validation pipeline is essential; the model should serve as a first‑pass filter whose outputs are verified by a qualified engineer before deployment. Finally, the speculative tweets in the perception study point to potential malicious uses, underscoring the need for responsible AI guidelines that mitigate misuse.
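
To make the third direction concrete, here is a minimal sketch of what such a human‑in‑the‑loop gate could look like. Every name in it is hypothetical, and the reviewer callback merely stands in for the manual verification step a real pipeline would require.

# Sketch of a human-in-the-loop gate: the model is only a first-pass filter,
# and nothing counts as a finding until a reviewer confirms it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    cve_hint: str        # what the model thinks the issue is (illustrative field)
    explanation: str     # the model's raw explanation
    confirmed: bool = False

def triage(findings: List[Finding], reviewer: Callable[[Finding], bool]) -> List[Finding]:
    confirmed = []
    for finding in findings:
        if reviewer(finding):        # in practice, a qualified engineer makes this call
            finding.confirmed = True
            confirmed.append(finding)
    return confirmed

# A toy stand-in reviewer that rejects obviously generic, non-actionable explanations.
GENERIC_MARKERS = ("in general", "best practice", "it is recommended")
def quick_reviewer(finding: Finding) -> bool:
    return not any(marker in finding.explanation.lower() for marker in GENERIC_MARKERS)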

For researchers, the study provides a concrete benchmark of GPT‑4 performance on post‑knowledge‑cutoff CVEs and a methodology for combining perception surveys with empirical testing. For practitioners, the study offers evidence that while conversational AI can accelerate certain security tasks, it remains an assistive tool rather than a fully autonomous solution.

5. Takeaways

ChatGPT can identify vulnerabilities in real‑world code with a 61.4 % success rate, but its explanations are often generic and require human verification. Security practitioners view the model as a fast, helpful assistant for information retrieval and code review, yet they remain cautious about deploying it in production environments. The mismatch between enthusiasm and empirical performance points to the need for domain‑specific, fine‑tuned LLMs that incorporate contextual information and are validated through rigorous human‑in‑the‑loop workflows. Future research should explore the reliability of severity estimation, advanced prompt engineering to increase output confidence, and large‑scale industrial evaluations to confirm the findings.

In short, conversational AI can accelerate certain security activities, but until models are specialized and coupled with robust validation pipelines, they should be used as augmentative tools rather than replacements for experienced security professionals.