DDPT: Diffusion‑Driven Prompt Tuning for Large Language Model Code Generation

This research introduces Diffusion-Driven Prompt Tuning (DDPT), a novel technique that utilizes a diffusion model to generate high-quality, parameter-efficient prompt embeddings for large language models. By directly optimizing the LLM’s code-generation loss, DDPT significantly outperforms existing methods on natural-language-to-code generation.

Ali Babar

9/25/2025

Introduction & Motivation

Large language models (LLMs) such as GPT‑4 and the CodeT5p family have demonstrated a remarkable ability to generate executable code from natural‑language descriptions. However, the quality of the generated code is highly contingent on the prompt supplied to the model. Minor variations in wording, token ordering, or the inclusion of auxiliary context can lead to significant differences in accuracy, style, and semantic correctness. This sensitivity makes prompt engineering a time‑consuming and expertise‑heavy task.

Existing automatic prompt‑tuning methods provide some relief but introduce their own limitations. Approaches like BBT, Prefix‑Tuning, and P‑Tuning typically optimise discrete token sequences or continuous prompt embeddings while requiring explicit storage of large parameter vectors for each target model or task. Moreover, these methods are largely model‑specific and struggle to generalise across architectures, often yielding sub‑optimal performance when transferred to a new LLM or a new prompt length.

To address these gaps, we sought a technique that (1) learns continuous prompt embeddings without the need to store separate vectors for each model, (2) directly utilises the LLM’s own code‑generation loss as a training signal, and (3) remains scalable to any LLM and to prompts of arbitrary length. The motivation behind these desiderata is threefold:

  1. Parameter Efficiency – By avoiding explicit storage of prompt embeddings, we reduce memory overhead and simplify deployment.
  2. Direct Optimisation – Leveraging the LLM’s loss ensures that the optimisation objective is tightly coupled to the end task of code generation, rather than relying on proxy metrics or surrogate losses.
  3. Cross‑Model Generalisation – A framework that operates purely at the embedding level can, in principle, be applied to any transformer‑based LLM, making it a versatile tool for the community.

The proposed solution, Diffusion‑Driven Prompt Tuning (DDPT), fulfils these goals by training a diffusion model to map pure Gaussian noise to a prompt embedding that minimises the LLM’s code‑generation loss. At inference time, DDPT samples a high‑quality prompt from the learned distribution in roughly 30 seconds, obviating the need for manual prompt engineering or the storage of large parameter sets. In our experiments across several CodeT5p variants, including the instruction‑finetuned InstructCodeT5p‑16B, DDPT consistently outperforms manual prompts and existing prompt‑tuning baselines on BLEU‑4, CodeBLEU, and METEOR metrics.

These findings demonstrate that diffusion models can serve as powerful, parameter‑efficient optimisers for prompt embeddings in LLM‑based code generation. By directly learning a mapping from Gaussian noise to an embedding that minimises the LLM’s loss, we can generate high‑quality prompts on demand, paving the way for more robust, scalable, and transferable prompt‑tuning solutions in the field of natural‑language‑to‑code generation.

Methodology: DDPT

The Diffusion‑Driven Prompt Tuning (DDPT) framework introduces a continuous prompt representation and learns a mapping from Gaussian noise to an optimised prompt embedding using a diffusion model. The prompt is decomposed into a context prefix that is tuned and an instruction that remains fixed. The context tokens are first embedded via the LLM’s embedding matrix, yielding a dense vector of dimension 10 240. This vector is the target of the diffusion process.

Prompt Representation

The context prefix is treated as a fixed‑length vector; the instruction tokens are concatenated after the context during generation but are not modified by the diffusion model. By keeping the instruction constant, DDPT preserves the semantic intent of the task while allowing the context to steer the LLM toward the desired output.
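
As a concrete illustration, the sketch below shows how such a prompt could be assembled. It assumes a Hugging Face‑style model whose embedding matrix is exposed via get_input_embeddings() and keeps everything in float32 for simplicity; the function and variable names are illustrative and are not taken from the paper’s code.

```python
import torch

def build_prompt_embeddings(llm, context_ids, instruction_ids):
    """Embed a fixed-length context prefix and a fixed instruction.

    Only the flattened context vector (x0) is later optimised by the
    diffusion model; the instruction embedding stays untouched.
    """
    embed = llm.get_input_embeddings()        # frozen LLM embedding matrix
    with torch.no_grad():
        context_emb = embed(context_ids)      # (ctx_len, d_model) -- tunable part
        instr_emb = embed(instruction_ids)    # (instr_len, d_model) -- kept fixed
    # Flattening gives the dense target vector of the diffusion process,
    # e.g. ctx_len * d_model = 10 240 in the setting described above.
    x0 = context_emb.flatten()
    return x0, instr_emb
```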

Diffusion Model Design

The core of DDPT is a denoising diffusion probabilistic model (DDPM). The forward process injects Gaussian noise into the context embedding according to the schedule

\[\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}\]

where \(\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})\) and \(\alpha_t\) is given by a pre‑defined variance schedule. The reverse process is modelled by a neural network that takes the noisy embedding \(\mathbf{x}_t\), projects it to a lower‑dimensional latent space, applies a series of non‑linear transformations, and up‑projects back to the original dimension. The network outputs a directional vector that is added to \(\mathbf{x}_t\) to produce the next denoised sample.
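
The following sketch illustrates this design for a flattened 10 240‑dimensional prompt vector. The layer sizes, activation functions, and scalar timestep conditioning are our assumptions for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

def forward_noise(x0, alphas, t):
    """Forward process: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps.

    alphas is assumed to be a 1-D tensor with alphas[t] in (0, 1),
    decreasing in t, as in a standard DDPM noise schedule.
    """
    eps = torch.randn_like(x0)
    a_t = alphas[t]
    return torch.sqrt(a_t) * x0 + torch.sqrt(1.0 - a_t) * eps, eps

class PromptDenoiser(nn.Module):
    """Down-project, transform, up-project; the output is a directional
    update that is added to x_t to yield the next denoised sample."""

    def __init__(self, dim=10240, latent=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, latent),   # +1 for a scalar timestep feature
            nn.SiLU(),
            nn.Linear(latent, latent),
            nn.SiLU(),
            nn.Linear(latent, dim),
        )

    def forward(self, x_t, t_scalar):
        # t_scalar: 1-element tensor holding the normalised timestep t / T
        direction = self.net(torch.cat([x_t, t_scalar], dim=-1))
        return x_t + direction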

Unlike standard diffusion objectives that minimise a reconstruction loss, DDPT optimises the diffusion network using the LLM’s code‑generation loss. Specifically, for each denoised prompt, the frozen LLM generates code conditioned on the prompt and the instruction, and the loss (cross‑entropy over the token sequence) is back‑propagated through the diffusion network. This loss, detailed in Equation (3) of the paper, directly encourages the diffusion model to produce prompts that lower the LLM’s generation error.
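
A single training step could then look like the sketch below. It assumes an encoder‑decoder LLM in the CodeT5p style that accepts inputs_embeds and labels and returns a cross‑entropy loss, with its parameters frozen (requires_grad=False) and all tensors in float32, plus the helpers from the previous sketches. This is a schematic of the objective, not the paper’s reference implementation.

```python
import torch

def ddpt_training_step(denoiser, llm, x0, instr_emb, target_ids,
                       alphas, T, optimizer):
    """One optimisation step driven purely by the LLM's code-generation loss."""
    # The LLM is frozen and only used to score the code generated under the
    # candidate prompt; gradients update the diffusion network alone.
    t = torch.randint(1, T, (1,)).item()
    x_t, _ = forward_noise(x0, alphas, t)

    # One reverse step yields a candidate denoised prompt embedding.
    x_hat = denoiser(x_t, torch.tensor([t / T]))
    prompt_emb = x_hat.view(-1, instr_emb.size(-1))       # (ctx_len, d_model)

    # The tuned prompt is prepended to the fixed instruction embedding.
    inputs_embeds = torch.cat([prompt_emb, instr_emb], dim=0).unsqueeze(0)
    out = llm(inputs_embeds=inputs_embeds, labels=target_ids.unsqueeze(0))

    # Cross-entropy over the reference code tokens is back-propagated
    # through the diffusion network only (the role played by Equation (3)).
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```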

Sampling Procedure

During inference, DDPT starts from a pure Gaussian noise vector at timestep 2000. It iteratively applies the learned reverse denoising steps, each time refining the prompt embedding. After 2000 steps the denoised embedding is added to the original context embedding to form the final prompt vector. The entire sampling procedure takes roughly 30 seconds per prompt, making it efficient for practical use.
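
Under the same assumptions, the sampling loop can be sketched as follows; the step count and dimensionality follow the description above, while the update rule reuses the hypothetical PromptDenoiser from the methodology sketch.

```python
import torch

@torch.no_grad()
def sample_prompt(denoiser, context_emb, dim=10240, T=2000):
    """Draw an optimised prompt embedding starting from pure Gaussian noise."""
    x_t = torch.randn(dim)                            # start at timestep T
    for t in range(T, 0, -1):
        x_t = denoiser(x_t, torch.tensor([t / T]))    # learned reverse step
    # The denoised vector is added to the original context embedding
    # (reshape to (ctx_len, d_model) before prepending to the instruction).
    return context_emb.flatten() + x_t
```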

The resulting prompt embeddings occupy a task‑specific semantic region of the embedding space. t‑SNE visualisations show that optimized prompts cluster around action‑related tokens such as editor, learn, and player, indicating that the diffusion model learns to emphasise semantic features that the LLM associates with high‑quality code generation.

By training the diffusion model solely on the LLM’s own loss signal and operating in the continuous embedding space, DDPT eliminates the need to store large discrete prompt vectors or tune individual token parameters. The framework thus offers a scalable, model‑agnostic method for generating high‑quality prompts on demand for large language model‑based code generation tasks.

Experimental Evaluation

The experimental evaluation of DDPT is carried out on two benchmark datasets for natural‑language‑to‑code generation: CodeAlpaca, a dataset of roughly 20 k examples, and the CoNaLa corpus for Python. We evaluate the method across four variants of the CodeT5p family: the 2 B, 6 B, and 16 B parameter models, and the instruction‑finetuned InstructCodeT5p 16B. Decoding is performed greedily, i.e., no sampling or beam search is used, which keeps inference deterministic. The metrics reported are the standard code‑generation measures: BLEU‑4, CodeBLEU, METEOR, ChrF, and Rouge‑L. The diffusion model is trained with 2 000 denoising steps for all experiments, matching the design described in the methodology section.
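
For reference, greedy decoding with a CodeT5p checkpoint can be reproduced along the lines of the snippet below. The checkpoint name, prompt, and generation length are illustrative choices, and the decoder_input_ids handling follows the public CodeT5p model cards rather than the paper’s evaluation harness.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-2b"      # illustrative choice of variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

enc = tokenizer(
    "Write a Python function that returns the n-th Fibonacci number.",
    return_tensors="pt",
).to("cuda")
enc["decoder_input_ids"] = enc["input_ids"].clone()

# Greedy decoding: no sampling, a single beam, hence deterministic outputs.
out = model.generate(**enc, do_sample=False, num_beams=1, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```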

Results demonstrate that DDPT consistently outperforms both manually engineered prompts and existing prompt‑tuning baselines on BLEU‑4 and CodeBLEU. For the largest model, InstructCodeT5p 16B, evaluated on CodeAlpaca, DDPT achieves a BLEU‑4 score of approximately 0.43 compared with 0.38 for the manual prompt, while CodeBLEU remains competitive at roughly 0.35 versus 0.33 for the manual prompt. On the CoNaLa dataset, the performance gains are less uniform for smaller models but become pronounced for the 16 B and instruction‑finetuned variants, indicating that DDPT scales with model capacity.

Qualitatively, generated code produced with DDPT prompts shows improved semantic fidelity: in a Fibonacci example, the base‑case logic is correctly implemented, whereas a manually prompted version contains an off‑by‑one error. In a dictionary‑sorting scenario, DDPT’s output aligns closely with the reference implementation. Visual analysis of the optimized prompt embeddings using t‑SNE reveals that the vectors cluster within a region dense with action‑related tokens such as editor, learn, and player, suggesting that the diffusion model learns a task‑specific semantic subspace.

Several limitations and threats to validity are acknowledged. The evaluation is confined to the CodeT5p family; comparison with other large language models such as GPT‑4 or LLaMA‑Code was not performed due to resource constraints. Greedy decoding may underestimate the potential performance of the models, as beam search or temperature‑based sampling could yield higher scores; such decoding strategies were not explored in this work. Execution‑based correctness metrics, notably Pass@k, are absent, limiting the assessment of functional correctness. Finally, the prompt embeddings are of fixed length, which restricts prompt flexibility; handling variable‑length prompts is left for future research.

Discussion & Future Work

The DDPT framework demonstrates a clear advantage over existing prompt‑tuning baselines by directly learning a mapping from Gaussian noise to an optimised prompt embedding that minimises the LLM’s own code‑generation loss. Across multiple CodeT5p variants, including the instruction‑finetuned InstructCodeT5p 16B, DDPT achieved higher BLEU‑4 and CodeBLEU scores, with the InstructCodeT5p 16B model reaching a BLEU‑4 of approximately 0.43 versus 0.38 for the best manual prompt. This improvement is consistent across CodeAlpaca and CoNaLa, though the smaller models show mixed gains on CoNaLa, suggesting a model‑size dependency on the diffusion process.

In qualitative evaluations, DDPT prompts produced code that was more semantically correct: a Fibonacci implementation produced the correct base‑case logic, whereas the manual prompt introduced an off‑by‑one error, and a dictionary‑sorting example aligned more closely with the ground truth under DDPT. t‑SNE visualisations of the optimized prompt embeddings revealed clustering around a task‑specific semantic subspace populated by action‑related tokens such as editor, learn, and player, indicating that the diffusion model converges to a region of the embedding space that is well suited to the code‑generation task.

Despite these promising results, several limitations temper the generality of the findings. First, the evaluation is confined to the CodeT5p family; broader comparisons with other LLMs such as GPT‑4 or LLaMA‑Code were infeasible due to resource constraints, leaving open the question of cross‑model transferability. Second, decoding relied on greedy search; the absence of beam search or temperature sampling may underestimate the true potential of the prompts. Third, the study relied on metric‑based correctness measures (BLEU‑4, CodeBLEU, METEOR, ChrF, Rouge‑L) and did not incorporate execution‑based correctness checks such as Pass@k, which could reveal discrepancies between textual similarity and functional accuracy. Fourth, the prompt embedding length was fixed, preventing the model from handling variable‑length prompts that might capture richer contextual cues.

Future work should address these gaps by extending DDPT to additional instruction‑following LLMs, thereby testing its universality. Controlled diffusion techniques could be explored to steer prompt sampling toward specific properties, such as code style or language, offering a higher degree of controllability. Incorporating positional embeddings or attention mechanisms would allow the diffusion model to process prompts of arbitrary length, aligning the method with real‑world prompt engineering practices. Efficiency gains could be pursued through reduced diffusion steps or accelerated samplers while maintaining or improving prompt quality. Finally, integrating execution‑based correctness metrics and safety checks will help ensure that the generated code not only matches reference outputs but also operates reliably and securely.

In sum, DDPT provides a scalable, parameter‑efficient approach to prompt optimisation for LLM code generation, yet its applicability to a wider range of models, decoding strategies, and evaluation metrics remains an open research avenue.

Conclusion

Diffusion‑Driven Prompt Tuning (DDPT) establishes a new paradigm for optimizing prompt embeddings in large language model (LLM) code generation. By treating the prompt as a continuous vector and learning a direct mapping from Gaussian noise to an embedding that minimises the LLM’s own code‑generation loss, DDPT eliminates the need to store large prompt vectors or perform discrete token search. The diffusion network, trained with a loss that directly reflects the LLM’s performance on target code, learns to produce prompts that lie in a task‑specific semantic subspace, as evidenced by t‑SNE visualizations showing clustering around action‑related tokens.

In our experiments across the CodeT5p family (2B, 6B, 16B) and the instruction‑finetuned InstructCodeT5p 16B, DDPT consistently outperformed manual prompts and existing prompt‑tuning baselines on BLEU‑4, CodeBLEU, METEOR, and other metrics. For example, on the CodeAlpaca dataset, InstructCodeT5p 16B achieved a BLEU‑4 of approximately 0.43 with DDPT, surpassing the manual prompt BLEU‑4 of 0.38 while maintaining a CodeBLEU close to 0.35. Similar gains were observed on the CoNaLa dataset for larger models, demonstrating the scalability of the approach to models with varying parameter counts.

The inference procedure is efficient: starting from Gaussian noise at timestep 2000, the diffusion model iteratively denoises to produce an optimised prompt embedding in roughly 30 seconds. This fast sampling, combined with the absence of a storage overhead for prompt vectors, makes DDPT suitable for real‑time or on‑the‑fly prompt generation in production settings.

Beyond raw metrics, qualitative evaluation revealed that DDPT‑generated prompts produce more semantically correct code. In a Fibonacci example, the generated base‑case logic was accurate, whereas a manual prompt led to an off‑by‑one error. In a dictionary‑sorting task, DDPT’s output closely matched the ground truth, indicating that the learned prompts capture essential problem‑specific cues.

The method’s strengths are tempered by certain limitations. Experiments were confined to the CodeT5p family; generalization to other LLMs such as GPT‑4, LLaMA‑Code, or Codex remains an open question. Greedy decoding was used, potentially underestimating performance, and no execution‑based correctness metrics (e.g., Pass@k) were incorporated. Additionally, the current prompt representation assumes a fixed‑length context, limiting flexibility for variable‑length prompts.

Future research directions include cross‑model generalization, controlled diffusion to steer prompt properties (e.g., code style or target language), and integration of execution‑based checks for robustness. Accelerating the diffusion sampler or reducing the number of denoising steps could further enhance efficiency without compromising quality.

In summary, DDPT demonstrates that diffusion models can serve as powerful, parameter‑efficient optimizers for prompt embeddings in LLM‑based code generation. By directly learning a mapping from noise to an embedding that minimises the LLM’s loss, the approach delivers high‑quality prompts on demand, surpassing manual and existing tuning baselines across multiple models and datasets, and laying a foundation for future work in prompt optimization and controllable code generation.