METAL: A Metamorphic Testing Framework for Large Language Model Quality Assessment

This research introduces a novel metamorphic testing paradigm, operationalized through the METAL framework, to address the critical shortcomings of traditional LLM quality assurance methods by providing a scalable, annotation-free, and comprehensive assessment pipeline.

Ali Babar

9/25/2025

Introduction & Motivation

We are witnessing an unprecedented integration of large language models (LLMs) into products that influence decision-making, content creation, and information retrieval. The quality of these models—robustness to input perturbations, fairness across demographic groups, consistency of outputs, and computational efficiency—directly affects user trust and downstream outcomes. Consequently, systematic assessment of LLM quality is not a theoretical exercise but a practical necessity for developers, regulators, and users alike.

Traditional quality-assurance approaches for LLMs fall into three broad categories: benchmark evaluation on curated datasets, manual human-in-the-loop testing, and adversarial or stress testing. Benchmark datasets such as GLUE or SuperGLUE provide a convenient yardstick, but they are static, limited to a handful of tasks, and fail to capture the breadth of real-world inputs. Manual testing, while flexible, scales poorly; it requires costly human annotation, suffers from inter-annotator variability, and introduces bias when the annotator pool is not diverse. Adversarial testing attempts to expose weaknesses by crafting targeted perturbations, yet it traditionally relies on hand-crafted rules or human intuition, which limits coverage and reproducibility.

Moreover, most existing methods focus on a single quality attribute. For instance, robustness studies typically measure accuracy drop under noisy inputs, whereas fairness analyses often rely on demographic-aware dataset splits. In practice, an LLM must simultaneously satisfy multiple attributes; a failure in one can cascade into others. For example, a model that is robust to character-level noise may still exhibit demographic bias if the perturbations do not reflect real-world linguistic diversity.

Another critical shortcoming is the dependence on labeled data. Labeled corpora are expensive to produce and may not reflect the distribution of inputs that a deployed model will encounter. An annotation-free, data-driven approach that can automatically generate challenging test cases would dramatically lower the barrier to comprehensive LLM evaluation.

These limitations motivate the need for a new paradigm: metamorphic testing. By leveraging a set of formal metamorphic relations (MRs) that capture expected input-output behavior, we can generate systematic, reproducible perturbations without requiring labels. Metamorphic testing scales across multiple tasks and quality dimensions, and it can harness the generative power of LLMs themselves to produce challenging test cases. In the following sections, we describe how the METAL framework operationalizes this paradigm to deliver a scalable, annotation-free quality assessment pipeline that addresses the gaps left by existing methods.
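
To make the paradigm concrete, the sketch below shows a single Equivalence-style relation for a sentiment task; classify_sentiment is a hypothetical stand-in for any LLM-backed classifier, and the synonym swap is deliberately simplistic.

def synonym_replace(text: str) -> str:
    # Toy semantic-preserving perturbation: swap one word for a synonym.
    return text.replace("excellent", "outstanding")

def equivalence_holds(original: str, classify) -> bool:
    # MR: a meaning-preserving edit must not change the predicted label.
    return classify(original) == classify(synonym_replace(original))

# equivalence_holds("The battery life is excellent.", classify_sentiment)
# A False result flags a robustness failure without requiring any labeled data.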

METAL Framework Overview

We designed METAL as a modular, fully automated pipeline that transforms raw, unlabelled text into a rigorous quality assessment of large language models (LLMs). The architecture is intentionally linear yet flexible, allowing us to plug in new perturbation functions or MR templates with minimal re-engineering.

Core Components

  1. MR Templates – Five formal templates codify the expected relationship between an original input x and a perturbed input x′ for each quality attribute (QA). They are:
    • Equivalence: f(x) ≈ f(x′) – the model should produce identical or highly similar outputs.
    • Discrepancy: f(x) ≠ f(x′) – a deliberate change in the target property is expected.
    • Set-Equivalence: f(x) ≈ f(x′₁) ≈ … ≈ f(x′ₙ) – tests consistency across a set of related inputs.
    • Distance: d(f(x), f(x′)) ≤ τ – bounds the semantic distance between outputs.
    • Set-Distance: maxᵢ d(f(x), f(x′ᵢ)) ≤ τ – extends Distance to a set of perturbed inputs.
      These templates are expressed as predicate functions that receive the model outputs and return a binary pass/fail signal (see the sketch after this list).
  2. Perturbation Functions – Thirteen functions manipulate the original text to generate x′. They are grouped into semantic-preserving (e.g., character-swap, synonym-replace) and semantic-altering (e.g., convert-to-leet, assign-demographic-group) categories. Each function takes a text string and outputs a perturbed variant, optionally accompanied by a semantic quality score (e.g., BLEU or cosine similarity) used later in the effectiveness metric.
  3. Execution Module – This layer orchestrates the pipeline. For each original input, it invokes the chosen perturbation functions to produce a set of perturbed inputs, sends both the original and perturbed texts to the target LLM via the official API, respecting token limits, and logs the raw LLM responses along with metadata (timestamp, prompt, session ID).
  4. Evaluation Module – Once responses are collected, this module applies the MR templates to each pair (or set) of outputs. It computes the Attack Success Rate (ASR), i.e., the fraction of MRs that fail, and records the binary outcome per MR type.
  5. Automation & Environment – The entire workflow is implemented in Python 3.11.4 and executed in a Conda environment on a machine with 16 GB of RAM. The automation script manages API-key rotation, request batching, and error handling, ensuring reproducibility across experiments.
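
As referenced in item 1 above, the sketch below expresses the five MR templates as predicate functions. The similarity and distance callables and the threshold τ are assumptions standing in for whichever semantic measure (e.g., cosine similarity over sentence embeddings) the pipeline is configured with.

from typing import Callable, Sequence

Similarity = Callable[[str, str], float]   # assumed: returns a score in [0, 1]
Distance = Callable[[str, str], float]     # assumed: semantic distance measure

def equivalence(out_orig: str, out_pert: str, sim: Similarity, tau: float = 0.9) -> bool:
    # Pass if the two outputs are identical or highly similar.
    return sim(out_orig, out_pert) >= tau

def discrepancy(out_orig: str, out_pert: str, sim: Similarity, tau: float = 0.9) -> bool:
    # Pass if the deliberate change in the target property is reflected in the output.
    return sim(out_orig, out_pert) < tau

def set_equivalence(out_orig: str, out_perts: Sequence[str], sim: Similarity, tau: float = 0.9) -> bool:
    # Pass if every perturbed output stays equivalent to the original output.
    return all(sim(out_orig, o) >= tau for o in out_perts)

def distance(out_orig: str, out_pert: str, dist: Distance, tau: float) -> bool:
    # Pass if the semantic distance between outputs stays within the bound.
    return dist(out_orig, out_pert) <= tau

def set_distance(out_orig: str, out_perts: Sequence[str], dist: Distance, tau: float) -> bool:
    # Pass if the worst-case distance over the set stays within the bound.
    return max(dist(out_orig, o) for o in out_perts) <= tau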

Data Flow Diagram (Conceptual)

[Original Text] ──► [Perturbation Functions] ──► [Perturbed Text]
[Original + Perturbed Text] ──► [Execution Module] ──► [LLM API] ──► [Raw Outputs]
[Raw Outputs] ──► [Evaluation Module] ──► [MR Pass/Fail] ──► [ASR Calculation & Logging]

At runtime, each original input is processed independently, guaranteeing that perturbations do not leak context between runs. The Evaluation Module operates after all outputs are available, applying the MR predicates and producing a binary matrix that records which MRs were satisfied.
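
A minimal sketch of this per-input loop is shown below; call_llm is a hypothetical wrapper around the vendor API that opens a fresh session per request, and log stands in for the framework's logging sink.

import time
import uuid

def run_one_input(original: str, perturbations, call_llm, log) -> list[dict]:
    # Process a single original input in isolation so no context leaks between runs.
    out_orig = call_llm(original)
    records = []
    for perturb in perturbations:
        perturbed = perturb(original)
        out_pert = call_llm(perturbed)           # fresh session per request
        records.append({
            "session_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "perturbation": perturb.__name__,
            "original": original,
            "perturbed": perturbed,
            "out_original": out_orig,
            "out_perturbed": out_pert,
        })
    log(records)                                 # raw responses plus metadata
    return records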

MR Generation and Perturbation Workflow

  1. MR Generation – For each QA, we instantiate all applicable templates. For robustness, we generate 240 MRs; for fairness, 21 demographic-group MRs; and 6 MRs for non-determinism and efficiency.
  2. Perturbation Application – We apply the 13 perturbation functions to each original input, producing multiple x′ variants. The perturbation quality is quantified using a semantic similarity metric (e.g., cosine similarity over sentence embeddings). This score is later combined with ASR to compute an Effectiveness Metric (EFM).
  3. Execution & Evaluation – Both x and x′ are sent to the target LLM; the resulting outputs are evaluated against the MR templates. Pass/fail results are aggregated per MR type and per LLM.

By chaining these components, METAL transforms raw text into a structured assessment of LLM quality attributes without manual labeling. The modularity of MR templates and perturbation functions allows us to extend the framework to new tasks or QAs with minimal code changes.
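
As a concrete example of the perturbation step, the sketch below pairs one semantic-preserving function (adjacent-character swaps) with a perturbation-quality score; embed is an assumption standing in for the configured sentence-embedding model.

import math
import random

def swap_characters(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    # Semantic-preserving perturbation: swap n random pairs of adjacent characters.
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturbation_quality(original: str, perturbed: str, embed) -> float:
    # Cosine similarity between sentence embeddings of the original and perturbed text.
    a, b = embed(original), embed(perturbed)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0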

MR Generation & Automation

We generate and execute a large set of metamorphic relations (MRs) automatically, leveraging both handcrafted perturbation functions and large-language-model (LLM) prompts. The entire pipeline is scripted in Python 3.11.4 and runs on a single 16 GB RAM machine; the scripts are bundled with a Conda environment file and published in the METAL GitHub repository.

Scale of MR Generation

  • Total MRs: 273
    • 240 for robustness (Equivalence, Discrepancy, Set-Equivalence, Distance, Set-Distance templates)
    • 21 for fairness (covering 21 demographic groups via the Equivalence and Distance templates)
    • 6 for non-determinism (Set-Equivalence) and efficiency (Distance)
  • Input Corpus: 900 unlabelled texts drawn from Amazon review snippets, news article abstracts, and Wikipedia headings. Token lengths span 15 to roughly 4,000, ensuring coverage of both short and long contexts.
  • Execution Volume: For each target LLM we send ~42 000 API requests, totaling ~19 150 000 tokens per model. The five MR-generation methods (hand-crafted, LLM-prompted, hybrid, random, and baseline) are each run five times to quantify variance. This workload is achieved within a few hours per LLM on a single GPU instance.

Automation Workflow

  1. Perturbation Module – Applies 13 predefined functions (semantic-preserving and semantic-altering) to the original text. Each function is parameterised (e.g., number of swaps, synonym pool) and can be composed into a single perturbation.
  2. LLM-Based Generation – For the self/cross-examination paradigm, we prompt the target LLM to produce perturbations that satisfy a given MR template. The prompt includes a brief description of the template and a few illustrative examples; the LLM returns a JSON-encoded list of perturbed strings.
  3. Execution Engine – Submits both the original and perturbed inputs to the target LLM via the OpenAI or vendor API. Each call is isolated in a new session to avoid context contamination.
  4. Evaluation Module – Applies the MR template to the paired outputs. For equivalence-type MRs we compare semantic similarity; for discrepancy-type MRs we compute output divergence using cosine distance over sentence embeddings of the outputs.
  5. Result Aggregation – Generates a pass/fail matrix per MR, logs API usage, and records token counts. The data is stored in a SQLite database for downstream analysis.
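
The sketch below illustrates steps 4–5 in simplified form: a SQLite pass/fail matrix and the ASR query derived from it. The schema is an assumption; the actual repository may organise results differently.

import sqlite3

def init_db(path: str = "metal_results.db") -> sqlite3.Connection:
    # Simplified, assumed schema for the per-MR pass/fail matrix.
    con = sqlite3.connect(path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS mr_results (
            model TEXT, task TEXT, mr_type TEXT, mr_id TEXT,
            passed INTEGER, prompt_tokens INTEGER, completion_tokens INTEGER
        )
    """)
    return con

def record_result(con, model, task, mr_type, mr_id, passed, prompt_tokens, completion_tokens):
    con.execute("INSERT INTO mr_results VALUES (?, ?, ?, ?, ?, ?, ?)",
                (model, task, mr_type, mr_id, int(passed), prompt_tokens, completion_tokens))
    con.commit()

def attack_success_rate(con, model, task) -> float:
    # ASR = fraction of MRs whose relational property was violated (passed = 0).
    total, failed = con.execute(
        "SELECT COUNT(*), SUM(passed = 0) FROM mr_results WHERE model = ? AND task = ?",
        (model, task)).fetchone()
    return (failed or 0) / total if total else 0.0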

Self/Cross-Examination Paradigm

The novel "LLM-as-tester" approach consists of two roles:

  • Self-examination – The same LLM that is being tested generates perturbations for itself. Because the perturbation logic is embedded in the prompt, the LLM can produce domain-specific edits (e.g., inserting demographic markers) that are otherwise difficult to hand-craft.
  • Cross-examination – The target LLM is tested against perturbations produced by a different LLM. For example, we generate MR-valid sentences with ChatGPT and evaluate Google PaLM’s responses to them. This cross-model testing exposes weaknesses that are model-specific versus universally problematic.
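
A minimal sketch of the tester role is shown below; the prompt wording is illustrative, and call_llm is a hypothetical wrapper around whichever LLM plays the tester (the model under test for self-examination, a different model for cross-examination).

import json

MR_PROMPT = """You are generating test inputs for a metamorphic relation.
Template: Equivalence - the perturbed text must keep the original meaning,
so the model under test should produce an equivalent output.
Example: "The film was great." -> "The movie was great."
Return a JSON list of {n} perturbed versions of the text below.
Text: {text}"""

def generate_mr_inputs(text: str, call_llm, n: int = 3) -> list[str]:
    # Ask a tester LLM for perturbations that satisfy the given MR template.
    raw = call_llm(MR_PROMPT.format(n=n, text=text))
    try:
        variants = json.loads(raw)
    except json.JSONDecodeError:
        variants = []          # malformed responses are simply skipped
    return [v for v in variants if isinstance(v, str) and v.strip()]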

Our experiments demonstrate that ChatGPT-generated MRs achieve a higher effectiveness metric (EFM) than handcrafted or PaLM-generated MRs across all three target LLMs. The self-examination MRs also reveal subtle robustness gaps that are invisible to hand-crafted perturbations, confirming that an LLM can act as a competent test generator.

By automating MR generation, execution, and evaluation, we achieve a fully reproducible, annotation-free quality assessment pipeline that scales to thousands of tests without human intervention. The resulting pass/fail statistics form the basis for the experimental results presented in Section 4.

Experimental Evaluation

We conducted a comprehensive experimental campaign to answer the three research questions (RQ1–RQ3) posed by METAL. The evaluation leveraged three commercial and open-source LLMs—Google PaLM, OpenAI ChatGPT (3.5‑Turbo), and Meta Llama 2—across six core language-model tasks: toxicity detection, sentiment analysis, news classification, question-answering, summarization, and information retrieval. Our protocol followed the design described in the paper’s "Experimental Design & Results" section, ensuring reproducibility and statistical rigor.

Experimental Setup

  • Corpus: 900 unlabelled texts from Amazon reviews, news articles, and Wikipedia headings, ranging from 15 to roughly 4,000 tokens. The diversity mitigated overfitting to a single genre.
  • MR Generation: 273 metamorphic relations (MRs) automatically instantiated from 13 perturbation functions and five MR templates. Of these, 240 targeted robustness, 21 examined fairness across 21 demographic groups, and 6 assessed non-determinism and efficiency.
  • API Interactions: Each MR required two inference calls (original and perturbed) per target LLM, yielding ~42,000 calls and ~19,150,000 tokens per model. To reduce variance, we repeated the entire MR set five times with fresh session contexts and randomised input order.
  • Metrics: We computed the Attack Success Rate (ASR) as the proportion of MRs that failed to satisfy the expected relational property. For fairness, we aggregated ASR across demographic subsets. To capture semantic fidelity, we introduced the Effectiveness Metric (EFM) = ASR × PerturbQuality, where PerturbQuality is a cosine-based semantic-similarity score between the original and perturbed inputs.
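
The metric computation itself is simple; the sketch below shows one plausible aggregation (ASR weighted by the mean perturbation quality), with the averaging step being an assumption rather than the paper's exact formula.

def asr(pass_flags: list[bool]) -> float:
    # Attack Success Rate: fraction of MRs whose relational property failed.
    return sum(not p for p in pass_flags) / len(pass_flags)

def efm(pass_flags: list[bool], qualities: list[float]) -> float:
    # Effectiveness Metric: ASR weighted by the mean semantic quality of the
    # perturbations (cosine similarity between original and perturbed inputs).
    return asr(pass_flags) * (sum(qualities) / len(qualities))

# Example: 3 of 10 MRs fail with mean perturbation quality 0.9 -> EFM = 0.3 * 0.9 = 0.27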

RQ1 — Do the MRs reveal quality risks?

Across all three models, MRs exposed systematic vulnerabilities in each quality attribute. For robustness, PaLM achieved the lowest ASR (0.07), misbehaving on only 7 % of perturbed inputs, whereas Llama 2 exhibited the highest ASR (0.32); the exception was information retrieval, where PaLM’s ASR rose to 0.15. Fairness results were model-specific: PaLM had the lowest ASR on toxicity detection (0.04), ChatGPT was similarly low on sentiment analysis (0.06), and Llama 2 consistently reported the highest ASR (0.28) across tasks, revealing pronounced demographic bias. Non-determinism was negligible for PaLM (variance < 1 ms) and comparable between ChatGPT and Llama 2 (~5 ms). Efficiency analyses showed that PaLM’s inference times were the most stable (±2 % across repeats), whereas Llama 2’s latency varied by as much as 2,000 s in worst-case generative scenarios, highlighting a scalability concern.

RQ2 — Which MRs are most effective?

EFM analysis pinpointed distinct MR families as most potent for each task. ConvertToL33tFormat, a semantic-altering perturbation, yielded the highest EFM for toxicity detection, news classification, and text summarization, underscoring PaLM’s sensitivity to orthographic noise. ShuffleCharacter and SwapCharacter (character-level MRs) dominated in sentiment analysis, revealing that fine-grained character order can sway polarity predictions. For generative tasks (question-answering, summarization), word-level MRs such as ReplaceSynonym and AddRandomWord produced the strongest signals, suggesting that context-aware semantic drift challenges model consistency. Sentence-level ReplaceRandomSentence emerged as particularly effective for information retrieval, indicating that paraphrasing at the discourse level can disrupt retrieval ranking. Shapley-value decomposition further confirmed that character-level MRs were most influential for classification, whereas word-level MRs held sway for generative outputs.

RQ3 — Feasibility of LLM-as-Tester

We evaluated whether an LLM could generate high-quality perturbations against another LLM. ChatGPT-generated MRs achieved the highest EFM across all target models, outperforming both hand-crafted perturbations and PaLM-generated MRs. This demonstrates that a conversational LLM can serve as a competent tester, autonomously producing adversarial inputs that uncover latent weaknesses. PaLM-generated MRs were also effective but lagged behind ChatGPT, likely due to differences in prompt engineering and token budget. The cross-examination paradigm thus validates the "LLM-as-Tester" concept, opening avenues for continuous, self-auditing language-model ecosystems.

In sum, our experimental evaluation confirms that METAL’s automated MR generation uncovers actionable quality risks, that distinct MR families target specific attributes, and that state-of-the-art LLMs can reliably act as test generators for other models.

Implications & Future Work

Practical Implications for ML Engineers and Industry

We demonstrate that METAL can be deployed against any fine-tuned LLM without labeled data. A 900-text corpus (15 – 4 k tokens) generates 273 MRs and triggers roughly 42 000 API calls per target model, covering 19 M tokens over five repetitions. The entire pipeline finishes in under an hour on a single 16 GB RAM machine, making it suitable for continuous-integration workflows. Engineers can insert a METAL scan after each fine-tuning iteration to quantify robustness, fairness, non-determinism, and efficiency; the ASR and EFM metrics provide actionable thresholds that can be integrated into release gates.
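
A release gate of this kind can be a few lines of Python; the thresholds below are illustrative, not values recommended by the paper.

import sys

# Illustrative release-gate thresholds; actual budgets should be set per project.
THRESHOLDS = {"robustness_asr": 0.10, "fairness_asr": 0.05}

def release_gate(scan_results: dict[str, float]) -> int:
    # Fail the CI job (non-zero exit) if any quality attribute exceeds its ASR budget.
    violations = {k: v for k, v in scan_results.items()
                  if k in THRESHOLDS and v > THRESHOLDS[k]}
    for attribute, value in violations.items():
        print(f"FAIL {attribute}: ASR {value:.2f} exceeds budget {THRESHOLDS[attribute]:.2f}")
    return 1 if violations else 0

if __name__ == "__main__":
    # In a real pipeline these numbers would come from the METAL scan output.
    sys.exit(release_gate({"robustness_asr": 0.07, "fairness_asr": 0.04}))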

Expanding MR Coverage

We plan to broaden the MR taxonomy along two axes. First, we will add new quality attributes such as explainability, security (e.g., prompt-injection resistance), and compliance (e.g., data-privacy constraints). For explainability, Counterfactual Equivalence MRs will enforce consistency of model explanations under input perturbations. For security, Adversarial Prompt MRs will confirm that small prompt changes do not trigger policy violations. Second, we will incorporate additional tasks—retrieval-augmented generation, dialogue systems, and multimodal LLMs—by defining task-specific MR templates that capture expected output relations.

Prompt-Perturbation MRs

Future work will target the prompt itself, creating a new class of MRs. Prompt-Swap MRs replace an instruction with a semantically equivalent alternative and verify output consistency, testing prompt interpretability. Prompt-Length MRs incrementally extend or truncate a prompt while preserving the core instruction, then check that the output remains within an acceptable similarity window. We will generate these automatically by prompting a secondary LLM to produce paraphrased or shortened prompts, creating a "prompt-as-tester" loop.
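
A Prompt-Swap relation could look like the sketch below, where paraphrase and sim are assumptions standing in for a paraphrasing LLM and an embedding-based similarity measure.

def prompt_swap_mr(instruction: str, payload: str, call_llm, paraphrase, sim, tau: float = 0.85) -> bool:
    # Prompt-Swap MR sketch: a semantically equivalent instruction should yield
    # an output within the similarity window tau.
    original_out = call_llm(f"{instruction}\n\n{payload}")
    swapped_out = call_llm(f"{paraphrase(instruction)}\n\n{payload}")
    return sim(original_out, swapped_out) >= tau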

Optimizing MR Generation

Not all MRs contribute equally to a QA metric. We will apply data-driven optimization to identify high-impact MRs. Bayesian optimization over perturbation parameters (e.g., number of synonym replacements) will maximize EFM. Reinforcement-learning agents will generate perturbations conditioned on a target QA, rewarding the resulting ASR weighted by semantic similarity. Shapley-value analysis on a larger MR set will rank individual perturbations by marginal contribution, guiding a curriculum-style MR selection. The result will be a compact, high-yield MR library that reduces runtime while preserving detection power.

Open-Source Roadmap

We plan to maintain the METAL repository as a living ecosystem. A clear contribution guide will enable researchers to submit new MR templates, perturbation functions, and tasks. A scheduled GitHub workflow will run METAL against the latest GPT-4, Claude, and open-source LLMs, publishing a public leaderboard. Educational Jupyter notebooks will illustrate the end-to-end process, enabling students to experiment with MR generation and evaluation.

Conclusion

In this work, we introduced METAL, a systematic metamorphic testing framework that automatically generates hundreds of metamorphic relations (MRs) to evaluate key quality attributes of large language models (LLMs). By leveraging a diverse set of MR templates—Equivalence, Discrepancy, Set-Equivalence, Distance, and Set-Distance—alongside a library of 13 perturbation functions, we were able to assess robustness, fairness, non-determinism, and efficiency across six core LLM tasks: toxicity detection, sentiment analysis, news classification, question-answering, summarization, and information retrieval.

The evaluation pipeline consists of three tightly coupled stages. First, the Execution Module produces perturbed inputs from a corpus of 900 unlabelled texts and records the outputs of target LLMs. Second, the Evaluation Module applies the MR templates to the logged outputs, yielding binary pass/fail signals that we aggregate into the Attack Success Rate (ASR). Third, we refine ASR with a semantic-similarity score—derived from cosine similarity on sentence embeddings—to compute the Effectiveness Metric (EFM), which balances raw failure rates with perturbation quality.

Our experiments on Google PaLM, OpenAI’s ChatGPT, and Meta’s Llama 2 reveal consistent patterns. PaLM consistently achieves the lowest ASR for robustness and fairness, indicating superior resilience to both semantic-preserving and altering perturbations. Conversely, Llama 2 displays the highest ASR in most tasks, suggesting heightened vulnerability. ChatGPT’s performance is task-dependent: it excels in generating highly effective MRs via self- and cross-examination, achieving the highest EFM across all LLMs in several tasks.

The scale of our MR generation, 273 MRs in total with 240 dedicated to robustness, enabled a comprehensive exploration of perturbation spaces. Each MR was instantiated five times with different seeds, resulting in approximately 42,000 API requests per LLM and 19.15 million tokens processed. This breadth uncovered subtle failure modes, such as the wide variance in inference latency for Llama 2 (roughly 1,500 s to 2,000 s in worst-case generative scenarios) and non-deterministic behavior in PaLM during information retrieval.

Self-examination, wherein an LLM generates perturbations for another LLM, proved viable. ChatGPT-generated MRs achieved high EFM against all three target models, demonstrating that an LLM can act as a tester for another. This "LLM-as-tester" paradigm opens avenues for automated quality assurance without human-crafted test cases.

From a practical standpoint, METAL requires no labeled data and operates on any fine-tuned LLM, making it suitable for industry deployments and small-business evaluations. Future work will expand MR coverage to additional quality attributes—such as explainability and security—integrate prompt-perturbation MRs, and develop optimization techniques for generating even more effective MR sets. A user-friendly CLI and visualization dashboard will further lower the barrier to adoption.

In summary, METAL delivers a scalable, reproducible, and open framework for assessing LLM quality attributes, providing actionable insights for developers, researchers, and practitioners alike.