{
  "Failure Severity": {
    "description": "Failures uncovered through DNN fuzzing vary widely in their severity and real-world impact. Some failures correspond to model robustness errors, such as mispredictions or prediction inconsistencies under semantically preserving input mutations, while others involve undesired or unsafe behaviors, or violations of explicit safety and security mechanisms. These different classes of failures carry very different implications for model reliability, safety, and security. We therefore include failure severity as an evaluation metric to assess the severity and impact of the failures a DNN fuzzer can uncover, distinguishing fuzzers that expose only low-impact robustness errors from those capable of revealing high-impact behavioral and security failures. To evaluate DNN fuzzing papers on this metric, we examine the types of failures the fuzzer is designed to discover. Specifically, we assess whether the fuzzer uncovers failures that violate user intent or safety expectations, producing undesired or unsafe outputs such as hallucinated, biased, toxic, or otherwise unsafe responses. We also examine whether the fuzzer uncovers failures in models equipped with explicit defense mechanisms, such as adversarially trained or safety-aligned models, by bypassing safety, policy, or security safeguards to induce high-severity violations (e.g., jailbreaks, data leakage, or unauthorized actions).",
    "values": {
      "High": "Uncovers high-impact failures by bypassing explicit safety, policy, or security mechanisms in defended or safety-aligned models, inducing violations such as jailbreaks, data leakage, unauthorized actions, or other security-critical behaviors.",
      "Medium": "Uncovers failures that violate user intent or safety expectations, producing undesired outputs such as hallucinated, biased, toxic, or otherwise unsafe behavior, without bypassing explicit safety or security mechanisms.",
      "Low": "Uncovers only model robustness errors, such as mispredictions or prediction inconsistencies under semantically preserving input mutations, without exposing unsafe behavior or bypassing safety or security mechanisms."
    }
  },
  "Targeted Attack Discovery": {
    "description": "Many adversarial threat models define the attacker's goal as modifying inputs to induce specific attacker-chosen outputs (e.g., unsafe content labeled as safe or speech transcribed to chosen commands). In this context, targeted attack discovery captures whether and how a DNN fuzzer steers its exploration toward predefined target outcomes, as opposed to discovering faults without an explicit output objective. We therefore include the targeted attack discovery metric to assess whether DNN fuzzers pursue this security testing objective, and to examine which fuzzing design choices enable or limit the discovery of such targeted attacks. To evaluate DNN fuzzing papers on this metric, we examine whether the fuzzing design supports targeted attack discovery and how it steers exploration toward targeted failures. Specifically, we assess whether the fuzzer aims to discover fault-triggering inputs that induce attacker-chosen outputs (e.g., specific labels, phrases, or prohibited responses), broader classes of security-violating behavior (e.g., jailbreaks, toxicity, unsafe behaviors), or performs untargeted exploration that discovers generic faults such as misclassifications or prediction inconsistencies. To support consistent evaluation under this metric, we examine how the fuzzer's mutation, exploration, and oracle components contribute to or constrain its ability to steer exploration toward targeted outcomes.",
    "values": {
      "High": "Supports targeted attack discovery by steering exploration toward specific, predefined target outcomes (e.g., inducing a classifier to predict a chosen label, or a speech model to output a particular phrase or command).",
      "Medium": "Supports discovery of broader classes of security-violating behaviors (e.g., jailbreaks, toxicity, bias, or other unsafe behaviors), rather than a specific model output.",
      "Low": "Performs untargeted exploration, revealing generic failures such as misclassifications, inaccuracies, or inconsistent predictions."
    }
  },
  "Input Plausibility": {
    "description": "Failures induced by unrealistic or semantically invalid inputs may trigger incorrect model behavior, but they fall outside realistic threat models. For instance, a stop sign that is slightly brightened or rotated remains visually plausible within the driving domain, whereas one corrupted by heavy random noise or visually incoherent textures does not. Similarly, speech containing mild background hiss remains intelligible and natural, while severely distorted or robotic utterances fall outside realistic speech conditions. This metric evaluates how such plausibility constraints are incorporated into the fuzzing design, and whether discovered failures are induced by inputs that remain realistic within the task domain. In DNN fuzzing, fault-revealing inputs are typically produced through iterative, input-centric mutation of seed inputs. While individual mutation steps in DNN fuzzers often constrain perturbations relative to their immediate parent input (e.g., via bounded perturbation magnitude), such perturbations can accumulate across fuzzing iterations, causing the mutated input to drift substantially from the original seed. As a result, the final fault-inducing input generated by the fuzzer may exhibit visible or audible artifacts that no longer resemble realistic instances from the task domain. Without mechanisms that enforce input plausibility throughout fuzzing iterations, either by restricting mutations to a plausible input distribution or by filtering implausible candidates, fuzzing may surface failures that arise only under unrealistic input conditions. To evaluate DNN fuzzing papers on this metric, we examine whether mutation design ensures that mutated inputs remain within a plausible input distribution throughout fuzzing iterations.",
    "values": {
      "High": "Enforces input plausibility throughout fuzzing iterations by ensuring that mutated inputs remain within a plausible input distribution.",
      "Medium": "Enforces plausibility between successive mutation steps through bounded or rule-based constraints but overlooks cumulative effects across fuzzing iterations.",
      "Low": "Does not enforce input plausibility, allowing unrealistic or semantically invalid inputs to induce failures."
    }
  },
  "Failure Reproducibility": {
    "description": "During fuzzing, small numerical changes introduced by mutation may induce failures that manifest only in memory and disappear after standard I/O operations such as rounding, quantization, or encoding into common storage formats (e.g., PNG, JPEG, WAV, MP3). Failures that do not persist under such operations reflect fragile numerical artifacts rather than genuine model vulnerabilities. Using this metric, we assess whether failures discovered by DNN fuzzing remain reproducible after standard I/O operations. In DNN fuzzing, mutations are typically applied to floating-point tensor representations of inputs. As a result, fault-inducing perturbations introduced during mutation may exist only in memory and be lost when inputs are saved to disk. For example, an image pixel value may change from 155.0 to 155.4 after mutation; when serialized as an 8-bit image, this value is rounded to 155, eliminating the perturbation that triggered the failure. Similarly, if a mutation changes a pixel value from 254 to 256, it will be clipped to 255 during serialization, potentially altering the perturbation responsible for the observed failure. Small numerical perturbations, such as noise in image pixels or audio waveform amplitudes, are highly susceptible to loss during quantization, as shown in prior work. Semantic-preserving metamorphic transformations, such as rotation and brightness adjustment in images or speed and background noise modifications in audio, are widely used as mutation operators in DNN fuzzing. Unlike fine-grained numerical perturbations, these transformations introduce coarse, perceptible changes to the input while preserving semantic meaning, and therefore rely less on precise numerical values. However, our empirical analysis shows that a non-trivial fraction of failures discovered by metamorphic-transformation-based fuzzers do not persist after serialization when I/O effects are not explicitly considered. We therefore treat metamorphic mutations as providing only partial enforcement of failure reproducibility in the absence of explicit handling of serialization effects. To evaluate DNN fuzzing papers on this metric, we examine whether the fuzzing design explicitly accounts for I/O effects both during test case generation and when the oracle determines that an input induces a failure. In particular, we assess whether mutation operations constrain perturbations so that standard I/O operations do not remove or alter fault-inducing perturbations. We also consider fuzzing approaches that generate inputs directly in serialized formats (e.g., synthesized speech, rendered images, or generated text) as reproducible under this metric, since such inputs are inherently stable with respect to standard I/O operations.",
    "values": {
      "High": "Ensures that fault-inducing inputs remain reproducible under standard I/O operations by explicitly accounting for serialization effects (e.g., both clipping and quantization/rounding).",
      "Medium": "Accounts for serialization effects only partially, such as by (i) applying semantic-preserving metamorphic transformations without explicit clipping or rounding, or (ii) applying fine-grained perturbations with either clipping or quantization/rounding, but not both.",
      "Low": "Ignores I/O effects completely, allowing fault-inducing perturbations to be lost or altered after standard serialization."
    }
  },
  "Failure Diagnostics": {
    "description": "Understanding why a model fails is an important objective in security testing. We include this metric to assess whether and how DNN fuzzing papers provide diagnostic analysis of discovered failures, beyond merely reporting failure cases. To evaluate DNN fuzzing papers on this metric, we examine whether a fuzzing paper provides diagnostic insight through analysis of observable model internals (e.g., model coverage or neuron characteristics such as activation frequency, layer type, and position). Such signals are widely used in DNN fuzzing to characterize model behavior and have been empirically associated with the discovery of erroneous behaviors. We also examine whether failures are analyzed through statistical failure patterns (e.g., class-level error concentration, distributional proximity to other classes, or correlations between failures and input structure), and whether such analysis connects discovered failures to broader sources of model vulnerability, such as non-robust or spurious features, dataset bias, or overfitting.",
    "values": {
      "High": "Provides diagnostic analysis that explains why failures occur by linking discovered failures to underlying model vulnerabilities (e.g., reliance on non-robust or spurious features, dataset bias, or overfitting).",
      "Medium": "Provides diagnostic insight through analysis of observable model internals (e.g., correlations between failures and model coverage or neuron characteristics), or through statistical analysis of failure patterns.",
      "Low": "Reports discovered failures but does not analyze model internals or failure patterns to provide diagnostic insight."
    }
  },
  "Attack Transferability": {
    "description": "Fault-inducing inputs crafted on a surrogate model often transfer across other models performing the same task, enabling substitute-model attacks against unseen targets without internal access. Such transferable attacks reveal shared vulnerability patterns across model implementations, rather than failures specific to a single model. We include the attack transferability metric to assess whether DNN fuzzing papers enable the discovery of failures that transfer across models. To evaluate DNN fuzzing papers on this metric, we examine whether fault-inducing inputs generated through fuzzing on one model are reused to test other models performing the same task, and whether the fuzzing design incorporates mechanisms that support the discovery of transferable security failures across models.",
    "values": {
      "High": "Fault-inducing inputs generated through fuzzing on one model are reused to test transferability to other models performing the same task, and the fuzzing design incorporates explicit mechanisms to support the discovery of transferable security failures across models.",
      "Medium": "Fault-inducing inputs generated through fuzzing on one model are reused to test transferability to other models, but the fuzzing design is not explicitly aimed at discovering transferable attacks.",
      "Low": "Does not demonstrate whether fault-inducing inputs discovered on one model also trigger failures in other models performing the same task."
    }
  }
}