reporting: script to check report quality #1523
Conversation
erickgalinkin
left a comment
Mostly looks good. A couple of fixes needed to make it actually run and a few value fixes.
if _is_dev_version(garak_version):
    add_note(
        f"report generated under development garak version {garak_version}, implementation will depend on branch+commit"
    )
Not me triggering this constantly.
NB. add documentation to match pattern in #1569
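For illustration, the dev-version check quoted above could look something like the following sketch. The name `is_dev_version` and the suffix heuristic are assumptions for this example; the actual `_is_dev_version` in the PR may use different logic.

```python
def is_dev_version(version_string: str) -> bool:
    # Hypothetical heuristic: a release version is plain dotted digits
    # (e.g. "0.10.2"), while dev builds carry a non-numeric suffix
    # segment (e.g. "0.10.2.pre1"). Not the PR's actual implementation.
    parts = version_string.split(".")
    return any(not part.isdigit() for part in parts)


print(is_dev_version("0.10.2"))       # release version
print(is_dev_version("0.10.2.pre1"))  # dev/pre-release suffix
```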
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
jmartin-tech
left a comment
Reviewing the code and the list of things checked has raised questions for me about the goal of this utility. I suspect naming is the root cause. When I think of report integrity, I think of the report's internal integrity: validating that the information contained is full and complete according to internally noted criteria. This aligns with some of the noted checks, such as validating that everything requested in the initial config has attempt, eval, and summary data, and ensuring that the counts involved are internally consistent with the evidence that backs them.
However, this script seems to be looking for more than internal consistency; it looks to be expressing something closer to usability in context, or suitability for a purpose, such as considering whether the result is current enough to be testing the state of the art, by raising concerns over things like age, version, and dev version.
Can we find a name that is more clear?
* ✔️ report using dev version
* ✔️ current version is dev version
* ✔️ probe_spec matches probes in attempts
* ✔️ attempt status 1 has matching status 2
This seems like a possible false-positive item; there are probes that emit status 1 entries with multiple status 2 entries. TreeSearchProbe, and possibly IterativeProbe, may not present in ways that match the inferred expectations of this check.
two situations are covered here:
a. asserting that there's >0 status 2 attempt for every status 1 attempt
b. asserting that there's only 1 status 2 attempt for every status 1 attempt
Scenario (a) should not be failed - that means a broken run
We should go for (b) as well, I think. What patterns are there for doing generation but no detection? Can we afford these a new, distinct status?
I think both have the possibility of not being true currently for a valid run:
(a) may not apply to all iterative probe techniques, depending on how the states are managed: some inference calls may be part of the build-up of the technique, and intentional logging of all calls is part of ensuring a proper audit trail. Iterative termination logic could cause some attempts to intentionally not be processed by the final detection phase, though I am not sure it should.
(b) I suspect this is already not true for TreeSearchProbe. Those probes will run detection at each fork using a different detector than the probe's evaluation detector. A similar pattern is expected in many techniques that I suspect will implement IterativeProbe, as the iteration process often needs to perform a detector evaluation of some sort to determine whether iteration should continue.
Your suggestion of another state is a reasonable thing to explore; it would enable making these criteria targets once the state is defined and implemented. They just aren't something reports will conform to today.
Indeed - this tool reveals a bug in TreeSearchProbe as far as I'm concerned. An attempt with complete detection (status 2) shouldn't appear more than once for a given generation result (status 1).
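A rough sketch of the two checks from this thread, (a) at least one status 2 attempt per status 1 attempt, and (b) at most one. The field names `uuid` and `status` are assumptions about the attempt entries in report.jsonl, not verified against the PR's actual schema.

```python
from collections import Counter


def check_status_pairing(attempts):
    """Sketch of checks (a) and (b): attempts is assumed to be a list of
    dicts with "uuid" and "status" keys (hypothetical field names)."""
    notes = []
    status1_uuids = {a["uuid"] for a in attempts if a["status"] == 1}
    status2_counts = Counter(a["uuid"] for a in attempts if a["status"] == 2)
    for uuid in status1_uuids:
        n = status2_counts.get(uuid, 0)
        if n == 0:
            # (a): a status 1 with no status 2 means a broken run
            notes.append(f"attempt {uuid}: status 1 with no status 2")
        elif n > 1:
            # (b): more than one status 2 per status 1 is suspect
            notes.append(f"attempt {uuid}: {n} status 2 entries")
    return notes
```

This treats (a) as the hard failure and (b) as a secondary flag, matching the discussion above.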
Co-authored-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
probes_requested, __rejected = garak._config.parse_plugin_spec(
    configured_probe_spec, "probes"
)
I suspect this is going to be problematic: parsing an older spec with the current codebase will often see divergence. Consider a probe that has been renamed, or even deprecated and removed.
Testing shows that, under this restriction, the output will generate a high-priority entry for each attempt from the renamed or removed probe class.
This also presents a divergence from expectations when a new probe is added or activated for an existing module between versions: the spec parsing will expect the newly active probe to be found in the report.
This indicates to me that the report may need an expanded activated-probes list stored in the report during the run, to provide a way to validate this value consistently using just the report jsonl.
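Validating probes from the report alone, as suggested above, could look roughly like this. The record layout, the `activated_probes` list, and the `entry_type`/`probe_classname` field names are all assumptions for illustration; the real report.jsonl schema may differ.

```python
import json


def probes_in_report(report_path):
    """Sketch: compare a (hypothetical) activated-probes list stored in
    the report against the probes actually seen in attempt entries,
    using only the report jsonl. Field names are assumptions."""
    activated, seen = set(), set()
    with open(report_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if "activated_probes" in entry:
                activated = set(entry["activated_probes"])
            elif entry.get("entry_type") == "attempt":
                seen.add(entry.get("probe_classname"))
    # probes requested but never attempted; probes attempted but not requested
    return activated - seen, seen - activated
```

This sidesteps re-parsing the spec with the current codebase, so renamed or removed probes no longer produce spurious high-priority entries.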
if (
    total_attempts_processed
    != attempt_status_2_per_probe[_probename]
    * generations_requested
):
    add_note(
        f"Eval entry for {_probename} {_detectorname} indicates {total_attempts_processed} instances but there were {attempt_status_2_per_probe[_probename]} status:2 attempts (generations={generations_requested})"
    )
This presents a consistency issue: there are known probes that will not fit this pattern; atkgen.Tox and topic.WordnetControversial are clear examples identified in testing.
agree. made this low prio.
-- splitting this into conditions of "too many" and "too few". Also will re-check the * generations logic; it seems odd that generations would directly change the proportion of status-2 to status-1 attempts.
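Splitting the check into "too few" and "too many" could look like this sketch. Function name and message wording are illustrative; the expected-total formula (`status2_count * generations`) simply mirrors the quoted snippet and, per the note above, may itself need revisiting.

```python
def eval_count_notes(probename, detectorname, total_processed,
                     status2_count, generations):
    """Sketch: report eval-count mismatches as distinct "too few" and
    "too many" conditions rather than a single inequality."""
    expected = status2_count * generations
    if total_processed < expected:
        return (f"low: eval for {probename} {detectorname} covers "
                f"{total_processed} instances, fewer than the {expected} expected")
    if total_processed > expected:
        return (f"low: eval for {probename} {detectorname} covers "
                f"{total_processed} instances, more than the {expected} expected")
    return None  # counts are consistent
```

Distinct messages make it easier to downgrade only the condition that known probes (atkgen.Tox, topic.WordnetControversial) legitimately trigger.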
Conduct a variety of checks and tests to assess the integrity of a garak report.jsonl file. This helps us identify where a report may be broken, deficient, or incorrectly assembled.
Inventory of tests:
* probe_spec matches probes present in attempts
* passed+nones is not more than total responses graded in eval entries
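The last inventory item could be sketched as below. The field names `passed`, `nones`, and `total` are assumptions about the eval-entry schema, not taken from the PR.

```python
def eval_totals_ok(eval_entry):
    """Sketch: passed + nones must not exceed the total responses
    graded in an eval entry (field names are hypothetical)."""
    passed = eval_entry.get("passed", 0)
    nones = eval_entry.get("nones", 0)
    total = eval_entry.get("total", 0)
    return passed + nones <= total
```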