
reporting: script to check report quality#1523

Open
leondz wants to merge 26 commits intoNVIDIA:mainfrom
leondz:reporting/report_integrity

Conversation


@leondz leondz commented Dec 10, 2025

Conduct a variety of checks and tests to assess the integrity of a garak report.jsonl file

This helps us identify where a report may be broken, deficient, or incorrectly assembled

Inventory of tests:

  • ✔️ garak version match between that used to create report and current garak used for checking
  • ✔️ report using a dev version of garak
  • ✔️ check using a dev version of garak
  • ✔️ inventory described by config's probe_spec matches probes present in attempts
  • ✔️ each attempt status 1 has matching status 2
  • ✔️ all attempts have enough unique generations
  • ✔️ run ID is in setup run IDs
  • ✔️ detection output has correct cardinality in attempt status 2s
  • ✔️ a summary digest object is present
  • ✔️ at least one z-score is listed in the digest
  • ✔️ probes present in summary matches probes requested in config
  • ✔️ the run was completed
  • ✔️ the run is <6 months old (calibration freshness)
  • ✔️ there is at least one eval statement for any probe attempted
  • ✔️ evals are performed over all status 2 attempts
  • ✔️ number of responses graded passed + nones is not more than total reponses graded in eval entries

@leondz leondz requested a review from jmartin-tech December 10, 2025 11:51
@leondz leondz added the reporting Reporting, analysis, and other per-run result functions label Dec 10, 2025

@erickgalinkin erickgalinkin left a comment


Mostly looks good. A couple of fixes needed to make it actually run and a few value fixes.

Comment thread garak/analyze/check_report_integrity.py Outdated
Comment thread garak/analyze/check_report_integrity.py Outdated
Comment on lines +121 to +124
if _is_dev_version(garak_version):
    add_note(
        f"report generated under development garak version {garak_version}, implementation will depend on branch+commit"
    )
Collaborator

Not me triggering this constantly.

Comment thread garak/analyze/check_report_integrity.py Outdated
Comment thread garak/analyze/check_report_integrity.py Outdated
Comment thread garak/analyze/check_report_integrity.py Outdated
@leondz

leondz commented Jan 27, 2026

NB. add documentation to match pattern in #1569

leondz and others added 9 commits January 28, 2026 09:46
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
@leondz leondz requested a review from erickgalinkin January 28, 2026 09:29
@leondz leondz added this to the 0.14.1 milestone Feb 19, 2026
Comment thread garak/analyze/check_report_integrity.py Outdated
Comment thread garak/analyze/check_report_integrity.py Outdated

@jmartin-tech jmartin-tech left a comment


Reviewing the code and the list of things checked has raised questions for me about the goal of this utility. I suspect naming is the root cause. When I think of report integrity, I think of the report's internal integrity: validating that the information it contains is full and complete against its own internally noted criteria. This aligns with some of the noted checks, such as validating that everything requested in the initial config has attempts, eval, and summary data, and ensuring that the counts involved are internally consistent with the evidence that backs them.

However, this script seems to look for more than internal consistency. It also expresses something closer to usability in context, or suitability for a purpose, such as whether the result is current enough to be testing the state of the art, by raising concerns about things like age, version, and dev versions.

Can we find a name that is more clear?

* ✔️ report using dev version
* ✔️ current version is dev version
* ✔️ probe_spec matches probes in attempts
* ✔️ attempt status 1 has matching status 2

@jmartin-tech jmartin-tech Feb 20, 2026


This seems like a possible false-positive item: there are probes that emit status 1 entries with multiple status 2 entries. TreeSearchProbe, and possibly IterativeProbe, may not present in ways that match the inferred expectations of this check.

Collaborator Author


Two situations are covered here:

a. asserting that there is at least one status 2 attempt for every status 1 attempt
b. asserting that there is exactly one status 2 attempt for every status 1 attempt

Scenario (a) should never fail; a failure there means a broken run.

We should go for (b) as well, I think. What patterns are there for doing generation but no detection? Can we afford these a new, distinct status?
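The (a)/(b) pairing conditions can be sketched as a single pass over the attempt records. Field names here ('uuid', 'status') are assumptions about the attempt schema, and the real check may differ.

```python
from collections import Counter


def check_status_pairing(attempts):
    """Check two conditions over attempt records:
    (a) every status-1 attempt has at least one status-2 record
    (b) no status-1 attempt has more than one status-2 record
    Returns a list of human-readable findings."""
    s1 = {a["uuid"] for a in attempts if a.get("status") == 1}
    s2_counts = Counter(a["uuid"] for a in attempts if a.get("status") == 2)
    notes = []
    for uuid in sorted(s1):
        n = s2_counts.get(uuid, 0)
        if n == 0:
            notes.append(f"{uuid}: status 1 with no status 2 (broken run)")
        elif n > 1:
            notes.append(f"{uuid}: {n} status 2 records for one status 1")
    return notes
```

Condition (a) failures would be high priority per the discussion; (b) failures may need to stay low priority until probes like TreeSearchProbe conform.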

Collaborator


I think both have the possibility of not holding for a currently valid run:

(a) may not apply to all iterative probe techniques, depending on how the states are managed: some inference calls may be part of the build-up of the technique, and intentional logging of all calls is part of ensuring a proper audit trail. Iterative termination logic could cause some attempts to intentionally not be processed by the final detection phase, though I am not sure it should.

(b) I suspect this is already not true for TreeSearchProbe. Those probes run detection at each fork, using a different detector than the probe's evaluation detector. A similar pattern is expected in many of the techniques I suspect will implement IterativeProbe, as the iteration process often needs to perform a detector evaluation of some sort to determine whether iteration should continue.

Your suggestion of another state is a reasonable thing to explore; it would let these criteria become targets once the state is defined and implemented. They just aren't something reports will conform to today.

Collaborator Author


Indeed: as far as I'm concerned, this tool reveals a bug in TreeSearchProbe. An attempt with completed detection (status 2) shouldn't appear more than once for a given target result (status 1).

Comment thread garak/analyze/check_report_quality.py
Comment thread garak/analyze/check_report_quality.py
Comment thread garak/analyze/check_report_quality.py
leondz and others added 3 commits March 16, 2026 23:44
Co-authored-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Co-authored-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com>
Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
@leondz leondz requested a review from jmartin-tech March 24, 2026 20:53
Comment thread garak/analyze/check_report_quality.py Outdated
@jmartin-tech jmartin-tech changed the title reporting: script to check report integrity reporting: script to check report quality Mar 31, 2026
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Comment on lines +143 to +145
probes_requested, __rejected = garak._config.parse_plugin_spec(
    configured_probe_spec, "probes"
)
Collaborator


I suspect this is going to be problematic: parsing an older spec with the current codebase will often see divergence. Consider a probe that has been renamed, or even deprecated and removed.

Testing shows that, under this restriction, the output will generate a high-priority entry for each attempt from the renamed or removed probe class.

This also diverges from expectations when a new probe is added or activated for an existing module between versions: the spec parsing will expect the newly active probe to be found in the report.

This indicates to me that the report may need an expanded activated-probes list stored in it during the run, to provide a way to validate this value consistently using just the report jsonl.
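If such an activated-probes list were stored at run time, the check could compare sets without re-parsing the spec under a newer codebase. A hedged sketch, assuming attempts carry a probe_classname field (the stored-list mechanism itself is the reviewer's proposal, not implemented in this PR):

```python
def check_probes_against_stored_list(stored_probes, attempts):
    """Compare probe classnames seen in attempts against an
    activated-probes list stored in the report at run time, so
    renamed or removed probes don't trigger false positives."""
    seen = {a.get("probe_classname") for a in attempts} - {None}
    expected = set(stored_probes)
    notes = []
    for missing in sorted(expected - seen):
        notes.append(f"probe {missing} was activated but has no attempts")
    for extra in sorted(seen - expected):
        notes.append(f"attempts found for unlisted probe {extra}")
    return notes
```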

Comment thread garak/analyze/check_report_quality.py Outdated
Comment on lines +216 to +223
if (
    total_attempts_processed
    != attempt_status_2_per_probe[_probename]
    * generations_requested
):
    add_note(
        f"Eval entry for {_probename} {_detectorname} indicates {total_attempts_processed} instances but there were {attempt_status_2_per_probe[_probename]} status:2 attempts (generations={generations_requested})"
    )
Collaborator


This presents a consistency issue: there are known probes that will not fit this pattern; atkgen.Tox and topic.WordnetControversial are clear examples identified in testing.

Collaborator Author


agree. made this low prio.

Collaborator Author


Splitting this into conditions of "too many" and "too few". I will also re-check the * generations logic; it seems odd that generations would directly change the proportion of status-2 to status-1 attempts.
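The proposed split of the single inequality into directional findings could look like the following. This is a sketch only: it keeps the * generations factor from the snippet under review, which the comment above flags for re-checking, and the function name is hypothetical.

```python
def check_eval_counts(total_graded, status_2_attempts, generations):
    """Split the eval-count equality check into 'too many' and
    'too few' findings so each direction can carry its own priority.
    Returns a finding string, or None if the counts reconcile."""
    expected = status_2_attempts * generations
    if total_graded > expected:
        return f"too many: {total_graded} graded vs {expected} expected"
    if total_graded < expected:
        return f"too few: {total_graded} graded vs {expected} expected"
    return None
```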

@leondz leondz removed this from the 0.14.1 milestone Apr 2, 2026