reporting: script to check report quality #1523
Conversation
erickgalinkin
left a comment
Mostly looks good. A couple of fixes needed to make it actually run and a few value fixes.
if _is_dev_version(garak_version):
    add_note(
        f"report generated under development garak version {garak_version}, implementation will depend on branch+commit"
    )
Not me triggering this constantly.
NB. add documentation to match pattern in #1569
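For illustration, the dev-version check quoted above could look something like the following sketch. The name `is_dev_version` and the suffix heuristic are assumptions for this example; the actual `_is_dev_version` in the PR may use different logic.

```python
def is_dev_version(version_string: str) -> bool:
    # Hypothetical heuristic: a release version is plain dotted digits
    # (e.g. "0.10.2"), while dev builds carry a non-numeric suffix
    # segment (e.g. "0.10.2.pre1"). Not the PR's actual implementation.
    parts = version_string.split(".")
    return any(not part.isdigit() for part in parts)


print(is_dev_version("0.10.2"))       # release version
print(is_dev_version("0.10.2.pre1"))  # dev/pre-release suffix
```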
Co-authored-by: Erick Galinkin <erick.galinkin@gmail.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
jmartin-tech
left a comment
Reviewing the code and the list of things checked has raised questions for me about the goal of this utility. I suspect naming is the root cause. When I think of report integrity, I think of the report's internal integrity: validating that the information contained is full and complete according to internally noted criteria. This aligns with some of the noted checks, such as validating that everything requested in the initial config has attempt, eval, and summary data, and ensuring that the counts involved are internally consistent with the evidence that backs them.
However, this script seems to be looking for more than internal consistency; it looks to be expressing something closer to usability in context, or suitability for a purpose, such as considering whether the result is current enough to be testing the state of the art, by raising concerns over things like age, version, and dev version.
Can we find a name that is more clear?
* ✔️ report using dev version
* ✔️ current version is dev version
* ✔️ probe_spec matches probes in attempts
* ✔️ attempt status 1 has matching status 2
This seems like a possible false-positive item; there are probes that emit status 1 entries with multiple status 2 entries. TreeSearchProbe, and possibly IterativeProbe, may not present in ways that match the inferred expectations of this check.
two situations are covered here:
a. asserting that there's >0 status 2 attempt for every status 1 attempt
b. asserting that there's only 1 status 2 attempt for every status 1 attempt
Scenario (a) should not be failed - that means a broken run
We should go for (b) as well, I think. What patterns are there for doing generation but no detection? Can we afford these a new, distinct status?
I think both have the possibility of not being true currently for a valid run:
(a) may not apply to all iterative probe techniques, depending on how the states are managed: some inference calls may be part of the build-up of the technique, and intentional logging of all calls is part of ensuring a proper audit trail. Iterative termination logic could cause some attempts to intentionally not be processed by the final detection phase, though I am not sure it should.
(b) I suspect this is already not true for TreeSearchProbe. Those probes will run detection at each fork using a different detector than the probe's evaluation detector. A similar pattern is expected in many techniques that I suspect will implement IterativeProbe, as the iteration process often needs to perform a detector evaluation of some sort to determine whether iteration should continue.
Your suggestion of another state is a reasonable thing to explore; it would enable making these criteria targets once the state is defined and implemented. They just aren't something reports will conform to today.
Indeed - this tool reveals a bug in TreeSearchProbe as far as I'm concerned. An attempt with complete detection (status 2) shouldn't appear more than once for a given generation result (status 1).
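A rough sketch of the two checks from this thread, (a) at least one status 2 attempt per status 1 attempt, and (b) at most one. The field names `uuid` and `status` are assumptions about the attempt entries in report.jsonl, not verified against the PR's actual schema.

```python
from collections import Counter


def check_status_pairing(attempts):
    """Sketch of checks (a) and (b): attempts is assumed to be a list of
    dicts with "uuid" and "status" keys (hypothetical field names)."""
    notes = []
    status1_uuids = {a["uuid"] for a in attempts if a["status"] == 1}
    status2_counts = Counter(a["uuid"] for a in attempts if a["status"] == 2)
    for uuid in status1_uuids:
        n = status2_counts.get(uuid, 0)
        if n == 0:
            # (a): a status 1 with no status 2 means a broken run
            notes.append(f"attempt {uuid}: status 1 with no status 2")
        elif n > 1:
            # (b): more than one status 2 per status 1 is suspect
            notes.append(f"attempt {uuid}: {n} status 2 entries")
    return notes
```

This treats (a) as the hard failure and (b) as a secondary flag, matching the discussion above.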
Co-authored-by: Patricia Pampanelli <38949950+patriciapampanelli@users.noreply.github.com> Signed-off-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
probes_requested, __rejected = garak._config.parse_plugin_spec(
    configured_probe_spec, "probes"
)
I suspect this is going to be problematic: parsing an older spec with the current codebase will often see divergence. Consider a probe that has been renamed, or even deprecated and removed.
Testing shows that, under this restriction, the output will generate a high-priority entry for each attempt from the renamed or removed probe class.
This also presents a divergence from expectations when a new probe is added or activated for an existing module between versions: the spec parsing will expect the newly active probe to be found in the report.
This indicates to me that the report may need an expanded activated-probes list stored in the report during the run, to provide a way to validate this value consistently using just the report jsonl.
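Validating probes from the report alone, as suggested above, could look roughly like this. The record layout, the `activated_probes` list, and the `entry_type`/`probe_classname` field names are all assumptions for illustration; the real report.jsonl schema may differ.

```python
import json


def probes_in_report(report_path):
    """Sketch: compare a (hypothetical) activated-probes list stored in
    the report against the probes actually seen in attempt entries,
    using only the report jsonl. Field names are assumptions."""
    activated, seen = set(), set()
    with open(report_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if "activated_probes" in entry:
                activated = set(entry["activated_probes"])
            elif entry.get("entry_type") == "attempt":
                seen.add(entry.get("probe_classname"))
    # probes requested but never attempted; probes attempted but not requested
    return activated - seen, seen - activated
```

This sidesteps re-parsing the spec with the current codebase, so renamed or removed probes no longer produce spurious high-priority entries.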
if (
    total_attempts_processed
    != attempt_status_2_per_probe[_probename]
    * generations_requested
):
    add_note(
        f"Eval entry for {_probename} {_detectorname} indicates {total_attempts_processed} instances but there were {attempt_status_2_per_probe[_probename]} status:2 attempts (generations={generations_requested})"
    )
This presents a consistency issue: there are known probes that will not fit this pattern; atkgen.Tox and topic.WordnetControversial are clear examples identified in testing.
agree. made this low prio.
-- splitting this into conditions of "too many" and "too few". Also will re-check the * generations logic; it seems odd that generations would directly change the proportion of status-2 to status-1 attempts.
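Splitting the check into "too few" and "too many" could look like this sketch. Function name and message wording are illustrative; the expected-total formula (`status2_count * generations`) simply mirrors the quoted snippet and, per the note above, may itself need revisiting.

```python
def eval_count_notes(probename, detectorname, total_processed,
                     status2_count, generations):
    """Sketch: report eval-count mismatches as distinct "too few" and
    "too many" conditions rather than a single inequality."""
    expected = status2_count * generations
    if total_processed < expected:
        return (f"low: eval for {probename} {detectorname} covers "
                f"{total_processed} instances, fewer than the {expected} expected")
    if total_processed > expected:
        return (f"low: eval for {probename} {detectorname} covers "
                f"{total_processed} instances, more than the {expected} expected")
    return None  # counts are consistent
```

Distinct messages make it easier to downgrade only the condition that known probes (atkgen.Tox, topic.WordnetControversial) legitimately trigger.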
Conduct a variety of checks and tests to assess the integrity of a garak report.jsonl file. This helps us identify where a report may be broken, deficient, or incorrectly assembled.
Inventory of tests:
* probe_spec matches probes present in attempts
* passed+nones is not more than total responses graded in eval entries
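The last inventory item could be sketched as below. The field names `passed`, `nones`, and `total` are assumptions about the eval-entry schema, not taken from the PR.

```python
def eval_totals_ok(eval_entry):
    """Sketch: passed + nones must not exceed the total responses
    graded in an eval entry (field names are hypothetical)."""
    passed = eval_entry.get("passed", 0)
    nones = eval_entry.get("nones", 0)
    total = eval_entry.get("total", 0)
    return passed + nones <= total
```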