[DRAFT] FEAT Dataset Loading Utilities #1201
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT license. | ||
|
|
||
| """Dataset classes for scenario data loading.""" | ||
|
|
||
| from pyrit.scenario.dataset.load_utils import ScenarioDatasetUtils | ||
|
|
||
| __all__ = [ | ||
| "ScenarioDatasetUtils" | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT license. | ||
|
|
||
| from pathlib import Path | ||
| from typing import List | ||
| from pyrit.models import SeedDataset | ||
| from pyrit.common.path import DATASETS_PATH, SCORER_CONFIG_PATH | ||
| from pyrit.datasets.harmbench_dataset import fetch_harmbench_dataset | ||
|
|
||
|
|
||
| class ScenarioDatasetUtils: | ||
| """ | ||
| Set of dataset loading utilities for Scenario class. | ||
| """ | ||
| @classmethod | ||
| def seed_dataset_to_list_str(cls, dataset: Path) -> List[str]: | ||
| seed_prompts: List[str] = [] | ||
|
Contributor
I'm kind of surprised we're using these as plain strings. It loses all the metadata. That means we lose harm categories, for example. How will one query for the results? |
||
| seed_prompts.extend(SeedDataset.from_yaml_file(dataset).get_values()) | ||
| return seed_prompts | ||
|
|
||
| @classmethod | ||
| def get_seed_dataset(cls, which: str) -> SeedDataset: | ||
|
Contributor
`which` is not a common parameter name. `name` seems preferable. |
||
| """ | ||
| Get SeedDataset from shorthand string. | ||
| Args: | ||
| which (str): Which SeedDataset. | ||
| Returns: | ||
| SeedDataset: Desired dataset. | ||
| Raises: | ||
| ValueError: If dataset not found. | ||
| """ | ||
| match which: | ||
| case "harmbench": | ||
| return fetch_harmbench_dataset() | ||
| case _: | ||
| raise ValueError(f"Error: unknown dataset `{which}` provided.") | ||
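The two review comments on this file (keep the metadata instead of flattening to strings, and prefer `name` over `which`) could be addressed along the following lines. This is only a rough sketch under stated assumptions: `Seed`, `_LOADERS`, `_load_demo`, `get_seeds`, and `filter_by_harm_category` are hypothetical names invented for illustration and are not part of PyRIT, and the `"demo"` dataset is placeholder data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Seed:
    """Hypothetical stand-in for a seed prompt; NOT PyRIT's actual class."""

    value: str
    harm_categories: Tuple[str, ...] = ()


def _load_demo() -> List[Seed]:
    # Placeholder data for illustration only.
    return [
        Seed("example prompt A", ("category_x",)),
        Seed("example prompt B", ("category_y",)),
    ]


# Hypothetical registry mapping a shorthand name to a loader function.
_LOADERS: Dict[str, Callable[[], List[Seed]]] = {"demo": _load_demo}


def get_seeds(name: str) -> List[Seed]:
    """Return seeds for a shorthand dataset name, keeping their metadata."""
    try:
        loader = _LOADERS[name]
    except KeyError:
        known = sorted(_LOADERS)
        raise ValueError(f"Unknown dataset `{name}`; expected one of {known}.") from None
    return loader()


def filter_by_harm_category(seeds: List[Seed], category: str) -> List[Seed]:
    """Select seeds by harm category, which plain strings could not support."""
    return [seed for seed in seeds if category in seed.harm_categories]
```

Because the seeds stay structured, a caller can still derive plain strings when needed (`[s.value for s in seeds]`) without losing the ability to filter by harm category first.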
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT license. | ||
|
|
||
| """Tests for the scenarios.ScenarioDatasetUtils class.""" | ||
|
|
This is a good stab at the problem. But I think the route I prefer is to make everything really easy to put in the database (e.g. include an initializer that loads all the scenario datasets) and then just have the scenarios grab from the database.
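To make the database-first suggestion concrete, here is a minimal sketch of the shape it might take. It uses an in-memory `sqlite3` database purely as a stand-in and is not PyRIT's memory API; the `seeds` table, its columns, and the functions `initialize_seed_store` and `seeds_for_scenario` are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

# (dataset_name, value, harm_category): a hypothetical row layout.
SeedRow = Tuple[str, str, Optional[str]]


def initialize_seed_store(conn: sqlite3.Connection, rows: Iterable[SeedRow]) -> None:
    """Initializer step: load all scenario datasets into the database once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seeds ("
        "dataset_name TEXT NOT NULL, value TEXT NOT NULL, harm_category TEXT)"
    )
    conn.executemany(
        "INSERT INTO seeds (dataset_name, value, harm_category) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def seeds_for_scenario(
    conn: sqlite3.Connection,
    dataset_name: str,
    harm_category: Optional[str] = None,
) -> List[str]:
    """Scenario step: query the database instead of loading files directly."""
    query = "SELECT value FROM seeds WHERE dataset_name = ?"
    params: List[str] = [dataset_name]
    if harm_category is not None:
        query += " AND harm_category = ?"
        params.append(harm_category)
    query += " ORDER BY rowid"  # Preserve insertion order.
    return [row[0] for row in conn.execute(query, params)]
```

With this split, a scenario would only need a dataset name and optional filters, and the metadata stays queryable because it lives in the store and is never discarded at load time.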