
Evaluation and benchmarks #12

@xoxruns

Description


The evaluation part needs improvement.
Benchmarks to test against first:

  • Cybench
  • NYU CTF Bench
  • the XBOW validation benchmarks

Models to test:

  • OpenAI o3 and other reasoning models
  • Anthropic's Claude Sonnet
  • Gemini
  • a smaller open-source model

Validation benchmarks by XBOW

We need scripting that converts each challenge repo's information into the eval_metadata_file.json format consumed by the deadend-agent.
We also need scripting that runs the challenge, retrieves its URL, and writes it into the eval file so the agent can take it into account.
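A minimal sketch of the conversion step. The schema below (field names like `name`, `category`, `challenge_url`) is an assumption, not the actual format the deadend-agent expects; adjust it to the real eval_metadata_file.json spec:

```python
import json
from pathlib import Path


def write_eval_metadata(repo_dir: str, name: str, category: str, url: str) -> Path:
    """Write a minimal eval_metadata_file.json for one challenge.

    NOTE: every field name here is a placeholder -- replace them with
    the schema the deadend-agent actually consumes.
    """
    metadata = {
        "name": name,
        "category": category,
        "challenge_url": url,   # exposed so the agent can reach the running target
        "source_repo": repo_dir,
    }
    out_path = Path(repo_dir) / "eval_metadata_file.json"
    out_path.write_text(json.dumps(metadata, indent=2))
    return out_path
```

The runner script that launches the challenge would call this after the service is up, once the URL is known.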

Metrics

We need specific metrics that capture precise per-run information:

  • how much the run cost
  • which model was used
  • how long the run took
  • how many tokens were consumed (input and output)
  • ...
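The metrics above could be collected in a small per-run record; this is a sketch under assumed field names, not a committed schema:

```python
from dataclasses import asdict, dataclass


@dataclass
class RunMetrics:
    """One evaluation run's bookkeeping (field names are placeholders)."""

    model: str            # which model was used
    cost_usd: float       # how much the run cost
    wall_time_s: float    # how long the run took
    input_tokens: int     # tokens sent to the model
    output_tokens: int    # tokens produced by the model

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def to_dict(self) -> dict:
        """Flatten for logging alongside the eval results."""
        d = asdict(self)
        d["total_tokens"] = self.total_tokens
        return d
```

Keeping the record as a dataclass makes it trivial to dump one JSON line per run next to the benchmark output.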

Metadata

Labels: enhancement (New feature or request)
