Open
Labels: enhancement (New feature or request)
Description
The evaluation part needs improvement.
The benchmarks to test against first:
- Cybench
- NYU CTF
- XBOW validation benchmarks.
The models to test:
- OpenAI's o3 and other reasoning models.
- Anthropic's Sonnet.
- Gemini
- A smaller open source model.
Validation benchmarks by XBOW
We need to add scripting that converts each repo's information into the eval_metadata_file.json format consumed by the deadend-agent.
We also need scripting that runs each challenge, retrieves its URL, and writes that URL to the eval file so the agent can take it into account.
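The conversion step could look roughly like the sketch below. The actual schema of eval_metadata_file.json is not specified in this issue, so the `name`, `description`, and `url` keys, the top-level `challenges` key, and both function names are assumptions for illustration only. The URL is passed in as a plain argument because how a challenge is started is left open here.

```python
import json
from pathlib import Path


def build_eval_entry(challenge: dict, url: str) -> dict:
    """Map one challenge description to a single eval entry.

    The output keys are hypothetical; adapt them to the real
    eval_metadata_file.json schema expected by the agent.
    """
    return {
        "name": challenge["name"],
        "description": challenge.get("description", ""),
        # URL obtained after the challenge has been started.
        "url": url,
    }


def write_eval_metadata(entries: list[dict], out_path: Path) -> None:
    """Write all entries to one JSON file (assumed top-level key: 'challenges')."""
    out_path.write_text(json.dumps({"challenges": entries}, indent=2))
```

A run script would call `build_eval_entry` once per started challenge and then `write_eval_metadata` with the collected entries.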
Metrics
We need specific metrics that capture precise information about each run:
- how much it cost.
- which model was used.
- how much time it took.
- how many tokens were used (input and output).
- ...
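One possible shape for a per-run metrics record is sketched below. The class name, the field names, and the `benchmark` and `challenge` identifiers are assumptions rather than part of the existing codebase. Only cost, model, time, and input/output tokens come from the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunMetrics:
    """Metrics for one agent run on one challenge (hypothetical layout)."""

    model: str            # which model was used
    benchmark: str        # assumed identifier, e.g. "cybench"
    challenge: str        # assumed identifier of the challenge
    cost_usd: float       # how much the run cost
    duration_s: float     # how much time the run took, in seconds
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for appending to a log file."""
        record = asdict(self)
        record["total_tokens"] = self.total_tokens
        return json.dumps(record)
```

Writing one JSON line per run keeps results easy to aggregate across models and benchmarks later.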