
Evaluation and benchmarks #12

@xoxruns

Description


The evaluation part needs improvement.
Benchmarks to test against first:

  • Cybench
  • NYU CTF Bench
  • the XBOW validation benchmarks

Models to test:

  • OpenAI o3 and other reasoning models
  • Anthropic's Claude Sonnet
  • Gemini
  • a smaller open-source model

Validation benchmarks by XBOW

We need scripting that converts each challenge repo's information into the eval_metadata_file.json format consumed by the deadend-agent.
We also need scripting that runs the challenge, retrieves its URL, and writes it into the eval file so the agent can take it into account.
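A minimal sketch of the conversion step. The schema below (field names like `name`, `category`, `challenge_url`) is an assumption, not the actual format the deadend-agent expects; adjust it to the real eval_metadata_file.json spec:

```python
import json
from pathlib import Path


def write_eval_metadata(repo_dir: str, name: str, category: str, url: str) -> Path:
    """Write a minimal eval_metadata_file.json for one challenge.

    NOTE: every field name here is a placeholder -- replace them with
    the schema the deadend-agent actually consumes.
    """
    metadata = {
        "name": name,
        "category": category,
        "challenge_url": url,   # exposed so the agent can reach the running target
        "source_repo": repo_dir,
    }
    out_path = Path(repo_dir) / "eval_metadata_file.json"
    out_path.write_text(json.dumps(metadata, indent=2))
    return out_path
```

The runner script that launches the challenge would call this after the service is up, once the URL is known.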

Metrics

We need specific metrics that capture precise per-run information:

  • how much the run cost
  • which model was used
  • how long the run took
  • how many tokens were consumed (input and output)
  • ...
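The metrics above could be collected in a small per-run record; this is a sketch under assumed field names, not a committed schema:

```python
from dataclasses import asdict, dataclass


@dataclass
class RunMetrics:
    """One evaluation run's bookkeeping (field names are placeholders)."""

    model: str            # which model was used
    cost_usd: float       # how much the run cost
    wall_time_s: float    # how long the run took
    input_tokens: int     # tokens sent to the model
    output_tokens: int    # tokens produced by the model

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def to_dict(self) -> dict:
        """Flatten for logging alongside the eval results."""
        d = asdict(self)
        d["total_tokens"] = self.total_tokens
        return d
```

Keeping the record as a dataclass makes it trivial to dump one JSON line per run next to the benchmark output.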

Metadata

Labels: enhancement (New feature or request)
