Commit 5041ad3

sjarmak and claude committed
refactor: trim to 275 canonical tasks, archive non-canonical to backups/
- Remove 5 protonmail/webclients SWE-bench Pro tasks (can't run on Daytona, score 0.0)
- Move 156 non-canonical tasks from suite dirs to benchmarks/backups/
- Update unified_benchmark_manifest.json: 280 → 275 tasks
- Update README.md and benchmarks/README.md with correct suite counts (131 SDLC + 144 Org)
- Clean up related configs (ground_truth_files, mirror_creation_manifest, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 204b408 commit 5041ad3

3,021 files changed, +37862 -4107 lines changed

README.md

Lines changed: 54 additions & 54 deletions
@@ -75,37 +75,37 @@ Nine suites organized by software development lifecycle phase:

 | Suite | SDLC Phase | Tasks | Description |
 |-------|-----------|------:|-------------|
-| `csb_sdlc_fix` | Bug Repair | 26 | Diagnosing and fixing real bugs across production codebases |
 | `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
-| `csb_sdlc_debug` | Debugging & Investigation | 18 | Root cause tracing, fault localization, provenance |
-| `csb_sdlc_test` | Testing & QA | 18 | Code review, performance testing, code search validation, test generation |
-| `csb_sdlc_refactor` | Cross-File Refactoring | 16 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
-| `csb_sdlc_design` | Architecture & Design | 14 | Architecture analysis, dependency graphs, change impact |
-| `csb_sdlc_document` | Documentation | 13 | API references, architecture docs, migration guides, runbooks |
-| `csb_sdlc_secure` | Security & Compliance | 12 | CVE analysis, reachability, governance, access control |
-| `csb_sdlc_understand` | Requirements & Discovery | 10 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
-| **Total** | | **150** | |
+| `csb_sdlc_fix` | Bug Repair | 19 | Diagnosing and fixing real bugs across production codebases |
+| `csb_sdlc_refactor` | Cross-File Refactoring | 18 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
+| `csb_sdlc_debug` | Debugging & Investigation | 13 | Root cause tracing, fault localization, provenance |
+| `csb_sdlc_secure` | Security & Compliance | 13 | CVE analysis, reachability, governance, access control |
+| `csb_sdlc_test` | Testing & QA | 12 | Code review, performance testing, code search validation, test generation |
+| `csb_sdlc_design` | Architecture & Design | 11 | Architecture analysis, dependency graphs, change impact |
+| `csb_sdlc_document` | Documentation | 11 | API references, architecture docs, migration guides, runbooks |
+| `csb_sdlc_understand` | Requirements & Discovery | 11 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
+| **Total** | | **131** | |

 ## CodeScaleBench-Org

 Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.

 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
-| `csb_org_onboarding` | Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
-| `csb_org_migration` | Framework Migration | 26 | API migrations, breaking changes across repos |
-| `csb_org_security` | Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
-| `csb_org_crossrepo_tracing` | Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `csb_org_domain` | Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
-| `csb_org_incident` | Incident Debugging | 20 | Error-to-code-path tracing across microservices |
-| `csb_org_compliance` | Compliance | 18 | Standards adherence, audit, and provenance workflows |
-| `csb_org_platform` | Platform Knowledge | 18 | Service template discovery and tribal knowledge |
-| `csb_org_crossorg` | Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
-| `csb_org_org` | Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
-| `csb_org_crossrepo` | Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
-| **Total** | | **220** | |
-
-**Combined canonical benchmark: 370 tasks** (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in `benchmarks/backups/`.
+| `csb_org_migration` | Framework Migration | 25 | API migrations, breaking changes across repos |
+| `csb_org_compliance` | Compliance | 13 | Standards adherence, audit, and provenance workflows |
+| `csb_org_incident` | Incident Debugging | 13 | Error-to-code-path tracing across microservices |
+| `csb_org_platform` | Platform Knowledge | 13 | Service template discovery and tribal knowledge |
+| `csb_org_security` | Vulnerability Remediation | 13 | CVE mapping, missing auth middleware across repos |
+| `csb_org_crossorg` | Cross-Org Discovery | 12 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_crossrepo` | Cross-Repo Discovery | 11 | Cross-repo search, dependency discovery, impact analysis |
+| `csb_org_crossrepo_tracing` | Dependency Tracing | 11 | Cross-repo dependency chains, blast radius, symbol resolution |
+| `csb_org_domain` | Domain Lineage | 11 | Domain-specific lineage and analysis workflows |
+| `csb_org_onboarding` | Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
+| `csb_org_org` | Organizational Context | 11 | Agentic discovery, org-wide coding correctness |
+| **Total** | | **144** | |
+
+**Combined canonical benchmark: 275 tasks** (131 SDLC across 9 suites + 144 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform sizing. Non-canonical tasks are archived in `benchmarks/backups/`.

 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
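
The README text in this hunk attributes suite sizes to Neyman-optimal allocation, but the diff does not show the inputs used. For orientation only, the textbook form of Neyman allocation for stratified sampling is given below. Treating each suite as a stratum $h$, and the symbols $n$, $N_h$, and $S_h$, are assumptions made for illustration; none of them is taken from the repository.

```latex
% Neyman allocation: sample size n_h for stratum h out of H strata.
%   n    total sample size (here, plausibly the 275 canonical tasks)
%   N_h  population size of stratum h (candidate tasks in suite h; assumed)
%   S_h  within-stratum standard deviation of the outcome (assumed)
n_h = n \cdot \frac{N_h S_h}{\sum_{k=1}^{H} N_k S_k}
```

Under this rule, strata that are larger or more variable receive more of the sample, which would be consistent with the non-uniform suite sizes in the tables above.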

@@ -135,27 +135,27 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con

 ```
 benchmarks/ # Task definitions organized by SDLC phase + Org
-csb_sdlc_fix/ # Bug Repair (26 tasks)
 csb_sdlc_feature/ # Feature Implementation (23 tasks)
-csb_sdlc_debug/ # Debugging & Investigation (18 tasks)
-csb_sdlc_test/ # Testing & QA (18 tasks)
-csb_sdlc_refactor/ # Cross-File Refactoring (16 tasks)
-csb_sdlc_design/ # Architecture & Design (14 tasks)
-csb_sdlc_document/ # Documentation (13 tasks)
-csb_sdlc_secure/ # Security & Compliance (12 tasks)
-csb_sdlc_understand/ # Requirements & Discovery (10 tasks)
-backups/ # Archived backup tasks (28 total)
-csb_org_onboarding/ # Org: onboarding (28 tasks)
-csb_org_migration/ # Org: framework migration (26 tasks)
-csb_org_security/ # Org: vulnerability remediation (24 tasks)
-csb_org_crossrepo_tracing/ # Org: dependency tracing (22 tasks)
-csb_org_domain/ # Org: domain lineage (20 tasks)
-csb_org_incident/ # Org: incident debugging (20 tasks)
-csb_org_compliance/ # Org: compliance & audit (18 tasks)
-csb_org_platform/ # Org: platform knowledge (18 tasks)
-csb_org_crossorg/ # Org: cross-org discovery (15 tasks)
-csb_org_org/ # Org: org context (15 tasks)
-csb_org_crossrepo/ # Org: cross-repo discovery (14 tasks)
+csb_sdlc_fix/ # Bug Repair (19 tasks)
+csb_sdlc_refactor/ # Cross-File Refactoring (18 tasks)
+csb_sdlc_debug/ # Debugging & Investigation (13 tasks)
+csb_sdlc_secure/ # Security & Compliance (13 tasks)
+csb_sdlc_test/ # Testing & QA (12 tasks)
+csb_sdlc_design/ # Architecture & Design (11 tasks)
+csb_sdlc_document/ # Documentation (11 tasks)
+csb_sdlc_understand/ # Requirements & Discovery (11 tasks)
+csb_org_migration/ # Org: framework migration (25 tasks)
+csb_org_compliance/ # Org: compliance & audit (13 tasks)
+csb_org_incident/ # Org: incident debugging (13 tasks)
+csb_org_platform/ # Org: platform knowledge (13 tasks)
+csb_org_security/ # Org: vulnerability remediation (13 tasks)
+csb_org_crossorg/ # Org: cross-org discovery (12 tasks)
+csb_org_crossrepo/ # Org: cross-repo discovery (11 tasks)
+csb_org_crossrepo_tracing/ # Org: dependency tracing (11 tasks)
+csb_org_domain/ # Org: domain lineage (11 tasks)
+csb_org_onboarding/ # Org: onboarding (11 tasks)
+csb_org_org/ # Org: org context (11 tasks)
+backups/ # Archived non-canonical tasks
 configs/ # Run configs and task selection
 _common.sh # Shared infra: token refresh, parallel execution, multi-account
 sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
@@ -169,7 +169,7 @@ configs/ # Run configs and task selection
 test_2config.sh # Phase wrapper: Test (20 tasks)
 run_selected_tasks.sh # Unified runner for all tasks
 validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
-selected_benchmark_tasks.json # Canonical task selection: 370 tasks (150 SDLC + 220 Org)
+selected_benchmark_tasks.json # Canonical task selection: 275 tasks (131 SDLC + 144 Org)
 use_case_registry.json # 100 GTM use cases (Org task source)
 archive/ # Pre-SDLC migration scripts (preserved for history)
 scripts/ # Metrics extraction, evaluation, and operational tooling
@@ -293,10 +293,10 @@ This section assumes Harbor is already installed and configured. If not, start w

 ### SDLC Tasks

-The unified runner executes all 370 canonical tasks across the 2-config matrix:
+The unified runner executes all 275 canonical tasks across the 2-config matrix:

 ```bash
-# Run all 370 tasks across 2 configs
+# Run all 275 tasks across 2 configs
 bash configs/run_selected_tasks.sh

 # Run only the baseline config
@@ -312,15 +312,15 @@ bash configs/run_selected_tasks.sh --dry-run
 Per-phase runners are also available:

 ```bash
-bash configs/fix_2config.sh # 26 Bug Repair tasks
 bash configs/feature_2config.sh # 23 Feature Implementation tasks
-bash configs/debug_2config.sh # 18 Debugging & Investigation tasks
-bash configs/test_2config.sh # 18 Testing & QA tasks
-bash configs/refactor_2config.sh # 16 Cross-File Refactoring tasks
-bash configs/design_2config.sh # 14 Architecture & Design tasks
-bash configs/document_2config.sh # 13 Documentation tasks
-bash configs/secure_2config.sh # 12 Security & Compliance tasks
-bash configs/understand_2config.sh # 10 Requirements & Discovery tasks
+bash configs/fix_2config.sh # 19 Bug Repair tasks
+bash configs/refactor_2config.sh # 18 Cross-File Refactoring tasks
+bash configs/debug_2config.sh # 13 Debugging & Investigation tasks
+bash configs/secure_2config.sh # 13 Security & Compliance tasks
+bash configs/test_2config.sh # 12 Testing & QA tasks
+bash configs/design_2config.sh # 11 Architecture & Design tasks
+bash configs/document_2config.sh # 11 Documentation tasks
+bash configs/understand_2config.sh # 11 Requirements & Discovery tasks
 ```

 ### Filtering by Suite
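
As a quick sanity check on the totals this commit writes into the README, the sketch below sums the per-suite task counts from the updated tables. The helper name `sum_counts` is hypothetical, and the numbers are copied by hand from the diff above rather than read from `selected_benchmark_tasks.json`, whose schema is not shown here.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the sum of its integer arguments.
sum_counts() {
  total=0
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# SDLC suites: fix, feature, refactor, debug, secure, test, design, document, understand
sdlc_total=$(sum_counts 19 23 18 13 13 12 11 11 11)

# Org suites: migration, compliance, incident, platform, security, crossorg,
# crossrepo, crossrepo_tracing, domain, onboarding, org
org_total=$(sum_counts 25 13 13 13 13 12 11 11 11 11 11)

echo "SDLC: $sdlc_total  Org: $org_total  Combined: $((sdlc_total + org_total))"
# prints: SDLC: 131  Org: 144  Combined: 275
```

The sums match the 131 + 144 = 275 figures stated in the commit message and the README.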
