42 | 42 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T18:14:35Z","event_type":"claimed","id":42,"issue_id":"CodeScaleBench-25b.5","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.5\",\"title\":\"Document hybrid canonical evaluation policy for deterministic and answer.json modes\",\"description\":\"Goal\\nMake the intended model explicit in docs so operators and future contributors stop inferring it from scattered task implementations.\\n\\nScope\\n- Update reference docs for task contract and verifier semantics.\\n- Update ops docs for smoke coverage and launch expectations.\\n- Clarify that deterministic verification is universal, while answer.json/artifact support is policy-driven and currently partial outside fully migrated families.\\n\\nWhy\\nThe current repo narrative is directionally right, but it overstates uniformity if read as applying equally to all 275 canonical tasks today.\",\"acceptance_criteria\":\"1. Reference and ops docs describe the canonical hybrid evaluation policy. 2. Docs state what is universal versus family-specific. 3. Docs explain how answer.json-based artifact evaluation relates to deterministic verifier reward for canonical tasks.\",\"status\":\"open\",\"priority\":2,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"} |
43 | 43 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T18:19:04Z","event_type":"closed","id":43,"issue_id":"CodeScaleBench-25b.5","new_value":"Done","old_value":""} |
44 | 44 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T18:19:04Z","event_type":"closed","id":44,"issue_id":"CodeScaleBench-25b","new_value":"all steps complete","old_value":""} |
| 45 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:19:06Z","event_type":"updated","id":45,"issue_id":"CodeScaleBench-2kz","new_value":"{\"notes\":\"2026-03-09 validation pass:\\\\n- Fixed stale task generators/templates so fresh org + SDLC scaffolded tasks now render and smoke clean without one-off harness patches.\\\\n- Temp scaffold validation: org template path renders, contract-check passes, and baseline/sg_only smoke runs produce reward artifacts as expected; feature/refactor scaffold outputs pass contract-only plus baseline/sg_only no-agent smoke.\\\\n- Curated local smoke subsets all passed via exact-selection flow: baseline (ccx-onboard-search-207, element-web-unread-indicators-diverge-fix-001, clickhouse-mergetree-arch-understand-001), sg_only (same trio), artifact_only (ccx-onboard-search-207, bustub-hyperloglog-impl-001, nodebb-plugin-validate-fix-001).\\\\n- Prepared rerun manifests: configs/claude_historical_failure_rerun_mcp_20260309.json and configs/openhands_historical_failure_rerun_baseline_20260309.json.\\\\n- Infra readiness checked: account_health.py status recommends proceed; check_infra.py now passes in current workspace.\\\\nRemaining: launch rerun manifests only after interactive confirmation, then classify any residual failures and decide permanent sentinel coverage.\\n2026-03-09 launch started after explicit confirmation.\\\\n- Claude MCP rerun batch launched via configs/run_selected_tasks.sh in Daytona mode using accounts account1/account2/account4 (account3 held, account5 reserved for OpenHands). Run dirs are rooted at runs/staging/csb_org_onboarding_sonnet_20260309_142738, runs/staging/csb_sdlc_feature_sonnet_20260309_142738, runs/staging/csb_sdlc_fix_sonnet_20260309_142738, runs/staging/csb_sdlc_secure_sonnet_20260309_142738, runs/staging/csb_sdlc_understand_sonnet_20260309_142738 under config mcp-remote-direct. 
Initial live tasks confirmed on disk for ccx-onboard-search-207/208/210.\\\\n- OpenHands baseline sentinel launched via configs/openhands_2config.sh in Daytona mode using account5 only. Run dir: runs/staging/openhands_sonnet46_20260309_142733/baseline-local-direct/.../ccx-onboard-search-212__CDJ962t.\\\\n- Remaining Claude tasks will submit as the 3-slot queue drains.\\\\nNext: monitor task completion/invalids, classify any residual failures, and decide which sentinels stay in permanent smoke coverage.\\n2026-03-09 planning clarification:\\n- SOURCEGRAPH_ACCESS_TOKEN is expected to come from .env.local for operator shells or launcher wrappers; a raw check_harness_readiness.py failure without sourcing .env.local should not be treated as a task-contract regression by itself.\\n- Gemini harness validation is out of scope for the immediate rerun batch; readiness for this bead should be judged against the harnesses actually being used for the reruns.\\n- Keep rerun execution/classification here; track any separate harness-readiness or CI-gating adjustments in a dedicated Beads task.\"}","old_value":"{\"id\":\"CodeScaleBench-2kz\",\"title\":\"Verify harness fixes by rerunning historical Claude/OpenHands failures\",\"description\":\"Run a focused verification batch to prove the current task-contract and harness hardening eliminates the earlier random patch churn.\\n\\nScope:\\n- Claude Code regression sentinels:\\n - mcp_ccx-onboard-search-207\\n - mcp_ccx-onboard-search-208\\n - mcp_ccx-onboard-search-210\\n - mcp_bustub-hyperloglog-impl-001\\n - mcp_django-sensitive-file-exclusion-001\\n - mcp_flink-window-late-data-fix-001\\n - mcp_element-web-unread-indicators-diverge-fix-001\\n - clickhouse-mergetree-arch-understand-001 (confirm Daytona/local routing now that storage metadata was corrected)\\n- OpenHands regression sentinel:\\n - ccx-onboard-search-212\\n\\nAcceptance criteria:\\n- Produce a small rerun manifest or manifests for the tasks above.\\n- Execute the 
reruns once accounts are ready.\\n- Confirm whether each task now completes as a valid run without ad hoc task-specific patches.\\n- Record any remaining failures as either harness bugs, task bugs, or infra issues with exact root cause.\\n- If clean, note which tasks should remain in the smoke/verification matrix as permanent regression sentinels.\\n\",\"notes\":\"2026-03-09 validation pass:\\\\n- Fixed stale task generators/templates so fresh org + SDLC scaffolded tasks now render and smoke clean without one-off harness patches.\\\\n- Temp scaffold validation: org template path renders, contract-check passes, and baseline/sg_only smoke runs produce reward artifacts as expected; feature/refactor scaffold outputs pass contract-only plus baseline/sg_only no-agent smoke.\\\\n- Curated local smoke subsets all passed via exact-selection flow: baseline (ccx-onboard-search-207, element-web-unread-indicators-diverge-fix-001, clickhouse-mergetree-arch-understand-001), sg_only (same trio), artifact_only (ccx-onboard-search-207, bustub-hyperloglog-impl-001, nodebb-plugin-validate-fix-001).\\\\n- Prepared rerun manifests: configs/claude_historical_failure_rerun_mcp_20260309.json and configs/openhands_historical_failure_rerun_baseline_20260309.json.\\\\n- Infra readiness checked: account_health.py status recommends proceed; check_infra.py now passes in current workspace.\\\\nRemaining: launch rerun manifests only after interactive confirmation, then classify any residual failures and decide permanent sentinel coverage.\\n2026-03-09 launch started after explicit confirmation.\\\\n- Claude MCP rerun batch launched via configs/run_selected_tasks.sh in Daytona mode using accounts account1/account2/account4 (account3 held, account5 reserved for OpenHands). 
Run dirs are rooted at runs/staging/csb_org_onboarding_sonnet_20260309_142738, runs/staging/csb_sdlc_feature_sonnet_20260309_142738, runs/staging/csb_sdlc_fix_sonnet_20260309_142738, runs/staging/csb_sdlc_secure_sonnet_20260309_142738, runs/staging/csb_sdlc_understand_sonnet_20260309_142738 under config mcp-remote-direct. Initial live tasks confirmed on disk for ccx-onboard-search-207/208/210.\\\\n- OpenHands baseline sentinel launched via configs/openhands_2config.sh in Daytona mode using account5 only. Run dir: runs/staging/openhands_sonnet46_20260309_142733/baseline-local-direct/.../ccx-onboard-search-212__CDJ962t.\\\\n- Remaining Claude tasks will submit as the 3-slot queue drains.\\\\nNext: monitor task completion/invalids, classify any residual failures, and decide which sentinels stay in permanent smoke coverage.\",\"status\":\"in_progress\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T13:11:58Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T14:29:13Z\"}"} |
| 46 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:19:06Z","event_type":"created","id":46,"issue_id":"CodeScaleBench-rm3","new_value":"","old_value":""} |
| 47 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:19:06Z","event_type":"created","id":47,"issue_id":"CodeScaleBench-aa9","new_value":"","old_value":""} |
| 48 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:23:15Z","event_type":"status_changed","id":48,"issue_id":"CodeScaleBench-aa9","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-aa9\",\"title\":\"Complete canonical validation_result migration for remaining verifier families\",\"description\":\"Finish the canonical verifier-contract migration for the remaining 61 tasks that still emit only reward.txt.\\n\\nCurrent remaining families from configs/canonical_evaluation_audit.json:\\n- ir_checklist: 17\\n- checklist: 16\\n- find_and_prove: 8\\n- f1_hybrid: 7\\n- test_ratio: 6\\n- continuous: 5\\n- f1: 2\\n\\nPlanned approach:\\n- Migrate family-by-family using the same validation_result.v1alpha1 contract now used by oracle_checks and repo_state_heuristic.\\n- Preserve reward.txt compatibility while adding /logs/verifier/validation_result.json.\\n- Keep reward separate from pass semantics and emit invalid_output / verifier_error when the task contract requires it.\\n- Regenerate configs/canonical_evaluation_audit.json after each landed batch.\\n- Run representative python3 scripts/validate_tasks_preflight.py --task ... --contract-only --format json checks per family and python3 scripts/repo_health.py before commit/push.\\n\\nSuggested execution order:\\n1. ir_checklist\\n2. checklist\\n3. find_and_prove\\n4. f1_hybrid\\n5. test_ratio\\n6. continuous\\n7. f1\\n\",\"acceptance_criteria\":\"All remaining tasks in configs/canonical_evaluation_audit.json migrate from structured_output_mode=none to validation_result; representative contract-only preflight checks pass for each remaining family; python3 scripts/repo_health.py passes; changes are committed and pushed on main.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T20:19:06Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T20:19:06Z\"}"} |
| 49 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:23:23Z","event_type":"updated","id":49,"issue_id":"CodeScaleBench-rm3","new_value":"{\"notes\":\"Verified: all harnesses pass readiness (codex, cursor, gemini, copilot, openhands). SG token loads from .env.local (61 chars). Gemini passes but is out of scope for immediate reruns.\"}","old_value":"{\"id\":\"CodeScaleBench-rm3\",\"title\":\"Validate active-harness CI gating before pending rerun batches\",\"description\":\"Track the operational gating work needed before treating the pending reruns as ready for harness-agnostic CI or launch checks.\\n\\nScope:\\n- Treat active harnesses separately from the full registry-wide check when Gemini is not in scope.\\n- Confirm SOURCEGRAPH_ACCESS_TOKEN is sourced from .env.local (or equivalent launcher path) before running readiness checks.\\n- Validate the relevant readiness commands for the immediate rerun work, such as:\\n - python3 scripts/check_harness_readiness.py --harness codex --format json\\n - equivalent checks for other active harnesses as needed\\n- Confirm the previously failed rerun workflow can be gated without requiring unrelated harness credentials.\\n- Document any remaining blocker as either env setup, launcher bug, or harness-specific requirement.\\n\\nThis is separate from task-contract migration work and separate from the historical rerun execution/classification task already tracked in Beads.\\n\",\"acceptance_criteria\":\"The readiness path for the harnesses actually in scope is documented and verified; SOURCEGRAPH_ACCESS_TOKEN is confirmed to load from .env.local for operator shells or launcher wrappers; Gemini is explicitly excluded from the immediate rerun gate; the exact commands to gate and launch the pending reruns are recorded in the issue notes or 
description.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T20:19:06Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T20:19:06Z\"}"} |
| 50 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:23:24Z","event_type":"closed","id":50,"issue_id":"CodeScaleBench-rm3","new_value":"All harnesses pass readiness checks. SG token confirmed from .env.local. Gemini excluded from immediate gate.","old_value":""} |
| 51 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T20:31:37Z","event_type":"closed","id":51,"issue_id":"CodeScaleBench-aa9","new_value":"All 264 active tasks now emit validation_result.json (v1alpha1). 50 tasks migrated across 6 families: ir_checklist(17), checklist(16), f1_hybrid(7), continuous(5), test_ratio(3), f1(2). Commit be8bff87f.","old_value":""} |