feat: update .gitignore for Python artifacts, enhance AGENTS.md with operational guidelines, refine architecture documentation, and improve run constraints tests

lhy0718 · lhy0718 · commit d0b2fb6b879b · 2026-03-19T11:22:31.000+09:00
diff --git a/.gitignore b/.gitignore
@@ -40,3 +40,9 @@ pnpm-debug.log*
 
 # Generated media / local previews
 output/
+
+# python
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
diff --git a/AGENTS.md b/AGENTS.md
@@ -31,6 +31,28 @@ Use the docs above as the canonical source for detailed rules.
 
 ---
 
+## Codex Operating Style
+
+- Read this file first, then open the relevant source-of-truth docs before editing behaviorally significant code.
+- Prefer explicit, minimally scoped changes over broad speculative rewrites.
+- Preserve the repository’s governed workflow, auditability, and observable artifacts.
+- When a rule here conflicts with a more detailed rule in the docs, follow the docs.
+
+---
+
+## Skill Usage
+
+Use repository-local skills in `.codex/skills` when the task matches their scope.
+
+- Use the TUI/live-validation skill for interactive behavior, slash-command flows, session reload/resume issues, or anything that must be verified in a real TUI/web flow.
+- Use the research-quality or paper-writing skill for claim-evidence alignment, paper-readiness checks, related-work depth, or genre/readiness downgrades.
+- Use the reproducibility or artifact-validation skill for checkpoint/state consistency, artifact verification, harness expectations, or run-output audits.
+- Use the brief/plan-oriented skill when creating or repairing governed research briefs or execution plans.
+
+If a relevant skill exists, follow it in addition to this file and the docs.
+
+---
+
 ## Working Rules
 
 - Plan briefly before editing when the issue is complex.
@@ -46,18 +68,40 @@ Use the docs above as the canonical source for detailed rules.
 
 ## Workflow Contract
 
-The repository currently operates around a 9-node research workflow with bounded transitions, built-in backtracking, and checkpointed artifacts.
+The repository currently operates around a governed 9-node research workflow with bounded transitions, built-in backtracking, and checkpointed artifacts.
 
-- Do not casually remove, reorder, or redefine existing core workflow stages.
-- If the need is strong, a new node may be added, but only when it clearly improves the research/runtime contract rather than duplicating existing responsibilities.
+- Do not casually add, remove, reorder, or redefine top-level workflow nodes.
+- Treat the 9-node structure as fixed unless there is an explicit contract change reflected in docs, runtime behavior, and validation expectations.
 - Any workflow-structure change must preserve:
   - inspectable state transitions
   - artifact auditability
   - reproducibility
   - review gating
   - claim-ceiling discipline
   - safe backtracking behavior
-- Any structural workflow change must be reflected in the relevant docs, runtime behavior, and validation expectations.
+
+---
+
+## Required Validation Commands
+
+Run the smallest relevant validation set that honestly covers the change.
+
+Core commands:
+
+- `npm run build` — required when changing shipped TypeScript/runtime or web build behavior
+- `npm test` — required for most code changes affecting logic, state handling, workflow control, or CLI/TUI behavior
+- `npm run test:web` — required when changing web UI behavior or web-facing interactive flows
+- `npm run validate:harness` — required when changing harness expectations, governed workflow contracts, issue/brief/review artifacts, or reproducibility-facing validation logic
+
+Targeted smoke checks when relevant:
+
+- `npm run test:smoke:natural-collect`
+- `npm run test:smoke:natural-collect-execute`
+- `npm run test:smoke:all`
+- other targeted smoke scripts in `tests/smoke/`
+
+For interactive defects, do not rely on tests alone if a real TUI/web flow can be run.
+Re-run the same flow that exposed the issue.
 
 ---
 
@@ -155,13 +199,23 @@ Every brief must include:
 - Constraints
 - Plan
 - Research Question
-- Why This Can Be Tested With a Small Real Experiment
+- Why This Can Be Tested With A Small Real Experiment
 - Baseline / Comparator
 - Dataset / Task / Bench
+- Target Comparison
+- Minimum Acceptable Evidence
+- Disallowed Shortcuts
+- Allowed Budgeted Passes
+- Paper Ceiling If Evidence Remains Weak
 - Minimum Experiment Plan
 - Paper-worthiness Gate
 - Failure Conditions
 
+Recommended when uncertainty is material:
+
+- Notes
+- Questions / Risks
+
 Do not treat missing governance fields as harmless omissions for paper-scale work.
 
 ---
@@ -178,6 +232,7 @@ It is also the running log for:
 Keep these categories distinct.
 
 For interactive defects, always prefer:
+
 1. real reproduction
 2. issue logging
 3. smallest plausible root-cause fix
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -2,43 +2,136 @@
 
 This document captures the runtime contracts that must remain stable while improving quality enforcement.
 
-## 1) Fixed workflow contract
+## 1) Governed workflow contract
 
-AutoLabOS runs a fixed 9-node research workflow:
+AutoLabOS operates around a governed 9-node research workflow:
 
 `collect_papers -> analyze_papers -> generate_hypotheses -> design_experiments -> implement_experiments -> run_experiments -> analyze_results -> review -> write_paper`
 
-Do not add, remove, or reorder top-level nodes without an explicit contract change.
+This 9-node structure is the default top-level workflow contract and must remain stable unless an explicit contract change is made.
+
+Do not casually add, remove, reorder, or redefine top-level nodes.
+
+A top-level workflow change is allowed only when all of the following are true:
+
+- the change clearly improves the research/runtime contract rather than duplicating an existing stage
+- inspectable state transitions are preserved
+- artifact auditability is preserved
+- reproducibility is preserved
+- review gating and claim-ceiling discipline are preserved
+- safe backtracking behavior is preserved
+- the change is reflected consistently in docs, runtime behavior, and validation expectations
+
+Until those conditions are met, treat the 9-node workflow as fixed.
 
 ## 2) Shared runtime surfaces
 
 - TUI (`autolabos`) and local web ops UI (`autolabos web`) share the same interaction/runtime layer.
 - Node execution and transitions are controlled by `StateGraphRuntime`.
 - Approval mode and transition recommendation behavior are part of runtime contracts.
 
-Harness work must preserve both TUI and web behaviors unless a change is explicitly requested.
+Harness and runtime work must preserve both TUI and web behaviors unless a change is explicitly requested.
 
 ## 3) Artifact model
 
-- Run-scoped source of truth: `.autolabos/runs/<run_id>/...`
-- Public mirrored outputs: `outputs/<run-title>-<run_id_prefix>/...`
+- Run-scoped source of truth: `.autolabos/runs/<run-id>/...`
+- Public mirrored outputs: `outputs/<run-id>-...`
 - Checkpoints and run context are persisted under each run directory.
 
 Quality checks should be deterministic and file-based whenever possible.
 
+Public-facing outputs must remain traceable to underlying run artifacts.
+
 ## 4) Node-internal loops are bounded
 
-Internal control loops inside nodes (for example analyze/design/run/analyze/write) are allowed and expected, but they must remain bounded and auditable through artifacts/logs.
+Internal control loops inside nodes are allowed and expected, including loops in analysis, design, implementation, execution, result interpretation, and writing.
+
+However, these loops must remain:
+
+- bounded
+- auditable through artifacts or logs
+- consistent with node purpose
+- non-destructive to top-level workflow clarity
+
+Node-internal iteration must not be used to smuggle in an undeclared top-level workflow redesign.
+
+## 5) Review and paper-readiness contract
+
+`review` is a gate, not a cosmetic pass.
+
+The system must not treat workflow completion, `write_paper` completion, or successful PDF generation as equivalent to paper-ready research.
+
+Top-level progression to paper-writing behavior should preserve the distinction between:
+
+- system completion
+- artifact completion
+- research completion
+- paper readiness
+
+A paper-scale outcome requires evidence beyond successful orchestration, including baseline/comparator presence, real experiment execution, quantitative comparison, and claim-to-evidence linkage.
 
-## 5) Harness engineering goals
+## 6) Research brief contract
+
+A governed run should begin from a research brief that defines the execution contract.
+
+At minimum, the brief structure should align with `docs/research-brief-template.md`, including:
+
+- Topic
+- Objective Metric
+- Constraints
+- Plan
+- Research Question
+- Why This Can Be Tested With A Small Real Experiment
+- Baseline / Comparator
+- Dataset / Task / Bench
+- Target Comparison
+- Minimum Acceptable Evidence
+- Disallowed Shortcuts
+- Allowed Budgeted Passes
+- Paper Ceiling If Evidence Remains Weak
+- Minimum Experiment Plan
+- Paper-worthiness Gate
+- Failure Conditions
+
+Missing governance fields should be treated as execution risks, not harmless omissions.
+
+## 7) Validation surfaces are first-class
+
+The following are first-class validation surfaces for contract enforcement:
+
+- real TUI validation
+- local web validation
+- targeted tests
+- smoke checks
+- harness validation
+- artifact inspection
+- `/doctor` diagnostics when applicable
+
+For interactive defects, real behavior is the primary ground truth.
+Tests and harness checks support but do not replace same-flow revalidation.
+
+## 8) Harness engineering goals
 
 - Turn important quality assumptions into explicit checks.
 - Keep checks cheap enough for routine CI.
-- Fail early on structural incompleteness (missing required artifacts, malformed records).
+- Fail early on structural incompleteness such as missing required artifacts or malformed records.
 - Keep enforcement incremental and compatible with current contracts.
+- Prefer minimal, high-confidence enforcement that improves observability and reproducibility.
+
+## 9) Reproducibility contract
+
+A run should not be treated as trustworthy unless its outputs and transitions can be inspected and rechecked.
+
+When applicable, validation should confirm:
+
+- checkpoint/state consistency
+- consistency between public-facing outputs and run-scoped artifacts
+- observable behavioral change, not only modified code paths
+- explicitly stated remaining validation or reproducibility gaps
 
-## 6) Non-goals for this track
+## 10) Non-goals for this track
 
-- No redesign of product UX.
-- No broad refactor of orchestration/runtime.
+- No redesign of product UX without an explicit product-direction decision.
+- No broad refactor of orchestration/runtime without contract justification.
 - No speculative replacement of existing node logic.
+- No weakening of review gating, evidence discipline, or reproducibility expectations for convenience.
diff --git a/tests/runConstraints.test.ts b/tests/runConstraints.test.ts
@@ -19,12 +19,19 @@ describe("normalizeConstraintProfile", () => {
     expect(profile.collect.publicationTypes).toEqual(["Review"]);
   });
 
-  it("returns no automatic topic-derived candidates when llm queries are absent", () => {
+  it("builds deterministic topic-derived candidates when llm queries are absent", () => {
     const candidates = buildLiteratureQueryCandidates({
       runTopic: "Resource-aware baselines for tabular classification on small public datasets"
     });
 
-    expect(candidates).toEqual([]);
+    expect(candidates).toContainEqual({
+      query: "Resource-aware baselines for tabular classification on small public datasets",
+      reason: "run_topic"
+    });
+    expect(candidates).toContainEqual({
+      query: "baselines for tabular classification",
+      reason: "constraint_stripped"
+    });
   });
 
   it("prefers an explicit requested query and does not append llm-generated fallbacks", () => {
diff --git a/tests/smoke/common.sh b/tests/smoke/common.sh
@@ -29,6 +29,10 @@ smoke_require_expect() {
   fi
 }
 
+smoke_has_expect() {
+  command -v expect >/dev/null 2>&1
+}
+
 smoke_prepare_workspace() {
   rm -rf "$SMOKE_WORK_DIR/.autolabos"
   mkdir -p "$SMOKE_WORK_DIR/.autolabos/runs" "$SMOKE_WORK_DIR/.autolabos/logs"
@@ -202,6 +206,11 @@ smoke_run_expect() {
   expect "$SMOKE_ROOT_DIR/tests/smoke/$exp_name" "$SMOKE_WORK_DIR" "$run_id"
 }
 
+smoke_run_pending_without_expect() {
+  local run_id="$1"
+  python3 "$SMOKE_ROOT_DIR/tests/smoke/pending_smoke_without_expect.py" "$SMOKE_WORK_DIR" "$run_id"
+}
+
 smoke_bib_key_for_prefix() {
   local prefix="$1"
   printf '%s1' "${prefix//[^[:alnum:]]/}"
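
The two helpers added to `tests/smoke/common.sh` above suggest a dispatch pattern: use `expect` when it is installed, otherwise fall back to the Python script. A minimal sketch of that pattern follows. The `smoke_run_pending` wrapper is hypothetical and not part of this commit, the argument order of `smoke_run_expect` is assumed, and the two runner functions are replaced with printing stand-ins so the logic can run in isolation.

```shell
#!/usr/bin/env bash
# Sketch only. smoke_has_expect mirrors the helper added in this commit;
# the two runners below are stand-ins that print instead of invoking
# expect or python3.

smoke_has_expect() {
  command -v expect >/dev/null 2>&1
}

smoke_run_expect() {
  # Stand-in. Assumed argument order: expect script name, then run id.
  printf 'expect:%s:%s\n' "$1" "$2"
}

smoke_run_pending_without_expect() {
  # Stand-in for the python3 fallback, which takes only the run id.
  printf 'fallback:%s\n' "$1"
}

# Hypothetical dispatcher: prefer expect, otherwise use the fallback.
smoke_run_pending() {
  local exp_name="$1"
  local run_id="$2"
  if smoke_has_expect; then
    smoke_run_expect "$exp_name" "$run_id"
  else
    smoke_run_pending_without_expect "$run_id"
  fi
}
```

Keeping the availability check in one predicate lets a caller such as a CI smoke script choose a path without duplicating the `command -v` test, and lets the existing `smoke_require_expect` keep its hard-fail behavior for flows that cannot run without `expect`.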
diff --git a/tests/smoke/pending_smoke_without_expect.py b/tests/smoke/pending_smoke_without_expect.py
diff --git a/tests/smoke/run_ci_smoke.sh b/tests/smoke/run_ci_smoke.sh