Commit d0b2fb6

feat: update .gitignore for Python artifacts, enhance AGENTS.md with operational guidelines, refine architecture documentation, and improve run constraints tests
1 parent 7108a31 commit d0b2fb6

7 files changed

Lines changed: 316 additions & 20 deletions

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -40,3 +40,9 @@ pnpm-debug.log*

 # Generated media / local previews
 output/
+
+# python
+__pycache__/
+*.pyc
+*.pyo
+*.pyd

AGENTS.md

Lines changed: 60 additions & 5 deletions
@@ -31,6 +31,28 @@ Use the docs above as the canonical source for detailed rules.

 ---

+## Codex Operating Style
+
+- Read this file first, then open the relevant source-of-truth docs before editing behaviorally significant code.
+- Prefer explicit, minimally scoped changes over broad speculative rewrites.
+- Preserve the repository’s governed workflow, auditability, and observable artifacts.
+- When a rule here conflicts with a more detailed rule in the docs, follow the docs.
+
+---
+
+## Skill Usage
+
+Use repository-local skills in `.codex/skills` when the task matches their scope.
+
+- Use the TUI/live-validation skill for interactive behavior, slash-command flows, session reload/resume issues, or anything that must be verified in a real TUI/web flow.
+- Use the research-quality or paper-writing skill for claim-evidence alignment, paper-readiness checks, related-work depth, or genre/readiness downgrades.
+- Use the reproducibility or artifact-validation skill for checkpoint/state consistency, artifact verification, harness expectations, or run-output audits.
+- Use the brief/plan-oriented skill when creating or repairing governed research briefs or execution plans.
+
+If a relevant skill exists, follow it in addition to this file and the docs.
+
+---
+
 ## Working Rules

 - Plan briefly before editing when the issue is complex.
@@ -46,18 +68,40 @@ Use the docs above as the canonical source for detailed rules.

 ## Workflow Contract

-The repository currently operates around a 9-node research workflow with bounded transitions, built-in backtracking, and checkpointed artifacts.
+The repository currently operates around a governed 9-node research workflow with bounded transitions, built-in backtracking, and checkpointed artifacts.

-- Do not casually remove, reorder, or redefine existing core workflow stages.
-- If the need is strong, a new node may be added, but only when it clearly improves the research/runtime contract rather than duplicating existing responsibilities.
+- Do not casually add, remove, reorder, or redefine top-level workflow nodes.
+- Treat the 9-node structure as fixed unless there is an explicit contract change reflected in docs, runtime behavior, and validation expectations.
 - Any workflow-structure change must preserve:
   - inspectable state transitions
   - artifact auditability
   - reproducibility
   - review gating
   - claim-ceiling discipline
   - safe backtracking behavior
-- Any structural workflow change must be reflected in the relevant docs, runtime behavior, and validation expectations.
+
+---
+
+## Required Validation Commands
+
+Run the smallest relevant validation set that honestly covers the change.
+
+Core commands:
+
+- `npm run build` — required when changing shipped TypeScript/runtime or web build behavior
+- `npm test` — required for most code changes affecting logic, state handling, workflow control, or CLI/TUI behavior
+- `npm run test:web` — required when changing web UI behavior or web-facing interactive flows
+- `npm run validate:harness` — required when changing harness expectations, governed workflow contracts, issue/brief/review artifacts, or reproducibility-facing validation logic
+
+Targeted smoke checks when relevant:
+
+- `npm run test:smoke:natural-collect`
+- `npm run test:smoke:natural-collect-execute`
+- `npm run test:smoke:all`
+- other targeted smoke scripts in `tests/smoke/`
+
+For interactive defects, do not rely on tests alone if a real TUI/web flow can be run.
+Re-run the same flow that exposed the issue.

 ---

@@ -155,13 +199,23 @@ Every brief must include:
 - Constraints
 - Plan
 - Research Question
-- Why This Can Be Tested With a Small Real Experiment
+- Why This Can Be Tested With A Small Real Experiment
 - Baseline / Comparator
 - Dataset / Task / Bench
+- Target Comparison
+- Minimum Acceptable Evidence
+- Disallowed Shortcuts
+- Allowed Budgeted Passes
+- Paper Ceiling If Evidence Remains Weak
 - Minimum Experiment Plan
 - Paper-worthiness Gate
 - Failure Conditions

+Recommended when uncertainty is material:
+
+- Notes
+- Questions / Risks
+
 Do not treat missing governance fields as harmless omissions for paper-scale work.

 ---
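
The required-field list in this hunk lends itself to a mechanical check. The following is a minimal TypeScript sketch, not code from the repository: `REQUIRED_BRIEF_FIELDS` and `findMissingBriefFields` are hypothetical names, and the assumption that each field appears as a Markdown heading in the brief is ours. The first two entries (Topic, Objective Metric) fall outside this hunk and are taken from the matching list in the `docs/architecture.md` change.

```typescript
// Field names copied from the commit's docs. Treating each field as a
// Markdown heading inside the brief is an assumption of this sketch.
const REQUIRED_BRIEF_FIELDS: readonly string[] = [
  "Topic",
  "Objective Metric",
  "Constraints",
  "Plan",
  "Research Question",
  "Why This Can Be Tested With A Small Real Experiment",
  "Baseline / Comparator",
  "Dataset / Task / Bench",
  "Target Comparison",
  "Minimum Acceptable Evidence",
  "Disallowed Shortcuts",
  "Allowed Budgeted Passes",
  "Paper Ceiling If Evidence Remains Weak",
  "Minimum Experiment Plan",
  "Paper-worthiness Gate",
  "Failure Conditions",
];

// Returns the required fields that have no matching heading in the brief.
// Matching is case-insensitive, so the "a" vs "A" rename in this commit
// would not cause a false positive.
function findMissingBriefFields(briefMarkdown: string): string[] {
  const headings = briefMarkdown
    .split(/\r?\n/)
    .map((line) => /^#{1,6}\s+(.*?)\s*$/.exec(line))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => (match[1] || "").toLowerCase());

  return REQUIRED_BRIEF_FIELDS.filter(
    (field) => headings.indexOf(field.toLowerCase()) === -1
  );
}
```

A caller could treat a non-empty result as an execution risk for paper-scale work, in line with the rule stated above, rather than silently proceeding.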
@@ -178,6 +232,7 @@ It is also the running log for:
 Keep these categories distinct.

 For interactive defects, always prefer:
+
 1. real reproduction
 2. issue logging
 3. smallest plausible root-cause fix
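
The "Required Validation Commands" section added to AGENTS.md maps kinds of change to npm scripts. One way to picture "the smallest relevant validation set" is a lookup from touched areas to commands. This is an illustrative TypeScript sketch only: the command strings come from the diff, while the `ChangeArea` labels, `COMMANDS_BY_AREA`, and `selectValidationCommands` are hypothetical names that the repository does not define.

```typescript
// Coarse change areas paraphrased from the AGENTS.md bullet list.
// These labels are an assumption of the sketch, not repository vocabulary.
type ChangeArea = "runtime-or-build" | "logic" | "web-ui" | "harness";

// Command strings are copied verbatim from the diff.
const COMMANDS_BY_AREA: Record<ChangeArea, string[]> = {
  "runtime-or-build": ["npm run build"],
  "logic": ["npm test"],
  "web-ui": ["npm run test:web"],
  "harness": ["npm run validate:harness"],
};

// Returns the de-duplicated commands for the touched areas,
// in the order the areas were given.
function selectValidationCommands(areas: ChangeArea[]): string[] {
  const selected: string[] = [];
  for (const area of areas) {
    for (const command of COMMANDS_BY_AREA[area]) {
      if (selected.indexOf(command) === -1) {
        selected.push(command);
      }
    }
  }
  return selected;
}
```

For example, `selectValidationCommands(["web-ui", "logic"])` yields the web and unit test commands but not the build or harness ones. For interactive defects the section still asks for a real TUI/web re-run, which a lookup like this cannot replace.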

docs/architecture.md

Lines changed: 105 additions & 12 deletions
@@ -2,43 +2,136 @@

 This document captures the runtime contracts that must remain stable while improving quality enforcement.

-## 1) Fixed workflow contract
+## 1) Governed workflow contract

-AutoLabOS runs a fixed 9-node research workflow:
+AutoLabOS operates around a governed 9-node research workflow:

 `collect_papers -> analyze_papers -> generate_hypotheses -> design_experiments -> implement_experiments -> run_experiments -> analyze_results -> review -> write_paper`

-Do not add, remove, or reorder top-level nodes without an explicit contract change.
+This 9-node structure is the default top-level workflow contract and must remain stable unless an explicit contract change is made.
+
+Do not casually add, remove, reorder, or redefine top-level nodes.
+
+A top-level workflow change is allowed only when all of the following are true:
+
+- the change clearly improves the research/runtime contract rather than duplicating an existing stage
+- inspectable state transitions are preserved
+- artifact auditability is preserved
+- reproducibility is preserved
+- review gating and claim-ceiling discipline are preserved
+- safe backtracking behavior is preserved
+- the change is reflected consistently in docs, runtime behavior, and validation expectations
+
+Until those conditions are met, treat the 9-node workflow as fixed.

 ## 2) Shared runtime surfaces

 - TUI (`autolabos`) and local web ops UI (`autolabos web`) share the same interaction/runtime layer.
 - Node execution and transitions are controlled by `StateGraphRuntime`.
 - Approval mode and transition recommendation behavior are part of runtime contracts.

-Harness work must preserve both TUI and web behaviors unless a change is explicitly requested.
+Harness and runtime work must preserve both TUI and web behaviors unless a change is explicitly requested.

 ## 3) Artifact model

-- Run-scoped source of truth: `.autolabos/runs/<run_id>/...`
-- Public mirrored outputs: `outputs/<run-title>-<run_id_prefix>/...`
+- Run-scoped source of truth: `.autolabos/runs/<run-id>/...`
+- Public mirrored outputs: `outputs/<run-id>-...`
 - Checkpoints and run context are persisted under each run directory.

 Quality checks should be deterministic and file-based whenever possible.

+Public-facing outputs must remain traceable to underlying run artifacts.
+
 ## 4) Node-internal loops are bounded

-Internal control loops inside nodes (for example analyze/design/run/analyze/write) are allowed and expected, but they must remain bounded and auditable through artifacts/logs.
+Internal control loops inside nodes are allowed and expected, including loops in analysis, design, implementation, execution, result interpretation, and writing.
+
+However, these loops must remain:
+
+- bounded
+- auditable through artifacts or logs
+- consistent with node purpose
+- non-destructive to top-level workflow clarity
+
+Node-internal iteration must not be used to smuggle in an undeclared top-level workflow redesign.
+
+## 5) Review and paper-readiness contract
+
+`review` is a gate, not a cosmetic pass.
+
+The system must not treat workflow completion, `write_paper` completion, or successful PDF generation as equivalent to paper-ready research.
+
+Top-level progression to paper-writing behavior should preserve the distinction between:
+
+- system completion
+- artifact completion
+- research completion
+- paper readiness
+
+A paper-scale outcome requires evidence beyond successful orchestration, including baseline/comparator presence, real experiment execution, quantitative comparison, and claim-to-evidence linkage.

-## 5) Harness engineering goals
+## 6) Research brief contract
+
+A governed run should begin from a research brief that defines the execution contract.
+
+At minimum, the brief structure should align with `docs/research-brief-template.md`, including:
+
+- Topic
+- Objective Metric
+- Constraints
+- Plan
+- Research Question
+- Why This Can Be Tested With A Small Real Experiment
+- Baseline / Comparator
+- Dataset / Task / Bench
+- Target Comparison
+- Minimum Acceptable Evidence
+- Disallowed Shortcuts
+- Allowed Budgeted Passes
+- Paper Ceiling If Evidence Remains Weak
+- Minimum Experiment Plan
+- Paper-worthiness Gate
+- Failure Conditions
+
+Missing governance fields should be treated as execution risks, not harmless omissions.
+
+## 7) Validation surfaces are first-class
+
+The following are first-class validation surfaces for contract enforcement:
+
+- real TUI validation
+- local web validation
+- targeted tests
+- smoke checks
+- harness validation
+- artifact inspection
+- `/doctor` diagnostics when applicable
+
+For interactive defects, real behavior is the primary ground truth.
+Tests and harness checks support but do not replace same-flow revalidation.
+
+## 8) Harness engineering goals

 - Turn important quality assumptions into explicit checks.
 - Keep checks cheap enough for routine CI.
-- Fail early on structural incompleteness (missing required artifacts, malformed records).
+- Fail early on structural incompleteness such as missing required artifacts or malformed records.
 - Keep enforcement incremental and compatible with current contracts.
+- Prefer minimal, high-confidence enforcement that improves observability and reproducibility.
+
+## 9) Reproducibility contract
+
+A run should not be treated as trustworthy unless its outputs and transitions can be inspected and rechecked.
+
+When applicable, validation should confirm:
+
+- checkpoint/state consistency
+- consistency between public-facing outputs and run-scoped artifacts
+- observable behavioral change, not only modified code paths
+- explicitly stated remaining validation or reproducibility gaps

-## 6) Non-goals for this track
+## 10) Non-goals for this track

-- No redesign of product UX.
-- No broad refactor of orchestration/runtime.
+- No redesign of product UX without an explicit product-direction decision.
+- No broad refactor of orchestration/runtime without contract justification.
 - No speculative replacement of existing node logic.
+- No weakening of review gating, evidence discipline, or reproducibility expectations for convenience.
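
Section 1 of `docs/architecture.md` fixes both the names and the order of the nine top-level nodes. A cheap, file-independent way to guard that contract might look like the TypeScript sketch below. The node names are copied from the diff; `WORKFLOW_NODES`, `matchesWorkflowContract`, and `nextNode` are hypothetical names, and the sketch models only the forward order, since the diff does not spell out the backtracking rules enforced by `StateGraphRuntime`.

```typescript
// Node names and order copied from docs/architecture.md, section 1.
const WORKFLOW_NODES = [
  "collect_papers",
  "analyze_papers",
  "generate_hypotheses",
  "design_experiments",
  "implement_experiments",
  "run_experiments",
  "analyze_results",
  "review",
  "write_paper",
] as const;

type WorkflowNode = (typeof WORKFLOW_NODES)[number];

// Hypothetical guard: true only when `candidate` lists the same nine
// nodes in the same order as the documented contract.
function matchesWorkflowContract(candidate: readonly string[]): boolean {
  if (candidate.length !== WORKFLOW_NODES.length) {
    return false;
  }
  return WORKFLOW_NODES.every((node, index) => candidate[index] === node);
}

// Hypothetical helper: the next node on the forward path, or null after
// the final node. Backtracking is deliberately not modeled here.
function nextNode(current: WorkflowNode): WorkflowNode | null {
  const index = WORKFLOW_NODES.indexOf(current);
  return WORKFLOW_NODES[index + 1] ?? null;
}
```

A check of this shape would flag an added, removed, or reordered top-level node, which is the kind of change the document says needs an explicit contract change first.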

tests/runConstraints.test.ts

Lines changed: 9 additions & 2 deletions
@@ -19,12 +19,19 @@ describe("normalizeConstraintProfile", () => {
     expect(profile.collect.publicationTypes).toEqual(["Review"]);
   });

-  it("returns no automatic topic-derived candidates when llm queries are absent", () => {
+  it("builds deterministic topic-derived candidates when llm queries are absent", () => {
     const candidates = buildLiteratureQueryCandidates({
       runTopic: "Resource-aware baselines for tabular classification on small public datasets"
     });

-    expect(candidates).toEqual([]);
+    expect(candidates).toContainEqual({
+      query: "Resource-aware baselines for tabular classification on small public datasets",
+      reason: "run_topic"
+    });
+    expect(candidates).toContainEqual({
+      query: "baselines for tabular classification",
+      reason: "constraint_stripped"
+    });
   });

   it("prefers an explicit requested query and does not append llm-generated fallbacks", () => {
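
The updated test pins down two outputs of `buildLiteratureQueryCandidates` but the diff does not include its implementation. To make the expected behavior concrete, here is a purely hypothetical TypeScript sketch that would satisfy the two assertions for this one topic. The function name `sketchTopicCandidates` and both stripping rules (drop a leading `*-aware` modifier, drop a trailing `on ...` clause) are assumptions reverse-engineered from the test strings, not the repository's logic.

```typescript
// Shape inferred from the assertions in the test; only these two
// `reason` values appear in the diff.
type QueryCandidate = {
  query: string;
  reason: "run_topic" | "constraint_stripped";
};

// Hypothetical stand-in for the real function. It is deterministic:
// the same topic always yields the same candidates, with no LLM call.
function sketchTopicCandidates(runTopic: string): QueryCandidate[] {
  const topic = runTopic.trim();
  if (topic === "") {
    return [];
  }

  const candidates: QueryCandidate[] = [{ query: topic, reason: "run_topic" }];

  // Assumed stripping rules, tuned to the example in the test.
  const stripped = topic
    .replace(/^\S+-aware\s+/i, "") // leading modifier such as "Resource-aware"
    .replace(/\s+on\s+.*$/i, "") // trailing clause such as "on small public datasets"
    .trim();

  if (stripped !== "" && stripped !== topic) {
    candidates.push({ query: stripped, reason: "constraint_stripped" });
  }
  return candidates;
}
```

Note that the test uses `toContainEqual` rather than `toEqual`, so the real function is free to return additional candidates beyond the two this sketch produces.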

tests/smoke/common.sh

Lines changed: 9 additions & 0 deletions
@@ -29,6 +29,10 @@ smoke_require_expect() {
   fi
 }

+smoke_has_expect() {
+  command -v expect >/dev/null 2>&1
+}
+
 smoke_prepare_workspace() {
   rm -rf "$SMOKE_WORK_DIR/.autolabos"
   mkdir -p "$SMOKE_WORK_DIR/.autolabos/runs" "$SMOKE_WORK_DIR/.autolabos/logs"
@@ -202,6 +206,11 @@ smoke_run_expect() {
   expect "$SMOKE_ROOT_DIR/tests/smoke/$exp_name" "$SMOKE_WORK_DIR" "$run_id"
 }

+smoke_run_pending_without_expect() {
+  local run_id="$1"
+  python3 "$SMOKE_ROOT_DIR/tests/smoke/pending_smoke_without_expect.py" "$SMOKE_WORK_DIR" "$run_id"
+}
+
 smoke_bib_key_for_prefix() {
   local prefix="$1"
   printf '%s1' "${prefix//[^[:alnum:]]/}"
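
The two new shell helpers suggest a fallback path: probe for `expect` with `smoke_has_expect`, and run the Python script when it is missing. The diff does not show the call site, so the dispatch below is an assumption, written in TypeScript only to keep this page's added examples in one language. `chooseSmokeRunner` and `buildFallbackCommand` are hypothetical names; the helper names and the script path are copied from the diff.

```typescript
// The two runner helpers visible in tests/smoke/common.sh.
type SmokeRunner = "smoke_run_expect" | "smoke_run_pending_without_expect";

// Assumed policy: prefer the expect-driven runner and fall back to the
// Python-based one only when `expect` is not installed.
function chooseSmokeRunner(hasExpect: boolean): SmokeRunner {
  return hasExpect ? "smoke_run_expect" : "smoke_run_pending_without_expect";
}

// Mirrors the argument order of smoke_run_pending_without_expect:
// script path, then the work directory, then the run id.
function buildFallbackCommand(
  smokeRootDir: string,
  smokeWorkDir: string,
  runId: string
): string[] {
  return [
    "python3",
    smokeRootDir + "/tests/smoke/pending_smoke_without_expect.py",
    smokeWorkDir,
    runId,
  ];
}
```

Returning the command as an argument array rather than one string avoids quoting problems if a directory contains spaces, which is the same concern the shell version handles with double-quoted variables.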
