fix(terminal): filter terminal query sequences from captured output#2245
fix(terminal): filter terminal query sequences from captured output#2245jpshackelford wants to merge 10 commits intomainfrom
Conversation
API breakage checks (Griffe)Result: Failed Log excerpt (first 1000 characters) |
Agent server REST API breakage checks (OpenAPI)Result: Passed |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable - Core solution is sound and pragmatic, but has test quality issues and scope creep.
Key Insight: The flush_stdin() implementation correctly handles terminal I/O cleanup, but the tests need work and there's an unrelated change mixed in.
Coverage Report •
|
||||||||||||||||||||||||||||||||||||||||||||||||||
Add flush_stdin() function to prevent ANSI escape code responses from terminal queries (like DSR cursor position requests) from leaking to stdin. This prevents garbage characters appearing in the shell prompt or corrupting subsequent input() calls in CLI applications. The fix has three parts: 1. Add flush_stdin() function to the logger module that drains pending stdin data using non-blocking reads with termios 2. Call flush_stdin() after each agent step in LocalConversation.run() 3. Call flush_stdin() before rendering in DefaultConversationVisualizer 4. Register flush_stdin() with atexit for final cleanup The function gracefully handles: - Non-TTY environments (CI, piped commands) - Windows (where termios is not available) - Various error conditions (OSError, termios.error) Fixes #2244 Co-authored-by: openhands <openhands@all-hands.dev>
a05c3c8 to
8730c48
Compare
Co-authored-by: openhands <openhands@all-hands.dev>
Note on API Breakage Check CommentThe API breakage comment above shows "Failed" but this is misleading and not blocking. What's happening:
Why it's not blocking:
Resolution:The comment will auto-correct once a new version is released that includes the |
- Replace useless Windows test that mocked the function being tested with a proper skip that documents Windows CI covers this path - Replace weak import-only integration tests with actual behavioral tests: - Test that flush_stdin is called in visualizer.on_event() - Test that atexit handler is registered - Consolidate importlib.util import at module level Addresses review feedback on PR #2245 Co-authored-by: openhands <openhands@all-hands.dev>
|
@juanmichelini This PR doesn't change prompts but it does impact the way OpenHands manipulates shell commands and tools and I wouldn't want it to have any kind of negative impacts on our benchmarks. Not sure if you have a process for watching higher risk PRs or doing light weight evals on them. You'd definitely have a better eye than me for what could create issues... |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable - Pragmatic fix for a real terminal I/O issue.
Key Insight: Simple, defensive implementation that solves the escape code leak problem using standard Unix patterns. Clean error handling and proper resource cleanup.
Verdict: ✅ LGTM - The implementation is solid, well-documented, and handles edge cases correctly. All previous review concerns have been addressed.
|
@OpenHands Understand this PR and the linked issue it aims to solve, assess the issue yourself. Do a /codereview-roasted on this PR. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
Taste rating: RED (needs improvement) This is a real bug (terminal DSR/OSC query responses landing in the TTY input buffer), and the general “drain pending stdin bytes” approach is pragmatic. But the current implementation has one outright correctness bug in the termios handling, plus a couple of design choices that risk eating legitimate user input and introducing surprising import-time side effects. [CRITICAL ISSUES] (must fix)
[IMPROVEMENT OPPORTUNITIES] (should fix)
[BREAKING CHANGE / BEHAVIOR RISK] (what might you break?)
[SECURITY / CORRECTNESS] (only the real stuff)
VERDICTNeeds rework before I’d be comfortable calling this “done.” The overall approach is fine, but the termios restore bug is a hard correctness failure, and the flush-call placement (especially in the visualizer) is too aggressive for interactive UX. Key insight: the hard part isn’t “drain stdin” — it’s drain only when it’s safe, and don’t accidentally leave the terminal in a modified state. |
We typically figure it out at review time, and yes, we used to run some ~50 instances evals on them. This is a good point... the reviewer agent doesn't know. @OpenHands Read this #2245 (comment) and look up the workflow running the pr-review agent. Understand what prompt file the reviewer uses. Investigate the agent loop in the SDK, and add to the reviewer prompt to not approve PRs that change behavior in a way that risks evaluation performance, leave those to human maintainers. |
|
@enyst You must be using a different model for your roasted review : ) |
|
I'm on it! enyst can track my progress at all-hands.dev |
Yes! That's why I do it! GPT-5.2, as safety net for over-happy Claudes. Granted, lately it probably exaggerates in the other direction, but that's why we haz a use for humans. 😅 |
|
@enyst Give me a signal when you are done looking at it and I'll address the roasted issues and any others. |
|
OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server See the conversationfor more information. |
All good, the agent took the proposed reviewer agent changes to another PR. Re: the roasted review. If you want, we could also reuse that conversation to ask it for a re-assessment of its own roasted review, sometimes it figures out where it exaggerated (or not). Actually let me try it |
|
(OpenHands-GPT-5.2; note that in general, it's so prompt-sensitive that once I used words like "re-assessment" it will strongly push it to give a different answer/framing; still the core stuff should(tm) be apparent) Re-evaluation of my review (section by section)Taste ratingAfter thinking harder: still “needs improvement,” but closer to “yellow/acceptable with a couple fixes” than “red dumpster fire.” The core idea is fine; the main question is whether the implementation is safe and provably correct. [CRITICAL ISSUES] — are they really critical?1) Termios restore bug due to shallow copyClaim: Is it really the case? Yes. Is it as major as I made it sound? Slightly nuanced:
Conclusion: Still a real correctness bug and still worth treating as must-fix, mostly because it’s trivial to fix and avoids terminal-state corruption in edge-but-real environments. 2) “flush before render can drop real user input”Claim: Is it really the case? Yes. The function drains whatever bytes are pending on stdin. If the user types while output is happening (typeahead), those bytes are indistinguishable from terminal query responses at this level and will be eaten. Is it as major as I made it sound? This one depends on product expectations:
Also: calling Conclusion: The risk is real; I’d downgrade it from “definitely critical everywhere” to high-risk behavior change that needs justification or a narrower call site. The “right” fix depends on intended interaction model. [IMPROVEMENT OPPORTUNITIES] — did I overstate them?3) Import-time
|
|
(HUMAN) Last note:
We support user messages sent at any time, including during agent's run, for sure. All yours! |
Manual Testing Results on macOSTesting ApproachTested this PR branch on macOS using the reproduction scripts from the linked gist. Installation command: uv run --with "openhands-sdk @ git+https://github.com/OpenHands/software-agent-sdk.git@fix/stdin-escape-code-leak#subdirectory=openhands-sdk" \
--with openhands-tools python repro_real.pyResults
Observed inline escape codes: These are:
AssessmentThe fix successfully addresses the main bug:
Minor cosmetic issue remaining: Escape codes still appear inline in the terminal output between the Observation and Agent Action rendering. This appears to be a timing issue where terminal responses are echoed to stdout before Next Steps
|
Manual Testing Results on macOSTesting ApproachTested this PR branch on macOS using the reproduction scripts from the linked gist. Installation command: uv run --with "openhands-sdk @ git+https://github.com/OpenHands/software-agent-sdk.git@fix/stdin-escape-code-leak#subdirectory=openhands-sdk" \
--with openhands-tools python repro_real.pyResults
Observed inline escape codes: These are:
AssessmentThe fix successfully addresses the main stdin leak bug:
However, escape codes still appear inline in the terminal output between the Observation and Agent Action rendering. Suggested Next StepsThe inline escape codes appear to be a timing issue - terminal responses are being echoed to stdout before
|
The previous implementation used list(old) which creates a shallow copy. Since old[6] (the cc array) is itself a list, both old[6] and new[6] pointed to the same object. Modifying new[6][VMIN] and new[6][VTIME] also corrupted old[6], making the restore at the end ineffective. This fix uses a comprehension that copies nested lists, ensuring the original termios settings are preserved for proper restoration. Co-authored-by: openhands <openhands@all-hands.dev>
Fix for termios shallow-copy bugBased on the detailed analysis, I've pushed a fix for the termios shallow-copy issue identified in the review. The ProblemThe original code used old = termios.tcgetattr(_sys.stdin)
new = list(old) # ← Shallow copy
new[6][termios.VMIN] = 0 # ← Also modifies old[6]!Since Verified with a quick test: >>> old[6] is new[6]
True # Same object!
>>> new[6][termios.VMIN] = 99
>>> old[6][termios.VMIN]
99 # Oops, old is corrupted tooThe Fix (commit ae64c17)Changed to a proper deep copy using a comprehension that copies nested lists: new = [item[:] if isinstance(item, list) else item for item in old]Now This addresses the "termios shallow-copy/restore" concern from the review. The typeahead loss question is still an open design discussion that needs product input. |
Add two new tests using pseudo-terminals to verify real terminal behavior: 1. test_flush_stdin_restores_termios_settings: Verifies that VMIN/VTIME settings in the cc array are properly restored after flush_stdin. This test would have caught the shallow-copy bug. 2. test_flush_stdin_drains_pending_data: Verifies that flush_stdin actually reads and discards pending escape code data from stdin. These tests are skipped on Windows where the pty module is unavailable. Co-authored-by: openhands <openhands@all-hands.dev>
Added PTY-based tests (commit a8b4af9)Added two new tests using pseudo-terminals to verify real terminal behavior: 1.
|
Typeahead Concern - Problem & Proposed DirectionProblem: Constraints:
Rejected approaches:
Proposed direction: Selective flushing
Will prototype this approach after addressing CI failures. |
- Fix line too long in test assertion message - Fix 'possibly unbound' variable warnings by initializing slave_file - Fix pyright type error by extracting lflag with type assertion Co-authored-by: openhands <openhands@all-hands.dev>
The previous implementation drained ALL pending stdin data, which could discard legitimate user typeahead. This update implements selective flushing: - Parse stdin data byte-by-byte, identifying escape sequences - Discard only recognized CSI (\x1b[...) and OSC (\x1b]...) sequences - Preserve all other data in a module-level buffer - Add get_buffered_input() for SDK to retrieve preserved input - Add clear_buffered_input() to explicitly clear the buffer New helper functions: - _is_csi_final_byte(): Check if byte is CSI sequence terminator - _find_csi_end(): Find end of CSI sequence (or detect incomplete) - _find_osc_end(): Find end of OSC sequence with BEL or ST terminator - _parse_stdin_data(): Separate escape sequences from user input Tests added for all parsing functions and the new buffer API. Addresses typeahead concern from PR review: #2245 (comment) Co-authored-by: openhands <openhands@all-hands.dev>
Selective Flushing Implementation (commit 5c34edb)Based on the typeahead concern discussion, I've implemented selective flushing that preserves user input while discarding terminal escape sequences. How It WorksThe new implementation parses stdin data byte-by-byte:
New Public APIfrom openhands.sdk.logger import get_buffered_input, clear_buffered_input
# Retrieve preserved user input (clears buffer)
buffered = get_buffered_input()
# Explicitly clear the buffer without retrieving
clear_buffered_input()SDK Integration PointWhere the SDK reads user input (e.g., before buffered = get_buffered_input()
user_input = buffered.decode('utf-8', errors='replace') + input()Test CoverageAdded comprehensive tests for:
Design Notes
|
|
Next steps, another round of manual testing. |
|
[Automatic Post]: This PR seems to be currently waiting for review. @enyst @juanmichelini, could you please take a look when you have a chance? |
CLI tools like `gh` send terminal query sequences (DSR, OSC 11, etc.) as part of their progress/spinner UI. When running inside a PTY, these queries get captured as output. When displayed, the terminal processes them and responds, causing visible escape code garbage. Root cause analysis: - Diagnostic script confirmed queries are IN the captured PTY output, not generated by terminal responses - The `gh` command writes cursor position and background color queries Solution: - Add filter_terminal_queries() to strip query sequences that trigger terminal responses while preserving legitimate formatting codes - Apply filter in _get_command_output() before returning to visualizer - Queries filtered: DSR, OSC 10/11/4, DA, DA2, DECRQSS - Preserved: ANSI colors, cursor movement, bold/formatting Fixes: #2244 Co-authored-by: openhands <openhands@all-hands.dev>
These scripts helped identify the root cause of #2244: - diagnose_source.py: Proves escape codes are IN captured PTY output - diagnose_leak.py: Tests stdin flushing (first attempted fix) - diagnose_echo.py: Tests echo suppression (second attempted fix) - test_real_world.py: End-to-end test with agent - manual-testing.md: Testing notes and findings Co-authored-by: openhands <openhands@all-hands.dev>
|
📁 PR Artifacts Notice This PR contains a
|
Manual Testing ReportTest Environment
Root Cause DiscoveryThe diagnostic script Finding: The
These get captured by the PTY and when displayed, the terminal responds to them, causing visible garbage. Fix VerificationAfter implementing the filter, ran
Previous Attempts (documented in .pr/)
The diagnostic scripts in |
|
Next steps. Determine if we can reduce the scope of the code committed PR and still resolve the issue. We may end up closing this PR and opening an alternative if we can simplify. Stay tuned. |
|
Next step test and refine #2334 as a more targeted solution. If that PR fixes the problem, we will close this one without merging. |
Summary
Fixes #2244 - Filter terminal query sequences from captured PTY output to prevent visible escape code garbage.
Problem
When CLI tools like
ghrun inside the SDK's PTY, they send terminal query sequences (DSR, OSC 11, etc.) as part of their progress/spinner UI. These queries get captured as output and when displayed, the terminal processes them and responds, causing visible garbage like:Root Cause Analysis
The diagnostic script
.pr/diagnose_source.pyconfirmed the queries are IN the captured PTY output, not generated by terminal responses to Rich rendering. Theghcommand specifically writes:\x1b[6n(DSR - cursor position query)\x1b]11;?(OSC 11 - background color query)Solution
Added
filter_terminal_queries()in the terminal tool to strip query sequences that trigger terminal responses while preserving legitimate formatting codes (colors, bold, etc.):Applied in
_get_command_output()before returning to the visualizer.Manual Testing
Verified with
uv run python .pr/test_real_world.py:ghpreservedFiles Changed
openhands-tools/.../utils/escape_filter.py(NEW) - Filter implementationopenhands-tools/.../terminal_session.py- Apply filter in output processingtests/tools/terminal/test_escape_filter.py(NEW) - 16 tests.pr/- Diagnostic scripts for reviewersAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:704b4a2-pythonRun
All tags pushed for this build
About Multi-Architecture Support
704b4a2-python) is a multi-arch manifest supporting both amd64 and arm64704b4a2-python-amd64) are also available if needed