@npow (Collaborator) commented Oct 22, 2025

Problem

Intermittent JSONDecodeError when multiple environments are resolved concurrently during deployment:

json.decoder.JSONDecodeError: Expecting value: line 1 column 86673 (char 86672)

Root Cause

Race condition in FIFO-based IPC between deployer subprocess and parent process:

  1. Writer side: Subprocess writes JSON to FIFO, but Python's buffered I/O may not flush immediately
  2. Reader side: Parent process reads from FIFO in non-blocking mode
  3. Race: When subprocess exits quickly after close(), reader detects process exit and breaks on empty read
  4. Problem: OS kernel may still have buffered data in pipe that hasn't been delivered yet
  5. Result: Truncated JSON at arbitrary positions (~86KB in the error case)
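
To illustrate the last step: any proper prefix of a JSON object fails to parse, so a reader that stops before EOF surfaces as a JSONDecodeError in the parent. A minimal sketch; the payload and the helper name parses_ok are hypothetical and not taken from the deployer code.

```python
import json


def parses_ok(raw: bytes) -> bool:
    """Return True if raw decodes as a complete JSON document."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical payload standing in for the deployer's JSON message.
payload = json.dumps({"envs": ["a", "b"], "ok": True}).encode()

full = parses_ok(payload)            # the complete message parses
truncated = parses_ok(payload[:-5])  # reader stopped before the last bytes
```

The exact error text depends on where the cut lands, which fits the "arbitrary positions" seen in the report.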

Solution

Changed read_from_fifo_when_ready() to use a hybrid approach:

  1. Start in non-blocking mode (existing behavior)
     • Use select.poll() to wait for data
     • Can detect subprocess failures early
     • Can time out if the subprocess hangs
  2. Switch to blocking mode once the first data arrives
     • Use fcntl() to remove the O_NONBLOCK flag
     • Continue with blocking read() calls
     • POSIX guarantee: a blocking read() returns EOF (0 bytes) ONLY after the writer closes AND all kernel pipe buffers are drained
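
The two phases above can be sketched roughly as follows. This is an illustration under assumptions, not the actual read_from_fifo_when_ready() code: the name read_all_hybrid, its parameters, and the 100 ms poll slice are hypothetical, and the early subprocess-failure check is left out for brevity.

```python
import fcntl
import os
import select
import time


def read_all_hybrid(fd: int, timeout_s: float = 10.0, chunk: int = 65536) -> bytes:
    """Read until EOF from a pipe/FIFO fd that was opened with O_NONBLOCK.

    Hypothetical helper for illustration only.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    deadline = time.monotonic() + timeout_s
    chunks = []

    # Phase 1: non-blocking. Poll in short slices so a hung writer
    # leads to a timeout instead of blocking forever.
    while not chunks:
        if time.monotonic() > deadline:
            raise TimeoutError("no data arrived before the deadline")
        if not poller.poll(100):  # timeout in milliseconds
            continue
        try:
            data = os.read(fd, chunk)
        except BlockingIOError:
            continue  # woke up, but nothing to read yet
        if not data:
            return b""  # writer closed without sending anything
        chunks.append(data)

    # Phase 2: the first bytes have arrived, so clear O_NONBLOCK.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)

    # A blocking read() returns b"" only at EOF, that is, once every
    # writer has closed and the pipe holds no more data.
    while True:
        data = os.read(fd, chunk)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)
```

One trade-off of this shape: once phase 2 starts, the timeout no longer applies, so a writer that sends a partial message and then hangs would block the reader.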

@nflx-mf-bot (Collaborator)

Netflix internal testing[1398] @ 7594e0f

# All data read, exit main loop
break
else:
if len(events):
Contributor

When does this happen? So we got some event (like file close?) and no data?

Collaborator Author

If poll() returned an event (any event) AND read() returned 0 bytes then it must be EOF:

  • If it was POLLIN (data ready), then read() would have returned data
  • If read() returned 0 bytes despite an event, it must be POLLHUP (writer closed) or a stale POLLIN that resolved to EOF
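
The same reasoning can be checked on an ordinary pipe. The sketch below is only an illustration: poll_and_read and its return labels are hypothetical names, and the behavior described is what Linux does for a non-blocking read end.

```python
import os
import select


def poll_and_read(poller, fd, timeout_ms=0, chunk=65536):
    """Classify one poll()+read() step as 'idle', 'data', or 'eof'."""
    events = poller.poll(timeout_ms)
    if not events:
        return "idle", b""  # no event: nothing to conclude yet
    try:
        data = os.read(fd, chunk)
    except BlockingIOError:
        return "idle", b""  # event raced with another reader; try again
    if data:
        return "data", data  # POLLIN with bytes available
    return "eof", b""        # event plus 0 bytes: the writer has closed
```

With no event, the step is inconclusive; with an event and bytes, there is data; only an event combined with a zero-byte read is treated as EOF. A non-blocking read on an empty pipe whose writer is still open raises BlockingIOError rather than returning 0 bytes, which is why the empty read is unambiguous here.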

@savingoyal savingoyal force-pushed the npow/fix-json-decode-error branch from 7594e0f to ba8226f Compare October 30, 2025 17:30
@npow npow requested review from aquarin and romain-intel November 6, 2025 00:59
@romain-intel romain-intel force-pushed the npow/fix-json-decode-error branch from ba8226f to d946e02 Compare November 7, 2025 10:37
@savingoyal savingoyal enabled auto-merge (squash) November 7, 2025 14:43
@savingoyal savingoyal merged commit 533bf1f into master Nov 7, 2025
32 of 37 checks passed
@savingoyal savingoyal deleted the npow/fix-json-decode-error branch November 7, 2025 18:01