## Summary
The `generate()` method in `kokoro_v1.py` performs a full voice tensor round-trip on every TTS request: load from the `.pt` file → deserialize to a `torch.Tensor` → serialize back → write to a new temp file. This creates ~20-30MB of transient allocations per request that fragment the Python heap, causing RSS to grow monotonically and never shrink.
## Reproduction
- Run the CPU container: `docker run -d -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4`
- Monitor RSS: `docker stats kokoro-tts --no-stream`
- Send ~50-100 TTS requests over a few hours
- Observe RSS climbing from ~500MB baseline toward multi-GB without returning
In our case, RSS reached 7.6GB after 2.5 days of moderate use (~50-100 requests/day), triggering the Linux OOM killer on the host.
## Root Cause
In `api/src/inference/kokoro_v1.py`, both `generate()` and `generate_from_tokens()`:
- Call `paths.load_voice_tensor(voice_path, device)`, which reads the entire `.pt` file into a `BytesIO` buffer and deserializes it
- Call `paths.save_voice_tensor(voice_tensor, temp_path)`, which serializes the tensor back and writes it to a NEW temp file
- Repeat this on every request, even when the same voice is used repeatedly
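The round-trip can be illustrated with a simplified stand-in (plain `pickle` in place of `torch.load`/`torch.save`; the function name and file contents here are illustrative, not the project's actual code):

```python
import io
import os
import pickle
import tempfile

def voice_roundtrip(voice_path: str) -> str:
    # Simplified model of the generate() hot path, with pickle standing in
    # for the torch serialization calls.
    # 1. Read the whole file into memory and deserialize it.
    with open(voice_path, "rb") as f:
        buf = io.BytesIO(f.read())
    tensor = pickle.load(buf)

    # 2. Serialize right back and write a brand-new temp file. Note the
    #    default tempfile.gettempdir() location, not the app's temp_file_dir.
    fd, temp_path = tempfile.mkstemp(prefix="temp_voice_", suffix=".pt")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(tensor, f)
    return temp_path  # nothing ever deletes this file

# Demo: every call allocates fresh buffers and leaves one more orphan behind.
fd, src = tempfile.mkstemp(suffix=".pt")
with os.fdopen(fd, "wb") as f:
    pickle.dump([0.0] * 1000, f)

leaked = [voice_roundtrip(src) for _ in range(3)]
print(len(set(leaked)))  # 3 distinct temp_voice_* files
print(all(os.path.dirname(p) == tempfile.gettempdir() for p in leaked))  # True
```

With a real ~10-15MB voice tensor, each call churns through several buffer copies of that size, which is where the heap fragmentation comes from.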
The temp files (`temp_voice_*`) are written to Python's `tempfile.gettempdir()` (system `/tmp`), NOT the app's configured `temp_file_dir`, so the app's `cleanup_temp_files()` never finds or cleans them.
Additionally, `AudioService._writers` in `api/src/services/audio.py` is a class-level dict that accumulates `StreamingAudioWriter` objects on client disconnect or error (the writer key is never removed if `is_last_chunk` is never reached).
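A minimal stand-in for this accumulation pattern (the real `StreamingAudioWriter` is more involved; this only shows how entries pile up in the class-level dict):

```python
class StreamingAudioWriter:
    # Hypothetical minimal stand-in for the real writer object.
    def __init__(self):
        self.buffer = bytearray()

class AudioService:
    _writers = {}  # class-level: shared across every request

    @classmethod
    def convert_audio(cls, request_id, chunk, is_last_chunk=False):
        writer = cls._writers.setdefault(request_id, StreamingAudioWriter())
        writer.buffer.extend(chunk)
        if is_last_chunk:
            del cls._writers[request_id]  # the ONLY place entries are removed
        return bytes(writer.buffer)

# Five clients disconnect mid-stream: is_last_chunk=True never arrives,
# so their writers stay in the class dict for the life of the process.
for i in range(5):
    AudioService.convert_audio(f"req-{i}", b"\x00" * 1024)
print(len(AudioService._writers))  # 5 leaked writers
```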
## Suggested Fixes
- Cache the voice tensor and temp file path in `KokoroV1`, skipping the load/save cycle when the same voice is used again
- Use `settings.temp_file_dir` for all temp files so the cleanup routine can find them
- Add a `finally` block in `AudioService.convert_audio()` to remove the writer key on exception
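A sketch of the three fixes under a simplified API (class and method names are illustrative; the real code would call `load_voice_tensor`/`save_voice_tensor` where the byte copy is):

```python
import os
import tempfile

class KokoroV1:
    # Fixes 1 and 2 (sketch): cache the prepared temp file per voice path,
    # and create it under the configured temp_file_dir so cleanup can see it.
    def __init__(self, temp_file_dir):
        self.temp_file_dir = temp_file_dir   # i.e. settings.temp_file_dir
        self._voice_cache = {}               # voice_path -> temp file path

    def _prepare_voice(self, voice_path):
        cached = self._voice_cache.get(voice_path)
        if cached is not None and os.path.exists(cached):
            return cached                    # same voice again: skip round-trip
        fd, temp_path = tempfile.mkstemp(
            prefix="temp_voice_", suffix=".pt", dir=self.temp_file_dir
        )
        # A plain byte copy stands in for the torch load/save round-trip.
        with os.fdopen(fd, "wb") as dst, open(voice_path, "rb") as src:
            dst.write(src.read())
        self._voice_cache[voice_path] = temp_path
        return temp_path

class AudioService:
    # Fix 3 (sketch): treat any exception as end-of-stream so the writer
    # entry is always removed from the class-level dict.
    _writers = {}

    @classmethod
    def convert_audio(cls, request_id, chunk, is_last_chunk=False):
        writer = cls._writers.setdefault(request_id, bytearray())
        try:
            writer.extend(chunk)
            return bytes(writer) if is_last_chunk else b""
        except Exception:
            is_last_chunk = True             # error: clean up anyway
            raise
        finally:
            if is_last_chunk:
                cls._writers.pop(request_id, None)

# Demo: repeated use of the same voice reuses one temp file...
app_tmp = tempfile.mkdtemp()
voice = os.path.join(app_tmp, "voice.pt")
with open(voice, "wb") as f:
    f.write(b"fake tensor bytes")
model = KokoroV1(temp_file_dir=app_tmp)
first, second = model._prepare_voice(voice), model._prepare_voice(voice)
print(first == second)  # True

# ...and an error mid-stream no longer strands the writer entry.
try:
    AudioService.convert_audio("req-1", None)  # None triggers a TypeError
except TypeError:
    pass
print(len(AudioService._writers))  # 0
```

The cache also bounds steady-state memory: one prepared temp file per distinct voice, instead of one per request.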
## Environment
- Image: `ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4`
- Host: 31GB RAM, Linux 6.17.0
- Usage pattern: ~50-100 TTS requests/day via local API calls