## Summary
The `generate()` method in `kokoro_v1.py` performs a full voice tensor round-trip on every TTS request: load from the `.pt` file → deserialize to a `torch.Tensor` → serialize back → write to a new temp file. This creates ~20-30MB of transient allocations per request that fragment the Python heap, causing RSS to grow monotonically and never shrink.
## Reproduction
- Run the CPU container: `docker run -d -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4`
- Monitor RSS: `docker stats kokoro-tts --no-stream`
- Send ~50-100 TTS requests over a few hours
- Observe RSS climbing from ~500MB baseline toward multi-GB without returning
In our case, RSS reached 7.6GB after 2.5 days of moderate use (~50-100 requests/day), triggering the Linux OOM killer on the host.
## Root Cause
In `api/src/inference/kokoro_v1.py`, both `generate()` and `generate_from_tokens()`:
- Call `paths.load_voice_tensor(voice_path, device)`, which reads the entire `.pt` file into a `BytesIO` buffer and deserializes it
- Call `paths.save_voice_tensor(voice_tensor, temp_path)`, which serializes the tensor back and writes it to a NEW temp file
- Repeat this on every request, even when the same voice is used repeatedly
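The round-trip can be illustrated with a simplified stand-in (plain `pickle` in place of `torch.load`/`torch.save`; the function name and file contents here are illustrative, not the project's actual code):

```python
import io
import os
import pickle
import tempfile

def voice_roundtrip(voice_path: str) -> str:
    # Simplified model of the generate() hot path, with pickle standing in
    # for the torch serialization calls.
    # 1. Read the whole file into memory and deserialize it.
    with open(voice_path, "rb") as f:
        buf = io.BytesIO(f.read())
    tensor = pickle.load(buf)

    # 2. Serialize right back and write a brand-new temp file. Note the
    #    default tempfile.gettempdir() location, not the app's temp_file_dir.
    fd, temp_path = tempfile.mkstemp(prefix="temp_voice_", suffix=".pt")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(tensor, f)
    return temp_path  # nothing ever deletes this file

# Demo: every call allocates fresh buffers and leaves one more orphan behind.
fd, src = tempfile.mkstemp(suffix=".pt")
with os.fdopen(fd, "wb") as f:
    pickle.dump([0.0] * 1000, f)

leaked = [voice_roundtrip(src) for _ in range(3)]
print(len(set(leaked)))  # 3 distinct temp_voice_* files
print(all(os.path.dirname(p) == tempfile.gettempdir() for p in leaked))  # True
```

With a real ~10-15MB voice tensor, each call churns through several buffer copies of that size, which is where the heap fragmentation comes from.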
The temp files (`temp_voice_*`) are written to Python's `tempfile.gettempdir()` (system `/tmp`), NOT the app's configured `temp_file_dir`, so the app's `cleanup_temp_files()` never finds or cleans them.
Additionally, `AudioService._writers` in `api/src/services/audio.py` is a class-level dict that accumulates `StreamingAudioWriter` objects on client disconnect or error (the writer key is never removed if `is_last_chunk` is never reached).
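A minimal stand-in for this accumulation pattern (the real `StreamingAudioWriter` is more involved; this only shows how entries pile up in the class-level dict):

```python
class StreamingAudioWriter:
    # Hypothetical minimal stand-in for the real writer object.
    def __init__(self):
        self.buffer = bytearray()

class AudioService:
    _writers = {}  # class-level: shared across every request

    @classmethod
    def convert_audio(cls, request_id, chunk, is_last_chunk=False):
        writer = cls._writers.setdefault(request_id, StreamingAudioWriter())
        writer.buffer.extend(chunk)
        if is_last_chunk:
            del cls._writers[request_id]  # the ONLY place entries are removed
        return bytes(writer.buffer)

# Five clients disconnect mid-stream: is_last_chunk=True never arrives,
# so their writers stay in the class dict for the life of the process.
for i in range(5):
    AudioService.convert_audio(f"req-{i}", b"\x00" * 1024)
print(len(AudioService._writers))  # 5 leaked writers
```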
## Suggested Fixes
- Cache the voice tensor and temp file path in `KokoroV1`, skipping the load/save cycle when the same voice is used again
- Use `settings.temp_file_dir` for all temp files so the cleanup routine can find them
- Add a `finally` block in `AudioService.convert_audio()` to remove the writer key on exception
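A sketch of the three fixes under a simplified API (class and method names are illustrative; the real code would call `load_voice_tensor`/`save_voice_tensor` where the byte copy is):

```python
import os
import tempfile

class KokoroV1:
    # Fixes 1 and 2 (sketch): cache the prepared temp file per voice path,
    # and create it under the configured temp_file_dir so cleanup can see it.
    def __init__(self, temp_file_dir):
        self.temp_file_dir = temp_file_dir   # i.e. settings.temp_file_dir
        self._voice_cache = {}               # voice_path -> temp file path

    def _prepare_voice(self, voice_path):
        cached = self._voice_cache.get(voice_path)
        if cached is not None and os.path.exists(cached):
            return cached                    # same voice again: skip round-trip
        fd, temp_path = tempfile.mkstemp(
            prefix="temp_voice_", suffix=".pt", dir=self.temp_file_dir
        )
        # A plain byte copy stands in for the torch load/save round-trip.
        with os.fdopen(fd, "wb") as dst, open(voice_path, "rb") as src:
            dst.write(src.read())
        self._voice_cache[voice_path] = temp_path
        return temp_path

class AudioService:
    # Fix 3 (sketch): treat any exception as end-of-stream so the writer
    # entry is always removed from the class-level dict.
    _writers = {}

    @classmethod
    def convert_audio(cls, request_id, chunk, is_last_chunk=False):
        writer = cls._writers.setdefault(request_id, bytearray())
        try:
            writer.extend(chunk)
            return bytes(writer) if is_last_chunk else b""
        except Exception:
            is_last_chunk = True             # error: clean up anyway
            raise
        finally:
            if is_last_chunk:
                cls._writers.pop(request_id, None)

# Demo: repeated use of the same voice reuses one temp file...
app_tmp = tempfile.mkdtemp()
voice = os.path.join(app_tmp, "voice.pt")
with open(voice, "wb") as f:
    f.write(b"fake tensor bytes")
model = KokoroV1(temp_file_dir=app_tmp)
first, second = model._prepare_voice(voice), model._prepare_voice(voice)
print(first == second)  # True

# ...and an error mid-stream no longer strands the writer entry.
try:
    AudioService.convert_audio("req-1", None)  # None triggers a TypeError
except TypeError:
    pass
print(len(AudioService._writers))  # 0
```

The cache also bounds steady-state memory: one prepared temp file per distinct voice, instead of one per request.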
## Environment
- Image: `ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4`
- Host: 31GB RAM, Linux 6.17.0
- Usage pattern: ~50-100 TTS requests/day via local API calls