audio-to-text pipeline fails on return_timestamps=word #390

@ad-astra-video

Describe the bug

The audio-to-text pipeline does not return word-level timestamps when Flash Attention 2 is enabled; the request fails with an error instead.

@RUFFY-369 is there a way to switch to SDPA when word-level timestamps are requested, without reloading the pipeline onto the GPU?
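One possible shape for the fallback asked about above, as a rough sketch only: pick the attention backend per request and keep one lazily built pipeline per backend, so a word-timestamp request does not trigger a reload every time. `select_attn_implementation`, `PipelineCache`, and the `loader` callable are hypothetical names, not part of the AI worker. In practice the loader might pass `attn_implementation=attn` to a Transformers `from_pretrained` call, but that wiring is an assumption. This does not answer whether the backend can be swapped in place, and holding two pipelines costs additional GPU memory.

```python
from typing import Any, Callable, Dict, Optional

# Backend named in this issue, and the fallback being asked about.
DEFAULT_ATTN = "flash_attention_2"
WORD_TIMESTAMP_ATTN = "sdpa"


def select_attn_implementation(return_timestamps: Optional[str]) -> str:
    """Pick an attention backend for one request (hypothetical helper)."""
    # Word-level timestamps are the case that fails with Flash Attention 2.
    if return_timestamps == "word":
        return WORD_TIMESTAMP_ATTN
    return DEFAULT_ATTN


class PipelineCache:
    """Lazily builds and keeps one pipeline per attention backend."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        # `loader` is an assumption: any callable that builds a pipeline
        # for the given attention backend name.
        self._loader = loader
        self._pipelines: Dict[str, Any] = {}

    def get(self, return_timestamps: Optional[str]) -> Any:
        attn = select_attn_implementation(return_timestamps)
        if attn not in self._pipelines:
            # Built once on first use, then reused for later requests.
            self._pipelines[attn] = self._loader(attn)
        return self._pipelines[attn]
```

With a stub loader, `PipelineCache(loader).get("word")` builds the SDPA pipeline on the first call and reuses it afterwards, while requests without `return_timestamps=word` keep using the Flash Attention 2 pipeline.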

Reproduction steps

  1. Download the new audio-to-text pipeline with Flash Attention 2 enabled
  2. Send a request to the pipeline that includes return_timestamps=word
    curl -X POST http://172.17.0.1:6666/audio-to-text -F "audio=@test-audio.mp4" -F "model_id=openai/whisper-large-v3" -F "return_timestamps=word"
  3. See error returned
    {"error":{"message":": Error during model execution: WhisperFlashAttention2 attention does not support output_attentions."}}
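A caller or a test could recognize this specific failure from the response body. The following is a minimal sketch, assuming the error keeps the JSON shape shown in step 3; `is_output_attentions_error` and `FLASH_ATTN_MARKER` are hypothetical names used for illustration only.

```python
import json

# Substring taken from the error message reported in step 3.
FLASH_ATTN_MARKER = "does not support output_attentions"


def is_output_attentions_error(body: str) -> bool:
    """Return True if `body` looks like the error reported in this issue."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    message = error.get("message")
    return isinstance(message, str) and FLASH_ATTN_MARKER in message
```

Matching on a message substring is brittle, so this is better suited to a regression test for this bug than to production error handling.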

Expected behaviour

The pipeline returns word-level timestamps.

Severity

None

Screenshots / Live demo link

No response

OS

None

Running on

None

AI-worker version

No response

Additional context

No response

Metadata

Assignees

Labels

bug: Something isn't working
