[speechlm2] Add streaming inference pipeline for NemotronVoiceChat #15571
erastorgueva-nv wants to merge 35 commits into NVIDIA-NeMo:main
    # Cache results for future lookups
    if not self.training and self.use_tts_subword_cache:
        valid_embeds = subword_embeds[subword_mask].detach()
        for idx, sid in enumerate(valid_ids):
Check failure (Code scanning / CodeQL): Potentially uninitialized local variable (Error)
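CodeQL raises this alert when a name such as `valid_ids` is assigned on only some code paths before the loop reads it. The surrounding code is not visible in this excerpt, so the following is only a minimal sketch of the usual fix, with a hypothetical function and plain dicts standing in for the cache and embeddings: bind the variable on every path before it is used.

```python
def update_cache(cache, token_ids, embeds, use_cache):
    """Store embeddings for ids not yet cached (hypothetical stand-in, not the PR code)."""
    # Bind valid_ids unconditionally so the loop below never sees an unbound name.
    valid_ids = []
    if use_cache:
        valid_ids = [tid for tid in token_ids if tid not in cache]
    for sid in valid_ids:
        cache[sid] = embeds[sid]
    return cache
```

With `use_cache=False` the loop simply iterates over an empty list instead of raising `UnboundLocalError`.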
        "ignore_eos": True,
        "guidance_scale": guidance_scale,
    }
    self.sampling_params = SamplingParams(**default_sampling)
Check warning (Code scanning / CodeQL): Overwriting attribute in super-class or sub-class (Warning)
        tts_model_cfg = cfg['model']['speech_generation']['model']
        tts_model_cfg['pretrained_model'] = None
        tts_model_cfg['pretrained_codec_model'] = None
    except (KeyError, TypeError):
Check notice (Code scanning / CodeQL): Empty except (Note)
    """Cleanup on deletion."""
    try:
        self.shutdown()
    except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)
    # Try to abort cleanly first
    try:
        await self.engine.abort_generation(request_id)
    except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)
        tid = tokenizer.convert_tokens_to_ids(token)
        if isinstance(tid, int):
            special_ids.add(tid)
    except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)
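The four "Empty except" notes above share one common remedy: keep the best-effort behaviour, but leave a trace of what went wrong. A minimal sketch follows, using the standard-library `logging` module; the logger name and the `close_quietly` helper are hypothetical and are not part of the PR.

```python
import logging

logger = logging.getLogger("s2s_cleanup_sketch")  # hypothetical logger name


def close_quietly(resource):
    """Call resource.shutdown(); swallow failures but record them at DEBUG level."""
    try:
        resource.shutdown()
        return True
    except Exception:
        # Still best-effort, but the failure is no longer silent.
        logger.debug("shutdown() failed; continuing", exc_info=True)
        return False
```

Where the exception really is expected and uninteresting, a short comment in the `except` body explaining why it is ignored is usually enough to address the note.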
…model.py modification for function_head Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…with patches Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…le, optional torch.compile & subword cache Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…g - adjusted infer_one_step code so operations will match offline Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…nce wrapper loading Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…s which will be ignored anyway Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…_history_size parameter Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…StepResult etc Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…ogit comparison Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…ep, add docs Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
… for parity Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…tic parity Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…kens_to_str_raw Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
…ering Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
150aab1 to 81a752e
nemo/collections/speechlm2/inference/model_wrappers/model_factory.py (outdated, resolved)
nemo/collections/speechlm2/inference/model_wrappers/decode_state.py (outdated, resolved)
examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py (outdated, resolved)
examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py (outdated, resolved)
nemo/collections/speechlm2/inference/streaming/state/s2s_state.py (outdated, resolved)
nemo/collections/speechlm2/inference/streaming/state/s2s_state.py (outdated, resolved)
nemo/collections/speechlm2/inference/pipelines/s2s_pipeline_interface.py (outdated, resolved)
nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py (outdated, resolved)
…ing params Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
pzelasko left a comment
Partial review up to streaming_s2s_pipeline.py at line 511 (note to self where to pick up later)
    from nemo.collections.speechlm2.inference import S2SPipelineBuilder

    pipeline = S2SPipelineBuilder.build_pipeline(cfg)
    output = pipeline.run(audio_filepaths, options=options)
Does this assume a single-turn evaluation? Or the audio file can have multiple turns and the agent is expected to handle that correctly? Let's clarify this in the docs.
Reply: Not sure what you mean - it's full-duplex, so it just generates one frame of output for every frame of audio input. The audio input can contain a single turn, multiple turns, whatever.
Or if you're asking about "evaluation" - the code doesn't support detailed "evaluation". We just generate text & audio for the full audio file (with an option to add silence padding at the end, so the agent can finish speaking). The one bit of "evaluation" we have is WER.
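The reply above implies a simple frame budget: one model step, and one output frame, per input chunk, with optional trailing silence so the agent can finish speaking. A rough sketch of that arithmetic follows. The 0.08 s chunk comes from the usage command in the PR description; the 16 kHz sample rate, the function name, and the zero-padding of a partial last chunk are assumptions made for illustration.

```python
import math


def frame_count(num_samples, sample_rate=16000, chunk_secs=0.08, pad_silence_secs=0.0):
    """Number of input chunks (and therefore output frames) for one audio file."""
    chunk_samples = round(sample_rate * chunk_secs)
    padded = num_samples + round(sample_rate * pad_silence_secs)
    # Assumption: a partial last chunk is zero-padded and still costs one step.
    return math.ceil(padded / chunk_samples)
```

Under these assumptions one second of audio is 12.5 chunks of 1280 samples, so `frame_count(16000)` gives 13 steps, regardless of how many conversational turns the audio contains.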
    .. code-block:: bash

        python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
            audio_file=/path/to/audio \
Both examples here showcase audio_file. We need to mention how to perform live streaming inference (using mic or other streaming audio input connector) if it is supported by this API; or that it is not supported.
    from nemo.collections.speechlm2.inference import S2SPipelineBuilder

    pipeline = S2SPipelineBuilder.build_pipeline(cfg)
    output = pipeline.run(audio_filepaths, options=options)
Is there another entry-point with a streaming input connector (mic)? We should mention.
    .. code-block:: python

        pipeline.open_session()
        for frames in streamer:
Can we show how streamer is constructed? You'd normally refer the user to ASR pipelines documentation but it doesn't exist yet in main IIRC, so we need to describe at least basic concepts / APIs.
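This excerpt does not show how `streamer` is built. Conceptually, a streamer turns an audio signal into a sequence of fixed-size chunks that are fed to the pipeline one step at a time. The generator below is a hypothetical plain-Python stand-in for that idea; it is not the NeMo streamer class, and its name and zero-padding behaviour are assumptions.

```python
def iter_frames(samples, chunk_samples, pad_value=0.0):
    """Yield fixed-size chunks of `samples`, padding the last one with `pad_value`."""
    for start in range(0, len(samples), chunk_samples):
        frame = list(samples[start:start + chunk_samples])
        if len(frame) < chunk_samples:
            # Assumption: the tail is padded so every step sees the same chunk size.
            frame.extend([pad_value] * (chunk_samples - len(frame)))
        yield frame
```

A live source (for example a microphone callback) could feed the same loop by yielding chunks as they arrive instead of slicing a finished file.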
    pipeline.open_session()
    for frames in streamer:
        pipeline.generate_step(frames)
does this emit some intermediate results? Can we show that?
    asr_predicted_text_strs = self._tokens_to_strings(asr_predicted_tokens)

    logging.info(f'frame {frame_idx}: USER asr: {asr_predicted_text_strs}')
    logging.info(f'frame {frame_idx}: AGENT txt: {predicted_text_strs}')
move to logging.debug, this will be spamming the logs a lot
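A sketch of the suggested change, written against the standard-library `logging` module as an assumption (the PR's own `logging` object may be a different wrapper). The logger name and the `log_frame` helper are hypothetical.

```python
import logging

logger = logging.getLogger("s2s_frame_log_sketch")  # hypothetical logger name


def log_frame(frame_idx, asr_strs, agent_strs):
    """Emit per-frame text at DEBUG so INFO-level logs stay quiet."""
    # Lazy %-style arguments: the message is only formatted if DEBUG is enabled.
    logger.debug("frame %d: USER asr: %s", frame_idx, asr_strs)
    logger.debug("frame %d: AGENT txt: %s", frame_idx, agent_strs)
```

At 0.08 s per chunk, per-frame `info` logging would produce on the order of 25 lines per second of audio (two messages per frame), which is why the per-frame text belongs at `debug`.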
    # infer_one_step sub-stages
    # ------------------------------------------------------------------

    def _build_input_embedding(
It looks like this method should live in DuplexSTT class? It's exposing inner workings of input construction to a high-level inference API.
If we build DuplexSTTv2 which does it completely differently, we don't want to re-write this wrapper - we should just call stt_model.build_input_embedding()
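The delegation the reviewer describes could look roughly like the following. Both classes are hypothetical stand-ins (the real `DuplexSTT` signature is not shown in this excerpt), and the element-wise sum is only a placeholder for whatever the model actually does to combine its inputs.

```python
class DuplexSTTSketch:
    """Hypothetical model class that owns its input-construction recipe."""

    def build_input_embedding(self, audio_emb, text_emb):
        # Placeholder recipe (assumption): element-wise sum of the two streams.
        return [a + t for a, t in zip(audio_emb, text_emb)]


class InferenceWrapperSketch:
    """Hypothetical wrapper that does not know how inputs are built."""

    def __init__(self, stt_model):
        self.stt_model = stt_model

    def infer_one_step(self, audio_emb, text_emb):
        # The wrapper only delegates; a future model version can change the recipe.
        return self.stt_model.build_input_embedding(audio_emb, text_emb)
```

With this split, a later model class that builds its inputs differently only needs to provide its own `build_input_embedding`; the wrapper stays unchanged.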
        return emb

    def _run_llm_step(
This method should be split to two and live in native / vllm LLM class
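One way to read this suggestion is a small backend interface with one implementation per engine, selected by the `s2s.engine_type` value that appears in the usage command. The sketch below is illustrative only: the class names, the `step` method, and the returned dict are all hypothetical and do not come from the PR.

```python
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Hypothetical interface: one LLM step on an input embedding."""

    @abstractmethod
    def step(self, emb):
        ...


class NativeBackend(LLMBackend):
    def step(self, emb):
        # Placeholder for the native (PyTorch) code path.
        return {"engine": "native", "num_inputs": len(emb)}


class VLLMBackend(LLMBackend):
    def step(self, emb):
        # Placeholder for the vLLM code path.
        return {"engine": "vllm", "num_inputs": len(emb)}


def make_backend(engine_type):
    """Pick a backend by name; raises KeyError for unknown engine types."""
    backends = {"native": NativeBackend, "vllm": VLLMBackend}
    return backends[engine_type]()
```

The wrapper would then call `backend.step(emb)` without branching on the engine type inside `_run_llm_step`.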
    @dataclass
    class PerceptionCUDAGraphState:
Should this (partially) live in ASR collection? Could we re-use your work here to accelerate streaming models like nemotron-speech-asr?
    state.static_cache_channel_len_in = cache_last_channel_len.clone()

    logging.info(f" Warming up encoder for CUDA graph capture...")
    for _ in range(3):
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Add a streaming (real-time, chunk-by-chunk) inference pipeline for NemotronVoiceChat,
following the same architecture as the NeMo ASR Inference Pipelines.
Collection: speechlm2
Changelog
- `StreamingS2SPipeline` with `generate_step()` API for both batch file processing and server integration
- `NemotronVoicechatInferenceWrapper` with `infer_one_step()` implementing perception → LLM → TTS → codec decode
- `S2SPipelineBuilder` factory and Hydra config (`s2s_streaming.yaml`) for easy setup
- `S2SContextManager` for decode state lifecycle, `S2SStreamingState` for output accumulation
- `s2s_streaming_infer.py` entry script for batch inference on files/manifests
- `DuplexSTTModel`: KV cache support for Nemotron hybrid Mamba/Attention (with monkey-patches for upstream HF bugs), `save_pretrained` with tokenizer export, function head, ASR logit boosts, `cache_position` forwarding
- `conftest.py` fixtures, offline-vs-streaming parity test, no-crash config sweep
- `streaming_inference.rst` with architecture, config reference, and server integration guide

Modifications to more general code - FYI @kevinhu-nv @Edresson
- `NemotronVoiceChat`: `from_pretrained` supports loading from HF-format checkpoint with `llm_artifacts/`
- `EarTTSModel`: vectorized depth-sum, precomputed RVQ schedule, optional `torch.compile`, subword cache
- `_patch_nemotron_cache_bugs` and `_patch_nemotron_block_forward` methods in `DuplexSTTModel` are patching bugs in the HF Nemotron model code so we can get the KV caching to work. The patches seem to work for me, though I wonder if we can use more up-to-date code that doesn't have the patches.

Usage

    python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
        audio_file=/path/to/audio.wav \
        s2s.model_path=/path/to/checkpoint \
        s2s.speaker_name="<name>" \
        s2s.engine_type="native" \
        streaming.chunk_size_in_secs=0.08 \
        streaming.buffer_size_in_secs=1.68

GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information