Feat/stream asr #37
Pull request overview
This PR implements streaming ASR (Automatic Speech Recognition) functionality, enabling real-time speech-to-text conversion with Voice Activity Detection (VAD). The changes introduce a new stream_asr mode that allows incremental ASR results to be sent to clients as speech is detected, rather than waiting for complete audio submission.
Changes:
- Added streaming ASR support for both Whisper and Paraformer ASR backends with real-time VAD integration
- Integrated Silero VAD model for server-side speech detection with configurable parameters
- Refactored session handling to support new WebSocket commands for streaming audio and VAD events
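To make the VAD-gated streaming idea concrete, here is a minimal sketch of a segmenter that turns per-frame speech probabilities into start and end events. It is not the Silero VAD API or this PR's `VadSession`; the `Segmenter` type, the `VadEvent` enum, and the parameter values are all hypothetical and only illustrate how a threshold and a silence limit might gate when ASR results get flushed.

```rust
/// Events a streaming session might emit as speech starts and stops.
/// (Hypothetical type, not taken from the PR.)
#[derive(Debug, PartialEq)]
enum VadEvent {
    SpeechStart { frame: usize },
    SpeechEnd { frame: usize },
}

/// Hypothetical segmenter illustrating threshold + silence-limit gating.
struct Segmenter {
    threshold: f32,
    max_silence_frames: usize,
    in_speech: bool,
    silence_run: usize,
}

impl Segmenter {
    fn new(threshold: f32, max_silence_frames: usize) -> Self {
        Self { threshold, max_silence_frames, in_speech: false, silence_run: 0 }
    }

    /// Feed one per-frame speech probability; returns an event on a state change.
    fn push(&mut self, frame: usize, prob: f32) -> Option<VadEvent> {
        if prob >= self.threshold {
            // Any speech frame resets the silence counter.
            self.silence_run = 0;
            if !self.in_speech {
                self.in_speech = true;
                return Some(VadEvent::SpeechStart { frame });
            }
        } else if self.in_speech {
            // Count consecutive silent frames while inside a speech segment.
            self.silence_run += 1;
            if self.silence_run >= self.max_silence_frames {
                self.in_speech = false;
                self.silence_run = 0;
                return Some(VadEvent::SpeechEnd { frame });
            }
        }
        None
    }
}

fn main() {
    // Made-up probabilities; a real session would get these from the VAD model.
    let probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1];
    let mut seg = Segmenter::new(0.5, 2);
    for (i, p) in probs.iter().enumerate() {
        if let Some(ev) = seg.push(i, *p) {
            println!("{:?}", ev);
        }
    }
}
```

In a streaming server, a `SpeechEnd` event would be the natural point to finalize the current ASR segment and notify the client, while a short dip in probability (frame 3 above) does not end the segment.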
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/util.rs | Added bidirectional audio format conversion utilities and RIFF tag validation |
| src/services/ws/stable/mod.rs | Added stream_asr flag and helper methods for sending ASR results and control messages |
| src/services/ws/stable/asr.rs | Implemented streaming ASR methods for Whisper and Paraformer with VAD integration |
| src/services/ws.rs | Added EndVad command and stream_asr parameter support |
| src/services/mod.rs | Added stream_asr parameter to connection query params |
| src/protocol.rs | Added EndVad server event |
| src/main.rs | Added /version endpoint |
| src/config.rs | Added SileroVadconfig for VAD parameters and updated WhisperASRConfig |
| src/ai/vad.rs | Implemented VadSession and VadFactory for Silero VAD integration |
| src/ai/mod.rs | Changed logging from messages array to last_message only |
| src/ai/bailian/realtime_asr.rs | Added semantic_punctuation_enabled parameter and streaming test |
| Cargo.toml | Added silero_vad_burn and burn dependencies |
src/services/ws/stable/asr.rs
Outdated
| recv_audio_bytes += data.len(); | ||
| if !recv_any_asr_result && recv_audio_bytes >= 16000 * 10 { | ||
| log::warn!( | ||
| "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically", |
Copilot
AI
Jan 22, 2026
The condition fires at 16000 * 10 = 160,000 bytes, which the code appears to treat as 10 seconds of audio, but the warning message says '30s'. Either the threshold should be 16000 * 30 or the message should say '10s' to match the actual logic.
| "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically", | |
| "`{}` paraformer asr received more than 10s audio without StartChat, starting automatically", |
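Whether 160,000 bytes really corresponds to 10 seconds depends on the sample format, which this excerpt does not show. As a sketch, assuming raw PCM, the byte count for a given duration can be computed explicitly; the helper name `pcm_bytes` and the 16 kHz mono 16-bit format in `main` are assumptions for illustration, not facts about the PR.

```rust
/// Bytes of raw PCM needed to cover `seconds` of audio.
/// (Hypothetical helper; the PR uses a bare `16000 * 10` literal.)
const fn pcm_bytes(
    sample_rate_hz: usize,
    bytes_per_sample: usize,
    channels: usize,
    seconds: usize,
) -> usize {
    sample_rate_hz * bytes_per_sample * channels * seconds
}

fn main() {
    // Assumption: 16 kHz mono with 16-bit (2-byte) samples.
    let ten_seconds = pcm_bytes(16_000, 2, 1, 10);
    println!("10 s of 16 kHz mono 16-bit PCM = {} bytes", ten_seconds);

    // Under the same assumption, the literal threshold covers fewer seconds.
    let threshold = 16_000 * 10;
    println!("160,000 bytes = {} s", threshold / pcm_bytes(16_000, 2, 1, 1));
}
```

If the stream is 16-bit PCM, `16000 * 10` bytes is 5 seconds rather than 10, so naming the threshold through a helper like this would keep the log message and the check in agreement.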
src/config.rs
Outdated
| pub struct SileroVadconfig { | ||
| #[serde(default = "SileroVadconfig::default_threshold")] | ||
| pub threshold: f32, | ||
| #[serde(default = "SileroVadconfig::default_neg_threshold")] | ||
| pub neg_threshold: Option<f32>, | ||
| #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")] | ||
| pub min_speech_duration_ms: usize, | ||
| #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")] | ||
| pub max_silence_duration_ms: usize, | ||
| #[serde(default = "SileroVadconfig::hangover_ms")] | ||
| pub hangover_ms: usize, | ||
| } | ||
|
|
||
| impl SileroVadconfig { |
Copilot
AI
Jan 22, 2026
The type name has inconsistent casing. It should be SileroVadConfig (capital 'C') to follow Rust's UpperCamelCase convention for type names.
| pub struct SileroVadconfig { | |
| #[serde(default = "SileroVadconfig::default_threshold")] | |
| pub threshold: f32, | |
| #[serde(default = "SileroVadconfig::default_neg_threshold")] | |
| pub neg_threshold: Option<f32>, | |
| #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")] | |
| pub min_speech_duration_ms: usize, | |
| #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")] | |
| pub max_silence_duration_ms: usize, | |
| #[serde(default = "SileroVadconfig::hangover_ms")] | |
| pub hangover_ms: usize, | |
| } | |
| impl SileroVadconfig { | |
| pub struct SileroVadConfig { | |
| #[serde(default = "SileroVadConfig::default_threshold")] | |
| pub threshold: f32, | |
| #[serde(default = "SileroVadConfig::default_neg_threshold")] | |
| pub neg_threshold: Option<f32>, | |
| #[serde(default = "SileroVadConfig::default_min_speech_duration_ms")] | |
| pub min_speech_duration_ms: usize, | |
| #[serde(default = "SileroVadConfig::default_max_silence_duration_ms")] | |
| pub max_silence_duration_ms: usize, | |
| #[serde(default = "SileroVadConfig::hangover_ms")] | |
| pub hangover_ms: usize, | |
| } | |
| impl SileroVadConfig { |
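The `#[serde(default = "...")]` attributes above each name a function path that serde calls when the field is missing from the input, so every referenced path has to resolve to a function returning the field's type. The `impl` body is not shown in this excerpt, so the following is only a sketch of what such functions might look like. It leaves serde out to stay self-contained, and every default value is a placeholder rather than the project's real setting.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct SileroVadConfig {
    pub threshold: f32,
    pub neg_threshold: Option<f32>,
    pub min_speech_duration_ms: usize,
    pub max_silence_duration_ms: usize,
    pub hangover_ms: usize,
}

impl SileroVadConfig {
    // Placeholder values, chosen only for illustration.
    fn default_threshold() -> f32 { 0.5 }
    // The return type must match the field type, here Option<f32>.
    fn default_neg_threshold() -> Option<f32> { None }
    fn default_min_speech_duration_ms() -> usize { 250 }
    fn default_max_silence_duration_ms() -> usize { 500 }
    // Named with the same `default_` prefix as the others (an assumption;
    // the attribute in the PR refers to `hangover_ms`).
    fn default_hangover_ms() -> usize { 100 }
}

impl Default for SileroVadConfig {
    fn default() -> Self {
        Self {
            threshold: Self::default_threshold(),
            neg_threshold: Self::default_neg_threshold(),
            min_speech_duration_ms: Self::default_min_speech_duration_ms(),
            max_silence_duration_ms: Self::default_max_silence_duration_ms(),
            hangover_ms: Self::default_hangover_ms(),
        }
    }
}

fn main() {
    // Override one field and take the rest from the defaults.
    let cfg = SileroVadConfig { threshold: 0.6, ..Default::default() };
    println!("{:?}", cfg);
}
```

Routing `Default` through the same per-field functions keeps a single source for each default, whether the struct is built in code or deserialized from a config file.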
| "streaming": "duplex" | ||
| }, | ||
| "payload": { | ||
| "task_group": "audio", |
Copilot
AI
Jan 22, 2026
The task_group field is added without explanation or documentation. Consider adding a comment explaining why this field is necessary and what impact it has on the ASR behavior.
src/services/ws.rs
Outdated
| .map_err(|_| anyhow::anyhow!("audio_tx closed"))?; | ||
|
|
||
| if DEBUG_WAV { | ||
| if debug_wav_data.len() > 0 { |
Copilot
AI
Jan 22, 2026
Use !debug_wav_data.is_empty() instead of debug_wav_data.len() > 0; it is the more idiomatic Rust form.
| if debug_wav_data.len() > 0 { | |
| if !debug_wav_data.is_empty() { |
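For reference, the two forms are equivalent on slices and `Vec`s, and Clippy's `len_zero` lint flags the `len() > 0` spelling. A minimal sketch, assuming `debug_wav_data` holds bytes (its element type is not shown in the excerpt) and using a hypothetical helper name:

```rust
/// Hypothetical helper: decide whether there is any debug audio to write.
fn should_write_debug_wav(debug_wav_data: &[u8]) -> bool {
    // Same result as `debug_wav_data.len() > 0`, but states the intent directly.
    !debug_wav_data.is_empty()
}

fn main() {
    let empty: Vec<u8> = Vec::new();
    let some = vec![0u8, 1, 2];
    println!("{}", should_write_debug_wav(&empty));
    println!("{}", should_write_debug_wav(&some));
}
```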
No description provided.