Refactor realtime_ws to use built-in silero VAD #39
base: main
Conversation
Replace deprecated HTTP-based vad_url and WebSocket-based vad_realtime_url with the built-in silero_vad_burn library:

- Add VadSession to RealtimeSession for local VAD processing
- Process audio through silero VAD inline when receiving audio chunks
- Detect speech start/end events locally instead of via remote service
- Use local VAD in handle_audio_buffer_commit instead of HTTP call
- Remove VadRealtimeClient and VadRealtimeRx dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR refactors the realtime WebSocket service to use built-in silero VAD instead of external HTTP and WebSocket-based VAD services.
Changes:
- Replaced remote VAD service calls with a local VadSession using the silero_vad_burn library
- Integrated inline VAD processing during audio chunk reception
- Removed deprecated VadRealtimeClient and VadRealtimeRx dependencies
src/services/realtime_ws.rs
Outdated
```rust
let vad_session = match crate::ai::vad::VadSession::new(
    &config.asr.vad,
    Box::new(
        silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
            .expect("Failed to create silero VAD model"),
    ),
    burn::backend::ndarray::NdArrayDevice::default(),
```
Copilot AI · Jan 24, 2026
Using expect with a generic error message can crash the application without providing context. Consider using a more descriptive error message that includes potential causes (e.g., missing model files, insufficient memory) or propagating the error using ? to let the caller handle initialization failures gracefully.
Suggested change:
```diff
-let vad_session = match crate::ai::vad::VadSession::new(
-    &config.asr.vad,
-    Box::new(
-        silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
-            .expect("Failed to create silero VAD model"),
-    ),
-    burn::backend::ndarray::NdArrayDevice::default(),
+let device = burn::backend::ndarray::NdArrayDevice::default();
+let vad_model = match silero_vad_burn::SileroVAD6Model::new(&device) {
+    Ok(model) => model,
+    Err(e) => {
+        log::error!(
+            "Failed to create silero VAD model: {}. \
+             Possible causes include missing or corrupted model files, \
+             incompatible hardware, or insufficient memory.",
+            e
+        );
+        // Drop this WebSocket connection gracefully instead of panicking the server
+        return;
+    }
+};
+let vad_session = match crate::ai::vad::VadSession::new(
+    &config.asr.vad,
+    Box::new(vad_model),
+    device,
```
src/services/realtime_ws.rs
Outdated
```rust
if let Ok(is_speech) = vad_session.detect(chunk) {
    if is_speech {
        speech_detected = true;
    } else if session.triggered && !is_speech {
```
Copilot AI · Jan 24, 2026
The !is_speech test in session.triggered && !is_speech is redundant: this else-if branch is only reached when is_speech is already false, so the extra check can never change the outcome. Simplify the branch to else if session.triggered, or restructure the loop if the intent is to detect speech-end events across chunk boundaries.
Suggested change:
```diff
-} else if session.triggered && !is_speech {
+} else if session.triggered {
```
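To make the reachable state transitions explicit, the per-chunk branch can be sketched as a small pure function (an illustrative restructuring, not code from this PR; the name classify is hypothetical):

```rust
// Illustrative: in the `else` arm, `is_speech` is already known to be false,
// so re-testing `!is_speech` there is dead code. Only three outcomes exist.
fn classify(triggered: bool, is_speech: bool) -> Option<&'static str> {
    if is_speech {
        Some("speech") // may start or continue a segment
    } else if triggered {
        Some("speech_end") // silence while a segment was active
    } else {
        None // silence outside any segment
    }
}
```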
src/services/realtime_ws.rs
Outdated
```rust
        speech_detected = true;
    } else if session.triggered && !is_speech {
```
Copilot AI · Jan 24, 2026
The loop continues processing all chunks even after detecting speech. Once speech_detected is set to true (and !session.triggered is true), the code could break early to avoid unnecessary VAD processing of remaining audio chunks.
Suggested change:
```diff
-    speech_detected = true;
-} else if session.triggered && !is_speech {
+    // Speech started; if we weren't already triggered, mark and stop processing
+    if !session.triggered {
+        speech_detected = true;
+        break;
+    }
+} else if session.triggered {
```
```rust
let chunk_size = VadSession::vad_chunk_size();
let mut has_speech = false;
vad_session.reset_state();
for chunk in samples_16k.chunks(chunk_size) {
    if let Ok(is_speech) = vad_session.detect(chunk) {
        if is_speech {
            has_speech = true;
            break;
```
Copilot AI · Jan 24, 2026
The VAD processing logic (conversion of 24kHz PCM16 to 16kHz f32, chunking, and detection) is duplicated between the inline audio processing (lines 454-500) and the commit handler (lines 668-688). Consider extracting this into a helper function to improve maintainability and ensure consistent behavior.
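The duplicated PCM16-to-f32 conversion could be factored out along these lines (a hedged sketch; the helper name pcm16_le_to_f32 and its placement are assumptions, not code from this PR):

```rust
// Hypothetical helper extracting the duplicated conversion:
// little-endian PCM16 bytes -> normalized f32 samples in roughly [-1.0, 1.0].
fn pcm16_le_to_f32(audio_data: &[u8]) -> Vec<f32> {
    audio_data
        .chunks_exact(2) // one i16 sample per 2 bytes; a trailing odd byte is dropped
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32)
        .collect()
}
```

Both the inline append handler and handle_audio_buffer_commit could then call this one function, keeping the normalization factor consistent.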
Remove unnecessary !is_speech check in else-if branch since is_speech is already false when we reach that branch. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace expect() with proper error handling to avoid panics. Now gracefully handles initialization failures with descriptive error messages indicating potential causes like missing model files or insufficient memory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
```rust
let samples_24k: Vec<f32> = audio_data
    .chunks_exact(2)
    .map(|chunk| {
        i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
    })
    .collect();
```
Copilot AI · Jan 24, 2026
This audio conversion logic (24kHz PCM16 to f32) is duplicated in three locations: here (lines 462-467), in the handle_audio_buffer_commit function (lines 677-680), and appears twice within the same InputAudioBufferAppend handler. Consider extracting this into a helper function to improve maintainability and reduce duplication.
src/services/realtime_ws.rs
Outdated
```rust
if let Some(vs) = session.vad_session.as_mut() {
    vs.reset_state();
}
break;
```
Copilot AI · Jan 24, 2026
When speech ends (line 478-490), the code resets VAD state after committing the audio buffer. However, if multiple chunks remain in the same audio packet, the break statement prevents them from being processed. This could cause the VAD to miss subsequent speech segments in the same audio append operation, leading to incomplete speech detection.
Suggested change:
```diff
-if let Some(vs) = session.vad_session.as_mut() {
-    vs.reset_state();
-}
-break;
+// Reset VAD state so we can detect subsequent speech segments
+vad_session.reset_state();
```
src/services/realtime_ws.rs
Outdated
```rust
// Send speech started event
let event = ServerEvent::InputAudioBufferSpeechStarted {
    event_id: Uuid::new_v4().to_string(),
    audio_start_ms: 0,
```
Copilot AI · Jan 24, 2026
The audio_start_ms is hardcoded to 0, which doesn't reflect the actual timestamp when speech started. This should track the cumulative audio duration processed to provide accurate timing information for the InputAudioBufferSpeechStarted event.
Add audio_position_ms field to RealtimeSession to track cumulative audio duration. The InputAudioBufferSpeechStarted event now reports the actual timestamp when speech was detected instead of a hardcoded 0.

- Track audio position as chunks are appended (24kHz PCM16 = 48 bytes/ms)
- Reset position when buffer is cleared
- Log speech start with actual timestamp

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the break statement when speech ends to ensure all chunks in an audio packet are processed. This allows detection of new speech segments that may start after a previous segment ends within the same audio append operation. Refactored to collect VAD results first, then process state transitions to avoid borrow checker conflicts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.
```rust
);

// Calculate audio duration: 24kHz PCM16 = 48 bytes per ms
let chunk_duration_ms = (audio_data.len() / 48) as u32;
```
Copilot AI · Jan 24, 2026
The calculation assumes PCM16 (2 bytes per sample) at 24kHz, which would be 48,000 bytes per second or 48 bytes per millisecond. However, this only works for mono audio. If the audio is stereo, the calculation would be incorrect. The code should explicitly document the assumption about mono audio or handle both cases.
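The 48 bytes/ms figure follows from 24,000 samples/s × 2 bytes/sample ÷ 1000, which only holds for mono. A sketch that makes the channel count explicit (the function name and signature are hypothetical, not from this PR):

```rust
// Hypothetical generalization of the duration calculation.
// The PR's inline code implicitly assumes channels = 1 (mono).
fn pcm16_duration_ms(byte_len: usize, sample_rate_hz: usize, channels: usize) -> u32 {
    // 2 bytes per PCM16 sample, per channel
    let bytes_per_ms = sample_rate_hz * 2 * channels / 1000;
    (byte_len / bytes_per_ms) as u32
}
```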
```rust
    item_id: Uuid::new_v4().to_string(),
};
let _ = tx.send(event).await;
} else if !is_speech && session.triggered {
```
Copilot AI · Jan 24, 2026
The state transition from speech to non-speech triggers immediately on the first non-speech chunk. This may cause premature cutoff if there are brief pauses during speech. Consider implementing a debounce mechanism or silence threshold (e.g., requiring multiple consecutive non-speech chunks) before triggering the speech end event.
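One way such a debounce could look — a hedged sketch, with all names (SpeechEndDebouncer, on_chunk) hypothetical rather than taken from this PR:

```rust
// Hypothetical silence debouncer: require `threshold` consecutive
// non-speech chunks before declaring the end of a speech segment.
struct SpeechEndDebouncer {
    silence_chunks: u32,
    threshold: u32,
}

impl SpeechEndDebouncer {
    fn new(threshold: u32) -> Self {
        Self { silence_chunks: 0, threshold }
    }

    /// Feed one VAD result; returns true exactly once, when the
    /// silence run first reaches the threshold (speech end fires).
    fn on_chunk(&mut self, is_speech: bool) -> bool {
        if is_speech {
            self.silence_chunks = 0; // any speech resets the silence run
            false
        } else {
            self.silence_chunks += 1;
            self.silence_chunks == self.threshold
        }
    }
}
```

With a 32 ms VAD chunk, a threshold of around 8-16 chunks would tolerate pauses of roughly 250-500 ms; the right value is a tuning decision.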
```rust
session.triggered = true;
let event = ServerEvent::InputAudioBufferSpeechStarted {
    event_id: Uuid::new_v4().to_string(),
    audio_start_ms: session.audio_position_ms,
```
Copilot AI · Jan 24, 2026
The audio_start_ms uses the current audio_position_ms, but this represents the end of the current chunk. For accurate speech start timing, it should use the position at the beginning of the chunk where speech was first detected. Consider subtracting chunk_duration_ms or tracking the position before processing the chunk.
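The correction amounts to subtracting the current chunk's duration from the already-advanced position, clamped at zero (a sketch; the helper name is hypothetical):

```rust
// Hypothetical: recover the chunk's start time from the position
// measured after the chunk was appended. saturating_sub guards the
// first chunk, where position may be smaller than the chunk duration.
fn speech_start_ms(position_after_chunk_ms: u32, chunk_duration_ms: u32) -> u32 {
    position_after_chunk_ms.saturating_sub(chunk_duration_ms)
}
```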
```rust
let chunk_size = VadSession::vad_chunk_size();
let vad_events: Vec<bool> = samples_16k
    .chunks(chunk_size)
    .filter_map(|chunk| vad_session.detect(chunk).ok())
```
Copilot AI · Jan 24, 2026
Using filter_map with .ok() silently ignores VAD detection errors. If errors occur during processing, they won't be logged or handled, potentially causing missed speech detection. Consider logging errors or maintaining error state instead of silently discarding them.
Suggested change:
```diff
-.filter_map(|chunk| vad_session.detect(chunk).ok())
+.filter_map(|chunk| match vad_session.detect(chunk) {
+    Ok(is_speech) => Some(is_speech),
+    Err(e) => {
+        log::error!("VAD detection error: {}", e);
+        None
+    }
+})
```
```rust
let samples_24k: Vec<f32> = audio_data
    .chunks_exact(2)
    .map(|chunk| {
        i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
    })
    .collect();
```
Copilot AI · Jan 24, 2026
This PCM16-to-f32 conversion is duplicated in both the real-time processing (lines 471-476) and the commit validation (lines 689-692). Consider extracting this conversion logic into a helper function to reduce code duplication and improve maintainability.
```rust
    server_vad
);
session.input_audio_buffer.extend_from_slice(&audio_data);
session.audio_position_ms += chunk_duration_ms;
```
Copilot AI · Jan 24, 2026
The audio_position_ms is incremented after extending the buffer when triggered, but in the else branch (lines 464), it's also incremented. However, when the buffer is cleared (line 460), the audio_position_ms is reset to 0 (line 461) and then incremented by the current chunk duration (line 464). This means the position doesn't account for discarded audio before speech detection. The timing may be inaccurate if audio was buffered and then cleared.
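One way to avoid the rewind is to keep the stream position in a counter that only ever advances on append, independent of buffer clears — a hedged sketch, with AudioClock and its methods invented for illustration:

```rust
// Hypothetical: a monotonic stream clock, decoupled from the buffer.
// Clearing the input buffer never rewinds this timeline, so event
// timestamps stay consistent even when pre-speech audio is discarded.
struct AudioClock {
    position_ms: u32,
}

impl AudioClock {
    fn new() -> Self {
        Self { position_ms: 0 }
    }

    /// Advance by one appended chunk (24 kHz mono PCM16 = 48 bytes/ms).
    fn on_append(&mut self, byte_len: usize) {
        self.position_ms += (byte_len / 48) as u32;
    }
}
```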
```rust
// Process through VAD to check if there's any speech
let chunk_size = VadSession::vad_chunk_size();
let mut has_speech = false;
vad_session.reset_state();
```
Copilot AI · Jan 24, 2026
The VAD session state is reset before validating speech in the committed buffer. This reset occurs regardless of whether the real-time VAD is currently active (session.triggered). If speech was ongoing and commit is called, resetting the state could cause inconsistency between the real-time VAD state and the validation check. Consider only resetting when appropriate or maintaining separate VAD instances for real-time and validation.
Suggested change:
```diff
-vad_session.reset_state();
+// Only reset VAD state when real-time VAD is not currently triggered
+if !session.triggered {
+    vad_session.reset_state();
+}
```
Summary

- Replace deprecated HTTP-based vad_url and WebSocket-based vad_realtime_url with the built-in silero_vad_burn library
- Add VadSession to RealtimeSession for local VAD processing
- Remove VadRealtimeClient and VadRealtimeRx dependencies

Test plan
🤖 Generated with Claude Code