Refactor realtime_ws to use built-in silero VAD #39
base: main
Conversation
Replace deprecated HTTP-based vad_url and WebSocket-based vad_realtime_url with the built-in silero_vad_burn library:

- Add VadSession to RealtimeSession for local VAD processing
- Process audio through silero VAD inline when receiving audio chunks
- Detect speech start/end events locally instead of via remote service
- Use local VAD in handle_audio_buffer_commit instead of HTTP call
- Remove VadRealtimeClient and VadRealtimeRx dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR refactors the realtime WebSocket service to use built-in silero VAD instead of external HTTP and WebSocket-based VAD services.
Changes:
- Replaced remote VAD service calls with a local VadSession using the silero_vad_burn library
- Integrated inline VAD processing during audio chunk reception
- Removed deprecated VadRealtimeClient and VadRealtimeRx dependencies
src/services/realtime_ws.rs
Outdated
```rust
let vad_session = match crate::ai::vad::VadSession::new(
    &config.asr.vad,
    Box::new(
        silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
            .expect("Failed to create silero VAD model"),
    ),
    burn::backend::ndarray::NdArrayDevice::default(),
```
Copilot AI · Jan 24, 2026
Using expect with a generic error message can crash the application without providing context. Consider using a more descriptive error message that includes potential causes (e.g., missing model files, insufficient memory) or propagating the error using ? to let the caller handle initialization failures gracefully.
Suggested change:
```diff
-let vad_session = match crate::ai::vad::VadSession::new(
-    &config.asr.vad,
-    Box::new(
-        silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
-            .expect("Failed to create silero VAD model"),
-    ),
-    burn::backend::ndarray::NdArrayDevice::default(),
+let device = burn::backend::ndarray::NdArrayDevice::default();
+let vad_model = match silero_vad_burn::SileroVAD6Model::new(&device) {
+    Ok(model) => model,
+    Err(e) => {
+        log::error!(
+            "Failed to create silero VAD model: {}. \
+             Possible causes include missing or corrupted model files, \
+             incompatible hardware, or insufficient memory.",
+            e
+        );
+        // Drop this WebSocket connection gracefully instead of panicking the server
+        return;
+    }
+};
+let vad_session = match crate::ai::vad::VadSession::new(
+    &config.asr.vad,
+    Box::new(vad_model),
+    device,
```
src/services/realtime_ws.rs
Outdated
```rust
if let Ok(is_speech) = vad_session.detect(chunk) {
    if is_speech {
        speech_detected = true;
    } else if session.triggered && !is_speech {
```
Copilot AI · Jan 24, 2026
The !is_speech test in session.triggered && !is_speech is redundant: this else-if branch is only reached when is_speech is already false, so the extra check can never change the outcome. Simplify the branch to else if session.triggered, or restructure the loop if the intent is to detect speech-end events across chunk boundaries.
Suggested change:
```diff
-} else if session.triggered && !is_speech {
+} else if session.triggered {
```
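To make the reachable state transitions explicit, the per-chunk branch can be sketched as a small pure function (an illustrative restructuring, not code from this PR; the name classify is hypothetical):

```rust
// Illustrative: in the `else` arm, `is_speech` is already known to be false,
// so re-testing `!is_speech` there is dead code. Only three outcomes exist.
fn classify(triggered: bool, is_speech: bool) -> Option<&'static str> {
    if is_speech {
        Some("speech") // may start or continue a segment
    } else if triggered {
        Some("speech_end") // silence while a segment was active
    } else {
        None // silence outside any segment
    }
}
```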
src/services/realtime_ws.rs
Outdated
```rust
        speech_detected = true;
    } else if session.triggered && !is_speech {
```
Copilot AI · Jan 24, 2026
The loop continues processing all chunks even after detecting speech. Once speech_detected is set to true (and !session.triggered is true), the code could break early to avoid unnecessary VAD processing of remaining audio chunks.
Suggested change:
```diff
-    speech_detected = true;
-} else if session.triggered && !is_speech {
+    // Speech started; if we weren't already triggered, mark and stop processing
+    if !session.triggered {
+        speech_detected = true;
+        break;
+    }
+} else if session.triggered {
```
```rust
let chunk_size = VadSession::vad_chunk_size();
let mut has_speech = false;
vad_session.reset_state();
for chunk in samples_16k.chunks(chunk_size) {
    if let Ok(is_speech) = vad_session.detect(chunk) {
        if is_speech {
            has_speech = true;
            break;
```
Copilot AI · Jan 24, 2026
The VAD processing logic (conversion of 24kHz PCM16 to 16kHz f32, chunking, and detection) is duplicated between the inline audio processing (lines 454-500) and the commit handler (lines 668-688). Consider extracting this into a helper function to improve maintainability and ensure consistent behavior.
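The duplicated PCM16-to-f32 conversion could be factored out along these lines (a hedged sketch; the helper name pcm16_le_to_f32 and its placement are assumptions, not code from this PR):

```rust
// Hypothetical helper extracting the duplicated conversion:
// little-endian PCM16 bytes -> normalized f32 samples in roughly [-1.0, 1.0].
fn pcm16_le_to_f32(audio_data: &[u8]) -> Vec<f32> {
    audio_data
        .chunks_exact(2) // one i16 sample per 2 bytes; a trailing odd byte is dropped
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32)
        .collect()
}
```

Both the inline append handler and handle_audio_buffer_commit could then call this one function, keeping the normalization factor consistent.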
Remove unnecessary !is_speech check in else-if branch since is_speech is already false when we reach that branch. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace expect() with proper error handling to avoid panics. Now gracefully handles initialization failures with descriptive error messages indicating potential causes like missing model files or insufficient memory. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
```rust
let samples_24k: Vec<f32> = audio_data
    .chunks_exact(2)
    .map(|chunk| {
        i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
    })
    .collect();
```
Copilot AI · Jan 24, 2026
This audio conversion logic (24kHz PCM16 to f32) is duplicated in three locations: here (lines 462-467), in the handle_audio_buffer_commit function (lines 677-680), and appears twice within the same InputAudioBufferAppend handler. Consider extracting this into a helper function to improve maintainability and reduce duplication.
src/services/realtime_ws.rs
Outdated
```rust
if let Some(vs) = session.vad_session.as_mut() {
    vs.reset_state();
}
break;
```
Copilot AI · Jan 24, 2026
When speech ends (line 478-490), the code resets VAD state after committing the audio buffer. However, if multiple chunks remain in the same audio packet, the break statement prevents them from being processed. This could cause the VAD to miss subsequent speech segments in the same audio append operation, leading to incomplete speech detection.
Suggested change:
```diff
-if let Some(vs) = session.vad_session.as_mut() {
-    vs.reset_state();
-}
-break;
+// Reset VAD state so we can detect subsequent speech segments
+vad_session.reset_state();
```
src/services/realtime_ws.rs
Outdated
```rust
// Send speech started event
let event = ServerEvent::InputAudioBufferSpeechStarted {
    event_id: Uuid::new_v4().to_string(),
    audio_start_ms: 0,
```
Copilot AI · Jan 24, 2026
The audio_start_ms is hardcoded to 0, which doesn't reflect the actual timestamp when speech started. This should track the cumulative audio duration processed to provide accurate timing information for the InputAudioBufferSpeechStarted event.
Add audio_position_ms field to RealtimeSession to track cumulative audio duration. The InputAudioBufferSpeechStarted event now reports the actual timestamp when speech was detected instead of a hardcoded 0.

- Track audio position as chunks are appended (24kHz PCM16 = 48 bytes/ms)
- Reset position when buffer is cleared
- Log speech start with actual timestamp

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the break statement when speech ends to ensure all chunks in an audio packet are processed. This allows detection of new speech segments that may start after a previous segment ends within the same audio append operation. Refactored to collect VAD results first, then process state transitions to avoid borrow checker conflicts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.
```rust
);

// Calculate audio duration: 24kHz PCM16 = 48 bytes per ms
let chunk_duration_ms = (audio_data.len() / 48) as u32;
```
Copilot AI · Jan 24, 2026
The calculation assumes PCM16 (2 bytes per sample) at 24kHz, which would be 48,000 bytes per second or 48 bytes per millisecond. However, this only works for mono audio. If the audio is stereo, the calculation would be incorrect. The code should explicitly document the assumption about mono audio or handle both cases.
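The 48 bytes/ms figure follows from 24,000 samples/s × 2 bytes/sample ÷ 1000, which only holds for mono. A sketch that makes the channel count explicit (the function name and signature are hypothetical, not from this PR):

```rust
// Hypothetical generalization of the duration calculation.
// The PR's inline code implicitly assumes channels = 1 (mono).
fn pcm16_duration_ms(byte_len: usize, sample_rate_hz: usize, channels: usize) -> u32 {
    // 2 bytes per PCM16 sample, per channel
    let bytes_per_ms = sample_rate_hz * 2 * channels / 1000;
    (byte_len / bytes_per_ms) as u32
}
```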
```rust
    item_id: Uuid::new_v4().to_string(),
};
let _ = tx.send(event).await;
} else if !is_speech && session.triggered {
```
Copilot AI · Jan 24, 2026
The state transition from speech to non-speech triggers immediately on the first non-speech chunk. This may cause premature cutoff if there are brief pauses during speech. Consider implementing a debounce mechanism or silence threshold (e.g., requiring multiple consecutive non-speech chunks) before triggering the speech end event.
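One way such a debounce could look — a hedged sketch, with all names (SpeechEndDebouncer, on_chunk) hypothetical rather than taken from this PR:

```rust
// Hypothetical silence debouncer: require `threshold` consecutive
// non-speech chunks before declaring the end of a speech segment.
struct SpeechEndDebouncer {
    silence_chunks: u32,
    threshold: u32,
}

impl SpeechEndDebouncer {
    fn new(threshold: u32) -> Self {
        Self { silence_chunks: 0, threshold }
    }

    /// Feed one VAD result; returns true exactly once, when the
    /// silence run first reaches the threshold (speech end fires).
    fn on_chunk(&mut self, is_speech: bool) -> bool {
        if is_speech {
            self.silence_chunks = 0; // any speech resets the silence run
            false
        } else {
            self.silence_chunks += 1;
            self.silence_chunks == self.threshold
        }
    }
}
```

With a 32 ms VAD chunk, a threshold of around 8-16 chunks would tolerate pauses of roughly 250-500 ms; the right value is a tuning decision.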
```rust
session.triggered = true;
let event = ServerEvent::InputAudioBufferSpeechStarted {
    event_id: Uuid::new_v4().to_string(),
    audio_start_ms: session.audio_position_ms,
```
Copilot AI · Jan 24, 2026
The audio_start_ms uses the current audio_position_ms, but this represents the end of the current chunk. For accurate speech start timing, it should use the position at the beginning of the chunk where speech was first detected. Consider subtracting chunk_duration_ms or tracking the position before processing the chunk.
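The correction amounts to subtracting the current chunk's duration from the already-advanced position, clamped at zero (a sketch; the helper name is hypothetical):

```rust
// Hypothetical: recover the chunk's start time from the position
// measured after the chunk was appended. saturating_sub guards the
// first chunk, where position may be smaller than the chunk duration.
fn speech_start_ms(position_after_chunk_ms: u32, chunk_duration_ms: u32) -> u32 {
    position_after_chunk_ms.saturating_sub(chunk_duration_ms)
}
```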
```rust
let chunk_size = VadSession::vad_chunk_size();
let vad_events: Vec<bool> = samples_16k
    .chunks(chunk_size)
    .filter_map(|chunk| vad_session.detect(chunk).ok())
```
Copilot AI · Jan 24, 2026
Using filter_map with .ok() silently ignores VAD detection errors. If errors occur during processing, they won't be logged or handled, potentially causing missed speech detection. Consider logging errors or maintaining error state instead of silently discarding them.
Suggested change:
```diff
-.filter_map(|chunk| vad_session.detect(chunk).ok())
+.filter_map(|chunk| match vad_session.detect(chunk) {
+    Ok(is_speech) => Some(is_speech),
+    Err(e) => {
+        log::error!("VAD detection error: {}", e);
+        None
+    }
+})
```
```rust
let samples_24k: Vec<f32> = audio_data
    .chunks_exact(2)
    .map(|chunk| {
        i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
    })
    .collect();
```
Copilot AI · Jan 24, 2026
This PCM16-to-f32 conversion is duplicated in both the real-time processing (lines 471-476) and the commit validation (lines 689-692). Consider extracting this conversion logic into a helper function to reduce code duplication and improve maintainability.
```rust
    server_vad
);
session.input_audio_buffer.extend_from_slice(&audio_data);
session.audio_position_ms += chunk_duration_ms;
```
Copilot AI · Jan 24, 2026
The audio_position_ms is incremented after extending the buffer when triggered, but in the else branch (lines 464), it's also incremented. However, when the buffer is cleared (line 460), the audio_position_ms is reset to 0 (line 461) and then incremented by the current chunk duration (line 464). This means the position doesn't account for discarded audio before speech detection. The timing may be inaccurate if audio was buffered and then cleared.
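One way to avoid the rewind is to keep the stream position in a counter that only ever advances on append, independent of buffer clears — a hedged sketch, with AudioClock and its methods invented for illustration:

```rust
// Hypothetical: a monotonic stream clock, decoupled from the buffer.
// Clearing the input buffer never rewinds this timeline, so event
// timestamps stay consistent even when pre-speech audio is discarded.
struct AudioClock {
    position_ms: u32,
}

impl AudioClock {
    fn new() -> Self {
        Self { position_ms: 0 }
    }

    /// Advance by one appended chunk (24 kHz mono PCM16 = 48 bytes/ms).
    fn on_append(&mut self, byte_len: usize) {
        self.position_ms += (byte_len / 48) as u32;
    }
}
```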
```rust
// Process through VAD to check if there's any speech
let chunk_size = VadSession::vad_chunk_size();
let mut has_speech = false;
vad_session.reset_state();
```
Copilot AI · Jan 24, 2026
The VAD session state is reset before validating speech in the committed buffer. This reset occurs regardless of whether the real-time VAD is currently active (session.triggered). If speech was ongoing and commit is called, resetting the state could cause inconsistency between the real-time VAD state and the validation check. Consider only resetting when appropriate or maintaining separate VAD instances for real-time and validation.
Suggested change:
```diff
-vad_session.reset_state();
+// Only reset VAD state when real-time VAD is not currently triggered
+if !session.triggered {
+    vad_session.reset_state();
+}
```
Summary

- Replace deprecated HTTP-based vad_url and WebSocket-based vad_realtime_url with the built-in silero_vad_burn library
- Add VadSession to RealtimeSession for local VAD processing
- Remove VadRealtimeClient and VadRealtimeRx dependencies

Test plan
🤖 Generated with Claude Code