
Conversation

@juntao juntao commented Jan 24, 2026

Summary

  • Replace deprecated HTTP-based vad_url and WebSocket-based vad_realtime_url with built-in silero_vad_burn library
  • Add VadSession to RealtimeSession for local VAD processing
  • Process audio through silero VAD inline when receiving audio chunks
  • Detect speech start/end events locally instead of via remote service
  • Remove VadRealtimeClient and VadRealtimeRx dependencies

Test plan

  • Verify realtime WebSocket endpoint works with local VAD
  • Test speech detection triggers correctly
  • Verify ASR is called only when speech is detected

🤖 Generated with Claude Code

Replace deprecated HTTP-based vad_url and WebSocket-based
vad_realtime_url with the built-in silero_vad_burn library:

- Add VadSession to RealtimeSession for local VAD processing
- Process audio through silero VAD inline when receiving audio chunks
- Detect speech start/end events locally instead of via remote service
- Use local VAD in handle_audio_buffer_commit instead of HTTP call
- Remove VadRealtimeClient and VadRealtimeRx dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
codeprotapp bot commented Jan 24, 2026

CodeProt AI is analyzing the changes... ⏳


codeprotapp bot commented Jan 24, 2026

🤖 CodeProt AI - Free Version PR Summary

Overview
Migrated from external VAD realtime service to built-in Silero VAD model using the burn framework.
Simplified event handling by removing async VAD event channel and integrating VAD processing directly into audio chunk handling.

Technical Summary

  • Replaced VadRealtimeClient with VadSession for built-in Silero VAD
  • Removed external VAD service connection and event channel (realtime_rx)
  • Integrated VAD detection directly into audio input processing loop
  • Moved speech start/end detection from event-driven to synchronous processing
  • Simplified RealtimeEvent enum by removing VadEvent variant
  • Added VAD session initialization in handle_socket using burn's ndarray backend
  • Updated error messages and validation logic for VAD availability checks

Issues Found

  • Potential Memory Leak: SileroVAD6Model is created with Box::new() but ownership transfer to VadSession is unclear. If VadSession::new() fails, the boxed model may leak.
  • State Management Bug: vad_session.reset_state() is called in handle_audio_buffer_commit() after processing, but the same session continues to be used for streaming detection. This could cause state inconsistency if speech detection spans multiple commits.
  • Missing Error Handling: In the VAD detection loop within audio processing, errors from vad_session.detect() are silently ignored with if let Ok(). Failed detections should be logged or handled explicitly.
  • Logic Error: Speech end detection only triggers when session.triggered && !is_speech, but if VAD session is reset mid-stream, the triggered flag may not accurately reflect actual speech state.
  • Removed Unused Import: The Bytes import was removed and the code no longer uses it. This is correct cleanup, though it is worth double-checking the refactor for other leftover imports.
  • Hardcoded Backend: Using burn::backend::ndarray::NdArrayDevice::default() hardcodes the compute backend. Should be configurable for production use (GPU support, etc.).
  • Race Condition Risk: VAD session is wrapped in Option<VadSession> and accessed with .as_mut() without synchronization. If WebSocket handler is concurrent, this could cause data races.


@juntao juntao requested review from L-jasmine and Copilot January 24, 2026 03:08

Copilot AI left a comment


Pull request overview

This PR refactors the realtime WebSocket service to use built-in silero VAD instead of external HTTP and WebSocket-based VAD services.

Changes:

  • Replaced remote VAD service calls with local VadSession using silero_vad_burn library
  • Integrated inline VAD processing during audio chunk reception
  • Removed deprecated VadRealtimeClient and VadRealtimeRx dependencies


Comment on lines 97 to 103
let vad_session = match crate::ai::vad::VadSession::new(
&config.asr.vad,
Box::new(
silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
.expect("Failed to create silero VAD model"),
),
burn::backend::ndarray::NdArrayDevice::default(),

Copilot AI Jan 24, 2026


Using expect with a generic error message can crash the application without providing context. Consider using a more descriptive error message that includes potential causes (e.g., missing model files, insufficient memory) or propagating the error using ? to let the caller handle initialization failures gracefully.

Suggested change — before:

    let vad_session = match crate::ai::vad::VadSession::new(
        &config.asr.vad,
        Box::new(
            silero_vad_burn::SileroVAD6Model::new(&burn::backend::ndarray::NdArrayDevice::default())
                .expect("Failed to create silero VAD model"),
        ),
        burn::backend::ndarray::NdArrayDevice::default(),

Suggested change — after:

    let device = burn::backend::ndarray::NdArrayDevice::default();
    let vad_model = match silero_vad_burn::SileroVAD6Model::new(&device) {
        Ok(model) => model,
        Err(e) => {
            log::error!(
                "Failed to create silero VAD model: {}. \
                 Possible causes include missing or corrupted model files, \
                 incompatible hardware, or insufficient memory.",
                e
            );
            // Drop this WebSocket connection gracefully instead of panicking the server
            return;
        }
    };
    let vad_session = match crate::ai::vad::VadSession::new(
        &config.asr.vad,
        Box::new(vad_model),
        device,

Copilot uses AI. Check for mistakes.
    if let Ok(is_speech) = vad_session.detect(chunk) {
        if is_speech {
            speech_detected = true;
        } else if session.triggered && !is_speech {

Copilot AI Jan 24, 2026


The !is_speech check in the condition session.triggered && !is_speech is redundant: the else branch is only reached when is_speech is false. Restructure the condition, or handle speech-end detection across chunk boundaries explicitly.

Suggested change — before:

    } else if session.triggered && !is_speech {

Suggested change — after:

    } else if session.triggered {

Comment on lines 471 to 472
speech_detected = true;
} else if session.triggered && !is_speech {

Copilot AI Jan 24, 2026


The loop continues processing all chunks even after detecting speech. Once speech_detected is set to true (and !session.triggered is true), the code could break early to avoid unnecessary VAD processing of remaining audio chunks.

Suggested change — before:

        speech_detected = true;
    } else if session.triggered && !is_speech {

Suggested change — after:

        // Speech started; if we weren't already triggered, mark and stop processing
        if !session.triggered {
            speech_detected = true;
            break;
        }
    } else if session.triggered {

Comment on lines +678 to +685
    let chunk_size = VadSession::vad_chunk_size();
    let mut has_speech = false;
    vad_session.reset_state();
    for chunk in samples_16k.chunks(chunk_size) {
        if let Ok(is_speech) = vad_session.detect(chunk) {
            if is_speech {
                has_speech = true;
                break;

Copilot AI Jan 24, 2026


The VAD processing logic (conversion of 24kHz PCM16 to 16kHz f32, chunking, and detection) is duplicated between the inline audio processing (lines 454-500) and the commit handler (lines 668-688). Consider extracting this into a helper function to improve maintainability and ensure consistent behavior.

juntao and others added 2 commits January 24, 2026 06:09
Remove unnecessary !is_speech check in else-if branch since
is_speech is already false when we reach that branch.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace expect() with proper error handling to avoid panics.
Now gracefully handles initialization failures with descriptive
error messages indicating potential causes like missing model
files or insufficient memory.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@juntao juntao requested a review from Copilot January 24, 2026 06:11

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Comment on lines +462 to +467
    let samples_24k: Vec<f32> = audio_data
        .chunks_exact(2)
        .map(|chunk| {
            i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
        })
        .collect();

Copilot AI Jan 24, 2026


This audio conversion logic (24kHz PCM16 to f32) is duplicated in three locations: here (lines 462-467), in the handle_audio_buffer_commit function (lines 677-680), and appears twice within the same InputAudioBufferAppend handler. Consider extracting this into a helper function to improve maintainability and reduce duplication.

Comment on lines 487 to 490
    if let Some(vs) = session.vad_session.as_mut() {
        vs.reset_state();
    }
    break;

Copilot AI Jan 24, 2026


When speech ends (line 478-490), the code resets VAD state after committing the audio buffer. However, if multiple chunks remain in the same audio packet, the break statement prevents them from being processed. This could cause the VAD to miss subsequent speech segments in the same audio append operation, leading to incomplete speech detection.

Suggested change — before:

    if let Some(vs) = session.vad_session.as_mut() {
        vs.reset_state();
    }
    break;

Suggested change — after:

    // Reset VAD state so we can detect subsequent speech segments
    vad_session.reset_state();

    // Send speech started event
    let event = ServerEvent::InputAudioBufferSpeechStarted {
        event_id: Uuid::new_v4().to_string(),
        audio_start_ms: 0,

Copilot AI Jan 24, 2026


The audio_start_ms is hardcoded to 0, which doesn't reflect the actual timestamp when speech started. This should track the cumulative audio duration processed to provide accurate timing information for the InputAudioBufferSpeechStarted event.

juntao and others added 2 commits January 24, 2026 06:18
Add audio_position_ms field to RealtimeSession to track cumulative
audio duration. The InputAudioBufferSpeechStarted event now reports
the actual timestamp when speech was detected instead of hardcoded 0.

- Track audio position as chunks are appended (24kHz PCM16 = 48 bytes/ms)
- Reset position when buffer is cleared
- Log speech start with actual timestamp

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the break statement when speech ends to ensure all chunks
in an audio packet are processed. This allows detection of new
speech segments that may start after a previous segment ends
within the same audio append operation.

Refactored to collect VAD results first, then process state
transitions to avoid borrow checker conflicts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@juntao juntao requested a review from Copilot January 24, 2026 06:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.



);

// Calculate audio duration: 24kHz PCM16 = 48 bytes per ms
let chunk_duration_ms = (audio_data.len() / 48) as u32;

Copilot AI Jan 24, 2026


The calculation assumes PCM16 (2 bytes per sample) at 24kHz, which would be 48,000 bytes per second or 48 bytes per millisecond. However, this only works for mono audio. If the audio is stereo, the calculation would be incorrect. The code should explicitly document the assumption about mono audio or handle both cases.

        item_id: Uuid::new_v4().to_string(),
    };
    let _ = tx.send(event).await;
} else if !is_speech && session.triggered {

Copilot AI Jan 24, 2026


The state transition from speech to non-speech triggers immediately on the first non-speech chunk. This may cause premature cutoff if there are brief pauses during speech. Consider implementing a debounce mechanism or silence threshold (e.g., requiring multiple consecutive non-speech chunks) before triggering the speech end event.

    session.triggered = true;
    let event = ServerEvent::InputAudioBufferSpeechStarted {
        event_id: Uuid::new_v4().to_string(),
        audio_start_ms: session.audio_position_ms,

Copilot AI Jan 24, 2026


The audio_start_ms uses the current audio_position_ms, but this represents the end of the current chunk. For accurate speech start timing, it should use the position at the beginning of the chunk where speech was first detected. Consider subtracting chunk_duration_ms or tracking the position before processing the chunk.

    let chunk_size = VadSession::vad_chunk_size();
    let vad_events: Vec<bool> = samples_16k
        .chunks(chunk_size)
        .filter_map(|chunk| vad_session.detect(chunk).ok())

Copilot AI Jan 24, 2026


Using filter_map with .ok() silently ignores VAD detection errors. If errors occur during processing, they won't be logged or handled, potentially causing missed speech detection. Consider logging errors or maintaining error state instead of silently discarding them.

Suggested change — before:

    .filter_map(|chunk| vad_session.detect(chunk).ok())

Suggested change — after:

    .filter_map(|chunk| match vad_session.detect(chunk) {
        Ok(is_speech) => Some(is_speech),
        Err(e) => {
            log::error!("VAD detection error: {}", e);
            None
        }
    })

Comment on lines +471 to +476
    let samples_24k: Vec<f32> = audio_data
        .chunks_exact(2)
        .map(|chunk| {
            i16::from_le_bytes([chunk[0], chunk[1]]) as f32 / i16::MAX as f32
        })
        .collect();

Copilot AI Jan 24, 2026


This PCM16-to-f32 conversion is duplicated in both the real-time processing (lines 471-476) and the commit validation (lines 689-692). Consider extracting this conversion logic into a helper function to reduce code duplication and improve maintainability.

server_vad
);
session.input_audio_buffer.extend_from_slice(&audio_data);
session.audio_position_ms += chunk_duration_ms;

Copilot AI Jan 24, 2026


The audio_position_ms bookkeeping is inconsistent: when the buffer is cleared (line 460), the position is reset to 0 (line 461) and then incremented by only the current chunk duration (line 464), so audio discarded before speech detection is not accounted for. Timestamps may therefore be inaccurate whenever buffered audio is cleared mid-stream.

// Process through VAD to check if there's any speech
let chunk_size = VadSession::vad_chunk_size();
let mut has_speech = false;
vad_session.reset_state();

Copilot AI Jan 24, 2026


The VAD session state is reset before validating speech in the committed buffer. This reset occurs regardless of whether the real-time VAD is currently active (session.triggered). If speech was ongoing and commit is called, resetting the state could cause inconsistency between the real-time VAD state and the validation check. Consider only resetting when appropriate or maintaining separate VAD instances for real-time and validation.

Suggested change — before:

    vad_session.reset_state();

Suggested change — after:

    // Only reset VAD state when real-time VAD is not currently triggered
    if !session.triggered {
        vad_session.reset_state();
    }
