Conversation

@L-jasmine
Collaborator

No description provided.

@L-jasmine requested a review from Copilot on January 22, 2026, 19:16
Contributor

Copilot AI left a comment

Pull request overview

This PR implements streaming ASR (Automatic Speech Recognition) functionality, enabling real-time speech-to-text conversion with Voice Activity Detection (VAD). The changes introduce a new stream_asr mode that allows incremental ASR results to be sent to clients as speech is detected, rather than waiting for complete audio submission.

Changes:

  • Added streaming ASR support for both Whisper and Paraformer ASR backends with real-time VAD integration
  • Integrated Silero VAD model for server-side speech detection with configurable parameters
  • Refactored session handling to support new WebSocket commands for streaming audio and VAD events

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/util.rs | Added bidirectional audio format conversion utilities and RIFF tag validation |
| src/services/ws/stable/mod.rs | Added stream_asr flag and helper methods for sending ASR results and control messages |
| src/services/ws/stable/asr.rs | Implemented streaming ASR methods for Whisper and Paraformer with VAD integration |
| src/services/ws.rs | Added EndVad command and stream_asr parameter support |
| src/services/mod.rs | Added stream_asr parameter to connection query params |
| src/protocol.rs | Added EndVad server event |
| src/main.rs | Added /version endpoint |
| src/config.rs | Added SileroVadconfig for VAD parameters and updated WhisperASRConfig |
| src/ai/vad.rs | Implemented VadSession and VadFactory for Silero VAD integration |
| src/ai/mod.rs | Changed logging from messages array to last_message only |
| src/ai/bailian/realtime_asr.rs | Added semantic_punctuation_enabled parameter and streaming test |
| Cargo.toml | Added silero_vad_burn and burn dependencies |
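
The summary mentions RIFF tag validation and bidirectional audio format conversion in src/util.rs, but the PR excerpt does not show that code. As a rough illustration only, the sketch below checks the standard RIFF/WAVE header tags and converts between 16-bit little-endian PCM bytes and f32 samples. The function names and the choice of PCM16/f32 as the two formats are assumptions, not taken from the PR.

```rust
/// Returns true if `bytes` begins with a RIFF container header whose form
/// type is WAVE. Hypothetical helper; the PR's actual check is not shown.
pub fn looks_like_wav(bytes: &[u8]) -> bool {
    // Layout: 0..4 = "RIFF", 4..8 = little-endian chunk size, 8..12 = "WAVE".
    bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WAVE"
}

/// Converts 16-bit little-endian PCM bytes to f32 samples in [-1.0, 1.0).
/// A trailing odd byte, if any, is ignored.
pub fn pcm16_le_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
        .collect()
}

/// Converts f32 samples back to 16-bit little-endian PCM bytes,
/// clamping out-of-range values to [-1.0, 1.0] first.
pub fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    samples
        .iter()
        .flat_map(|s| ((s.clamp(-1.0, 1.0) * 32767.0) as i16).to_le_bytes())
        .collect()
}
```

Checking only the two tags is cheap enough to run on every incoming buffer; it does not validate the chunk size or the `fmt ` chunk.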

recv_audio_bytes += data.len();
if !recv_any_asr_result && recv_audio_bytes >= 16000 * 10 {
    log::warn!(
        "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",

Copilot AI Jan 22, 2026

The condition checks for 10 seconds of audio (16000 * 10 = 160,000 bytes) but the warning message states '30s'. Either the calculation should be 16000 * 30 or the message should say '10s' to match the actual logic.

Suggested change
- "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",
+ "`{}` paraformer asr received more than 10s audio without StartChat, starting automatically",

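Whether 160,000 bytes corresponds to 10 seconds depends on the sample width, which the excerpt does not show. The sketch below makes the arithmetic explicit, assuming mono PCM; the helper names are hypothetical. With 16-bit samples at 16 kHz, 160,000 bytes would be 5 seconds rather than 10, so it may be worth confirming the format before settling on either the threshold or the message.

```rust
/// Bytes needed to hold `seconds` of mono PCM audio.
pub const fn pcm_bytes_for(seconds: usize, sample_rate_hz: usize, bytes_per_sample: usize) -> usize {
    seconds * sample_rate_hz * bytes_per_sample
}

/// Whole seconds of mono PCM audio contained in `bytes`.
pub const fn pcm_seconds_in(bytes: usize, sample_rate_hz: usize, bytes_per_sample: usize) -> usize {
    bytes / (sample_rate_hz * bytes_per_sample)
}
```

Deriving the threshold from a named duration (for example `pcm_bytes_for(10, 16000, 2)`) keeps the condition and the log message from drifting apart.
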
src/config.rs Outdated
Comment on lines 246 to 259
pub struct SileroVadconfig {
    #[serde(default = "SileroVadconfig::default_threshold")]
    pub threshold: f32,
    #[serde(default = "SileroVadconfig::default_neg_threshold")]
    pub neg_threshold: Option<f32>,
    #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
    pub min_speech_duration_ms: usize,
    #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
    pub max_silence_duration_ms: usize,
    #[serde(default = "SileroVadconfig::hangover_ms")]
    pub hangover_ms: usize,
}

impl SileroVadconfig {

Copilot AI Jan 22, 2026

Type name has inconsistent casing. Should be SileroVadConfig (capital 'C') to follow Rust naming conventions for type names.

Suggested change
- pub struct SileroVadconfig {
-     #[serde(default = "SileroVadconfig::default_threshold")]
-     pub threshold: f32,
-     #[serde(default = "SileroVadconfig::default_neg_threshold")]
-     pub neg_threshold: Option<f32>,
-     #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
-     pub min_speech_duration_ms: usize,
-     #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
-     pub max_silence_duration_ms: usize,
-     #[serde(default = "SileroVadconfig::hangover_ms")]
-     pub hangover_ms: usize,
- }
- impl SileroVadconfig {
+ pub struct SileroVadConfig {
+     #[serde(default = "SileroVadConfig::default_threshold")]
+     pub threshold: f32,
+     #[serde(default = "SileroVadConfig::default_neg_threshold")]
+     pub neg_threshold: Option<f32>,
+     #[serde(default = "SileroVadConfig::default_min_speech_duration_ms")]
+     pub min_speech_duration_ms: usize,
+     #[serde(default = "SileroVadConfig::default_max_silence_duration_ms")]
+     pub max_silence_duration_ms: usize,
+     #[serde(default = "SileroVadConfig::hangover_ms")]
+     pub hangover_ms: usize,
+ }
+ impl SileroVadConfig {
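
The excerpt does not show how `threshold` and `neg_threshold` are consumed. A common pattern for a pair like this is hysteresis: speech starts when the probability rises to `threshold` and ends only when it falls below a lower exit threshold. The sketch below illustrates that pattern under stated assumptions: the struct is a cut-down stand-in rather than the PR's type, the fallback of `threshold - 0.15` when `neg_threshold` is `None` is an assumed convention, and the duration fields (`min_speech_duration_ms`, `max_silence_duration_ms`, `hangover_ms`) are not modeled.

```rust
/// Cut-down, hypothetical stand-in for the two threshold fields of the config.
#[derive(Debug, Clone, PartialEq)]
pub struct VadThresholds {
    pub threshold: f32,
    pub neg_threshold: Option<f32>,
}

impl VadThresholds {
    /// Exit threshold: the explicit value if set, otherwise an assumed
    /// fallback of `threshold - 0.15`, floored at 0.01.
    pub fn exit_threshold(&self) -> f32 {
        self.neg_threshold
            .unwrap_or((self.threshold - 0.15).max(0.01))
    }
}

/// Maps per-frame speech probabilities to speech/non-speech flags using
/// hysteresis: enter speech at `threshold`, leave below the exit threshold.
pub fn speech_flags(cfg: &VadThresholds, probs: &[f32]) -> Vec<bool> {
    let exit = cfg.exit_threshold();
    let mut in_speech = false;
    probs
        .iter()
        .map(|&p| {
            if !in_speech && p >= cfg.threshold {
                in_speech = true;
            } else if in_speech && p < exit {
                in_speech = false;
            }
            in_speech
        })
        .collect()
}
```

The gap between the two thresholds keeps a probability that hovers near the entry value from toggling the state on every frame.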

"streaming": "duplex"
},
"payload": {
"task_group": "audio",

Copilot AI Jan 22, 2026

The task_group field is added without explanation or documentation. Consider adding a comment explaining why this field is necessary and what impact it has on the ASR behavior.

    .map_err(|_| anyhow::anyhow!("audio_tx closed"))?;

if DEBUG_WAV {
    if debug_wav_data.len() > 0 {

Copilot AI Jan 22, 2026

Use !debug_wav_data.is_empty() instead of debug_wav_data.len() > 0 for better idiomatic Rust code.

Suggested change
- if debug_wav_data.len() > 0 {
+ if !debug_wav_data.is_empty() {
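
For context, the two forms behave identically on a `Vec` or slice; `is_empty()` simply states the intent directly, and Clippy's `len_zero` lint flags the `len() > 0` form. A minimal sketch, using a hypothetical helper name and assuming the buffer is a byte slice:

```rust
/// Returns true when there is buffered debug audio worth writing out.
/// Hypothetical helper; the buffer's element type is assumed to be `u8`.
pub fn has_debug_audio(debug_wav_data: &[u8]) -> bool {
    !debug_wav_data.is_empty()
}
```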

@L-jasmine merged commit 89b1913 into main on Jan 22, 2026
1 check passed