
Conversation

@Veronika271
Collaborator

I used SPARC to write a script that shifts the pitch of the target audio to a specific pitch taken from the source audio and replaces the target audio's speaker embeddings with the average of the source audios' speaker embeddings.
I'm not yet sure how to test this beyond running it on an audio sample, which sounded reasonable.
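A minimal sketch of the embedding-averaging step described above, assuming speaker embeddings arrive as fixed-length NumPy vectors (the 192-dimension size and the L2 re-normalization are my assumptions for illustration, not SPARC specifics):

```python
import numpy as np

def average_speaker_embedding(embeddings: list[np.ndarray]) -> np.ndarray:
    """Average several source-speaker embeddings and re-normalize to unit length.

    Assumes all embeddings share the same dimensionality; the unit-norm
    convention is an assumption, not something SPARC requires.
    """
    mean = np.mean(np.stack(embeddings), axis=0)
    return mean / np.linalg.norm(mean)

# Toy example: five random "source" embeddings averaged into one target.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(192) for _ in range(5)]
avg = average_speaker_embedding(sources)  # shape (192,), unit norm
```

The averaged vector would then be substituted for the target audio's own speaker embedding before resynthesis.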
Thank you!
Veronika

@fabiocat93
Collaborator

thank you @Veronika271, I see some overlap with the existing code here: https://sensein.group/senselab/senselab/audio/tasks/voice_cloning.html

regarding the choice of target speaker embeddings, there are a few options beyond averaging. For instance, you could:

  • select an external target speaker,
  • swap identities within the same dataset (internal speaker conversion),
  • synthesize a new voice entirely.

Did you consider any of these alternatives before deciding to use the average of the source embeddings? If yes, what are your thoughts?
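The three strategies above could be sketched roughly as follows. All function names here are hypothetical, chosen only to illustrate the options; none of this is senselab's actual API:

```python
import numpy as np

def external_target(external_embedding: np.ndarray) -> np.ndarray:
    """Option 1: use a reference voice from outside the dataset as-is."""
    return external_embedding

def internal_swap(speaker_ids: list[str]) -> dict[str, str]:
    """Option 2: map every speaker to a *different* speaker in the same
    dataset (a simple rotation; requires at least two speakers)."""
    return dict(zip(speaker_ids, speaker_ids[1:] + speaker_ids[:1]))

def synthesize_voice(dim: int, seed: int = 0) -> np.ndarray:
    """Option 3: synthesize a new identity as a random unit vector in
    embedding space (a naive stand-in for a real voice-synthesis model)."""
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

mapping = internal_swap(["spk_a", "spk_b", "spk_c"])
new_voice = synthesize_voice(192)
```

Each option trades off differently: an external target reuses existing cloning machinery, an internal swap keeps voices within the dataset's distribution, and synthesis avoids reusing any real person's identity.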

@Veronika271
Collaborator Author

@fabiocat93 Thank you for the feedback! I used the average of the source embeddings because Satra suggested it during a meeting, but I see how selecting an external target speaker would let me reuse Senselab's existing voice-cloning capabilities instead. I think the choice of anonymization should depend on the size of the dataset and on what Senselab users want preserved in their anonymized samples, from naturalness to pathology biomarkers. Still, I'm happy to rewrite my code using internal speaker anonymization and Senselab's voice-cloning feature if that would be better!

@satra
Collaborator

satra commented May 2, 2025

i did suggest using target speaker embedding average to create a new speaker. but there are some details here that should be rethought.

  1. the code fabio pointed to is sufficient for an initial pass at voice cloning using sparc, so i would close this PR from that perspective. @Veronika271 - in the future it may be helpful to check through the code to see if something is already there. my bad in not checking either.
  2. it would be nice to consider an API that allows for different types of voice cloning:
  • pairwise (currently implemented in the API)
  • average speaker embedding as a way of creating a new speaker. this could come from multiple targets, or from the source itself.
  • pitch shifts or other changes such as temporal alterations (not clear which models are capable of doing this besides sparc and ppg).

the ppg code calls the last bits neural editing. perhaps we can have a separate api for neural editing that offers more fine-grained control over the changes and only supports models that can do them.

@ibevers
Collaborator

ibevers commented Jun 10, 2025

@satra @fabiocat93 @Veronika271 do we want to have any default target voices available through senselab so that users don't have to provide their own? This seems like it would be convenient and allow us to provide thoughtful target voice suggestions.

@fabiocat93
Collaborator

> @satra @fabiocat93 @Veronika271 do we want to have any default target voices available through senselab so that users don't have to provide their own? This seems like it would be convenient and allow us to provide thoughtful target voice suggestions.

This would go back to the original question, "How do you select the target voice?", which we don't have a clear answer to yet.

@ibevers
Collaborator

ibevers commented Jun 10, 2025

> @satra @fabiocat93 @Veronika271 do we want to have any default target voices available through senselab so that users don't have to provide their own? This seems like it would be convenient and allow us to provide thoughtful target voice suggestions.
>
> This would go back to the original question, "How do you select the target voice?", which we don't have a clear answer to, yet

@fabiocat93 assuming we did have an answer to that question, would we want to provide access to the target voice through senselab?

@fabiocat93
Collaborator

> @satra @fabiocat93 @Veronika271 do we want to have any default target voices available through senselab so that users don't have to provide their own? This seems like it would be convenient and allow us to provide thoughtful target voice suggestions.
>
> This would go back to the original question, "How do you select the target voice?", which we don't have a clear answer to, yet
>
> @fabiocat93 assuming we did have an answer to that question, would we want to provide access to the target voice through senselab?

If there were an ideal target voice, we could provide a pipeline for downloading the reference speaker embeddings or some of their audio samples as part of the senselab procedure. But 1) we haven't identified one, and 2) the ideal target voice will depend on the user's goal and use case (e.g., children vs. adults; do we care about emotions? about content in terms of words? about xxx?)

@satra
Collaborator

satra commented Jun 10, 2025

instead of providing data, provide access to downloading data+metadata as datasets. many ml packages do that. those datasets could then be used to provide targets if users wanted to.
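A torchvision-style registry is one common way to do this. The dataset name, URL, and checksum below are placeholders, not real senselab resources:

```python
import hashlib
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical registry of downloadable voice datasets. Entries map a
# dataset name to a URL and a content checksum; both values here are
# placeholders for illustration.
REGISTRY = {
    "reference-voices-v1": {
        "url": "https://example.org/reference-voices-v1.tar.gz",  # placeholder
        "sha256": "0" * 64,  # placeholder checksum
    }
}

def fetch(name: str, cache_dir: str = "~/.cache/senselab") -> Path:
    """Download a registered dataset into the cache (if absent), verify its
    checksum, and return the local path."""
    entry = REGISTRY[name]
    dest = Path(cache_dir).expanduser() / f"{name}.tar.gz"
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(entry["url"], dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            dest.unlink()
            raise ValueError(f"checksum mismatch for {name}")
    return dest
```

Users who wanted a default target voice could then fetch a curated dataset and pick (or average) speakers from it, without the library shipping audio directly.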

@fabiocat93 fabiocat93 marked this pull request as draft August 29, 2025 14:37
@ibevers ibevers added the help wanted Extra attention is needed label Nov 26, 2025