Add a new model: WavLM #1966
Conversation
Great work. I am looking forward to testing it. Four quick comments:
Thanks!
namespace models {
  ...
  struct WavLMOptions {
    // Maximum generation length.
Are we planning to use the WavLMOptions structure?
It is not referenced at the moment.
Hmm, in fact it is not used at the moment.
I tried microsoft/wavlm-large for the test case, which outputs the last hidden state alone. The structure may be useful when someone uses WavLM plus a linear layer (a language-model head) trained with CTC loss, which outputs tokens at the inference stage.
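To illustrate that use case, here is a minimal sketch of the greedy CTC decoding step (assumptions, not code from this PR: the per-frame logits are taken as already produced by a linear head on the last hidden state, and blank_id is the CTC blank index):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Greedy CTC decoding over per-frame logits produced by a linear head on the
// last hidden state: take the argmax per frame, collapse repeats, drop blanks.
// logits has shape [num_frames][vocab_size]; blank_id indexes the CTC blank.
std::vector<std::size_t> ctc_greedy_decode(
    const std::vector<std::vector<float>>& logits, std::size_t blank_id) {
  std::vector<std::size_t> tokens;
  std::size_t prev = blank_id;
  for (const auto& frame : logits) {
    const std::size_t best = std::distance(
        frame.begin(), std::max_element(frame.begin(), frame.end()));
    if (best != blank_id && best != prev)
      tokens.push_back(best);
    prev = best;
  }
  return tokens;
}
```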
Hi, @jordimas
Thanks a lot!
By the way, @jordimas, it would need some additional changes to fit those models. I'm wondering whether I should create model templates for each of them, or just change the configs and converters. Thank you for your attention.
Would it be possible to add one of these to the PR to see exactly what the problem looks like?
Sure.
Well, CTranslate2 already has a wav2vec2.0 codebase, which can run wav2vec2.0, MMS, parts of the omnilingual-asr models (the -SSL and -CTC branches), and HuBERT (which only differs in training strategy but has the same backbone model, to the best of my knowledge). However, WavLM has a gated relative position mechanism: the gated position bias is computed in the first attention layer from the pre-layer-norm hidden states. Once computed, the position bias is added to the attention scores just before the softmax (when computing the attention matrix), and it is then passed to the later attention layers without being recomputed.
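As a rough sketch of that mechanism (a single-head simplification with hypothetical names, not the PR's actual code), the bias is added to the raw Q·Kᵀ scores before the row-wise softmax:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified single-head attention step: add a (gated) relative position bias
// to the raw scores, then apply a row-wise softmax over the key dimension.
// scores and position_bias are both [query_len][key_len]; in WavLM the bias is
// computed once in the first layer and reused by the later layers.
void apply_position_bias_and_softmax(
    std::vector<std::vector<float>>& scores,
    const std::vector<std::vector<float>>& position_bias) {
  for (std::size_t i = 0; i < scores.size(); ++i) {
    for (std::size_t j = 0; j < scores[i].size(); ++j)
      scores[i][j] += position_bias[i][j];  // bias added before the softmax
    const float max_score =
        *std::max_element(scores[i].begin(), scores[i].end());
    float sum = 0.f;
    for (float& s : scores[i]) {
      s = std::exp(s - max_score);  // subtract the max for numerical stability
      sum += s;
    }
    for (float& s : scores[i])
      s /= sum;
  }
}
```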
The major changes compared to the wav2vec2.0 C++ codebase are in two files:
- src/layers/attention.cc, where I needed to modify the logic inside the dot_product_attention function;
- src/layers/wavlm.cc, where I needed to pass one additional object called position_bias.

I've tested the code by extracting the last hidden state and computing its cosine similarity with the output of the Hugging Face WavLM. The result is 1.0, so I think the logic of my codebase is correct.
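For reference, the parity check described above amounts to a plain cosine similarity between the two flattened hidden-state tensors (a generic sketch, not the actual test code from the PR):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two flattened hidden-state vectors; a value of
// ~1.0 means the CTranslate2 output matches the reference implementation.
float cosine_similarity(const std::vector<float>& a,
                        const std::vector<float>& b) {
  float dot = 0.f, norm_a = 0.f, norm_b = 0.f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    dot += a[i] * b[i];
    norm_a += a[i] * a[i];
    norm_b += b[i] * b[i];
  }
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b) + 1e-12f);
}
```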