some question about prior_loss #96

Open
UestcJay opened this issue Aug 27, 2024 · 7 comments

@UestcJay

Thanks for your great work. Recently I have been using the hidden_state output from a large language model as the input to the Matcha-TTS encoder for training. I have overfit a single sample for tens of thousands of steps, but the loss is still very large; in particular the prior_loss has stayed between 1 and 2. Is there a solution to this problem?
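Roughly what I mean by using the hidden_state as encoder input (a minimal sketch with placeholder names, not my actual code; gpt2 stands in for the real LLM and 192 is just an assumed encoder channel size):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder LLM
llm = AutoModel.from_pretrained("gpt2").eval()
for p in llm.parameters():
    p.requires_grad = False  # the LLM stays frozen

text = "Hello world"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state  # (batch, seq_len, llm_dim)

# Project the LLM dimension down to the TTS encoder channels and feed this
# in place of the usual phoneme embeddings.
proj = torch.nn.Linear(hidden.size(-1), 192)
encoder_input = proj(hidden)
```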

@UestcJay
Author

What do you mean?

@shivammehta25
Owner

Is your LLM frozen or are you training any aspect of it?

@UestcJay
Author

Yes, I froze my LLM. I noticed that your input text is first converted into a phoneme sequence through the phonemizer library before being provided to the speech-synthesis model, whereas I directly use the hidden_state output by the LLM as input. Between the two, is the former easier to train? Have you ever tried discretizing text directly as input to the model?
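To make the comparison concrete, the two input paths I am contrasting look roughly like this (only a sketch; the phonemizer call mirrors a typical espeak setup, not necessarily the exact repo configuration):

```python
# Path A: text -> phoneme sequence via the phonemizer library (espeak backend),
# which is then mapped to symbol IDs for the text encoder.
from phonemizer import phonemize

text = "Printing, in the only sense with which we are at present concerned."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # a string of IPA symbols

# Path B: text -> frozen LLM -> continuous hidden states fed straight into the
# encoder (no phonemization step), as sketched in my first comment above.
```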

@shivammehta25
Owner

I think since your input is not text but already a representation that should capture the hidden nuances of phonemization, it should be fine. The mapping is definitely easier when the input is phonetised, but the model should still be able to learn. I am actually not sure why the prior loss is so high. Did you try listening to the outputs of the model; are they utter garbage? (The prior loss, being an MSE, can be a bit high sometimes.)
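For context, the prior loss in Grad-TTS/Matcha-TTS-style models is essentially a masked Gaussian negative log-likelihood between the aligned encoder output and the target mel, which behaves like an MSE plus a constant (a sketch with illustrative tensor names, not the exact repo code):

```python
import math
import torch

def prior_loss(mu_y, y, y_mask, n_feats):
    # mu_y, y: (batch, n_feats, T); y_mask: (batch, 1, T)
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi))
    return torch.sum(nll * y_mask) / (torch.sum(y_mask) * n_feats)
```

Note that 0.5 * log(2 * pi) is about 0.92, so even a perfect fit leaves roughly that much from the constant term; values in the 1-2 range are not automatically a failure signal on their own.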

@UestcJay
Author

Thanks for such a quick reply! I generated the inference results of the model; the ground-truth transcript is:
The Poveys ate all the fish they could and sometimes more than they enjoyed because on his sober days Hollins invariably started his round at the shop, and Constance had to buy for Maggie's sake .
This example is from the training set. The output of the model does not seem to be fully fitted... I trained Matcha-TTS on the same data using phonetised text, and it can fit. I also tried increasing the number of training epochs, but the gains were very small.
target.wav_and_model_output.wav.zip

@shivammehta25
Owner

Then I would have to believe that the hidden representations might not capture what is required to synthesise speech. I am not sure what would be an easy fix to this; perhaps train some part of the output embeddings using LoRA?
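Something along these lines with the peft library, for instance (just a sketch; the placeholder model and target_modules depend on which LLM you are using):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

llm = AutoModel.from_pretrained("gpt2")  # placeholder model
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; adjust per model
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the adapter weights are trainable
```

The idea would be to train these adapters jointly with the TTS model so the LLM representations can adapt toward what the decoder needs, while the bulk of the LLM stays frozen.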

@intexcor

Any ideas?
