some question about prior_loss #96

Open
UestcJay opened this issue Aug 27, 2024 · 7 comments

@UestcJay

Thanks for your great work. Recently I have been using the hidden_state output from a large language model as the input to the Matcha-TTS encoder for training. I have overfit a single sample for tens of thousands of steps, but the loss is still very large; in particular the prior_loss has stayed between 1 and 2. Is there a solution to this problem?
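Roughly what I mean by using the hidden_state as encoder input (a minimal sketch with placeholder names, not my actual code; gpt2 stands in for the real LLM and 192 is just an assumed encoder channel size):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder LLM
llm = AutoModel.from_pretrained("gpt2").eval()
for p in llm.parameters():
    p.requires_grad = False  # the LLM stays frozen

text = "Hello world"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state  # (batch, seq_len, llm_dim)

# Project the LLM dimension down to the TTS encoder channels and feed this
# in place of the usual phoneme embeddings.
proj = torch.nn.Linear(hidden.size(-1), 192)
encoder_input = proj(hidden)
```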

@UestcJay
Author

What do you mean?

@shivammehta25
Owner

Is your LLM frozen or are you training any aspect of it?

@UestcJay
Author

Yes, I froze my LLM. I noticed that your input text is first converted into a phoneme sequence through the phonemizer library before being provided to the speech-synthesis model, whereas I directly use the hidden_state output by the LLM as input. Between the two, is the former easier to train? Have you ever tried discretizing text directly as input to the model?
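To make the comparison concrete, the two input paths I am contrasting look roughly like this (only a sketch; the phonemizer call mirrors a typical espeak setup, not necessarily the exact repo configuration):

```python
# Path A: text -> phoneme sequence via the phonemizer library (espeak backend),
# which is then mapped to symbol IDs for the text encoder.
from phonemizer import phonemize

text = "Printing, in the only sense with which we are at present concerned."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # a string of IPA symbols

# Path B: text -> frozen LLM -> continuous hidden states fed straight into the
# encoder (no phonemization step), as sketched in my first comment above.
```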

@shivammehta25
Owner

I think since your input is not text but already a representation that should capture the hidden nuances of phonemization, it should be fine. The mapping is definitely easier when the input is phonetised, but the model should still be able to learn. I am actually not sure why the prior loss is so high. Did you try listening to the outputs of the model; are they utter garbage? (The prior loss, being an MSE, can be a bit high sometimes.)
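For context, the prior loss in Grad-TTS/Matcha-TTS-style models is essentially a masked Gaussian negative log-likelihood between the aligned encoder output and the target mel, which behaves like an MSE plus a constant (a sketch with illustrative tensor names, not the exact repo code):

```python
import math
import torch

def prior_loss(mu_y, y, y_mask, n_feats):
    # mu_y, y: (batch, n_feats, T); y_mask: (batch, 1, T)
    nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi))
    return torch.sum(nll * y_mask) / (torch.sum(y_mask) * n_feats)
```

Note that 0.5 * log(2 * pi) is about 0.92, so even a perfect fit leaves roughly that much from the constant term; values in the 1-2 range are not automatically a failure signal on their own.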

@UestcJay
Author

Thanks for such a quick reply! I generated the inference results of the model; the ground-truth transcript is:
The Poveys ate all the fish they could and sometimes more than they enjoyed because on his sober days Hollins invariably started his round at the shop, and Constance had to buy for Maggie's sake .
This example is from the training set. The output of the model does not seem to be fully fitted... I trained Matcha-TTS on the same data using phonetised text, and it can fit. I also tried increasing the number of training epochs, but the gains were very small.
target.wav_and_model_output.wav.zip

@shivammehta25
Owner

Then I would have to believe that the hidden representations might not capture what is required to synthesise speech. I am not sure what would be an easy fix to this; perhaps train some part of the output embeddings using LoRA?
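Something along these lines with the peft library, for instance (just a sketch; the placeholder model and target_modules depend on which LLM you are using):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

llm = AutoModel.from_pretrained("gpt2")  # placeholder model
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; adjust per model
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the adapter weights are trainable
```

The idea would be to train these adapters jointly with the TTS model so the LLM representations can adapt toward what the decoder needs, while the bulk of the LLM stays frozen.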

@intexcor

Any ideas?
