-
Hello @dscripka, did you by chance find the time to take a look at this topic? I think the openWakeWord training would benefit a lot from using a custom, deterministic randomization like I tried above, instead of asking the PyTorch data loader to shuffle everything, which leads to considerably different training results each time you run the training. I still could not find out what I'm doing wrong in the code above, so I would much appreciate your opinion/support on this.

Best regards and thanks a lot
-
Hey @ab-tools, apologies for the late reply. It's difficult to tell what could be the issue with your code above. Generally, though, if your loss/recall curves look like that, the issue is usually with the data preparation.

Ultimately, I'm not sure that shuffling of the data is the root cause of the training instabilities: if you train for a sufficient number of epochs, each data point will be seen multiple times by the model, averaging out any ordering effects. In my tests, the dominant factor for reproducibility seems to be just the network initialization. Have you tried setting the seed for PyTorch, as discussed here: https://discuss.pytorch.org/t/reproducibility-dataloader-shuffle-true-using-seeds/173836?

Additionally, another factor to consider is the distribution of labels in each batch. I see in your code above that you have a batch size of 1024, but depending on how many positive and negative examples you have, there can be a wide range of class ratios per batch. I find that you need at least 30 positive examples per batch for the model to train well, but this varies substantially depending on the particular data you are using.

I'm still actively working on an entirely new training process (hopefully ready to release in the next ~1 week) that automates a wide range of these steps, which should significantly reduce the amount of effort needed to train new models. In my tests thus far, this doesn't eliminate the variability across training runs, but it does seem to reduce it to the point where it may not matter in practice.
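For reference, a minimal seeding sketch along those lines (illustrative only; which of these calls actually matters depends on the rest of the training setup):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0):
    # Seed the Python, NumPy, and PyTorch RNGs so weight initialization is repeatable
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optionally trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)

# A seeded generator also makes the DataLoader's shuffling itself repeatable
g = torch.Generator()
g.manual_seed(0)
# loader = torch.utils.data.DataLoader(dataset, batch_size=1024, shuffle=True, generator=g)
```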
-
Thanks, @dscripka, I much appreciate you taking the time to look into this! If you say that you will release your new training notebook within just a week anyway, and that it also reduces variability across training runs, I think I'm just going to wait for it and try again then. It would be great if you could post a reply here as soon as I can give it a try - looking forward to that!

Best regards and thanks a lot for all your efforts
-
Hello @dscripka, let me first emphasize again how much I appreciate your effort on this, and also that you took the time to reply to me here again!

I see that you reduced the standard clip size from 3 to 2 seconds, which I think is a good idea, as most wake words are not that long. I also think it's a great idea to provide prepared sets for negative training and false-positive-rate estimation, as these can surely be used in the same way for almost all models. I do realize, though, that we still need negative audio samples for augmenting the auto-generated synthetic positive samples, and I couldn't hold back from at least trying this part (downloading/preparing the negative samples for augmentation) of your new script directly. ;-) Two remarks about this part:
You did a great job preparing your new notebook for "full-scale training" and not just a small test bench, making it highly configurable with the YAML file and thus allowing a "small test training" (which is relatively fast) before performing a "full-scale training", which takes a long time. I do think, however, it's a pity that you did not take the same approach for the negative audio samples: personally, I would suggest handling this step like the rest of your notebook and making it configurable in the YAML file as well.
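As a rough illustration of what I mean (the keys below are made up for the example, not part of the actual config file):

```python
import yaml

# Hypothetical keys -- not the real openWakeWord YAML schema
with open("training_config.yaml") as f:
    config = yaml.safe_load(f)

if config.get("negative_data_scale", "small") == "full":
    # download and convert the complete negative/augmentation datasets
    datasets_to_fetch = config.get("negative_datasets_full", [])
else:
    # stick to the small test subset for a quick trial run
    datasets_to_fetch = config.get("negative_datasets_small", [])
```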
I noticed that the generated 16k WAV audio samples use a pretty strange/big WAV audio format, and I wonder if that is required. With your "old" notebook the WAV audio samples were converted with SoX/FFmpeg, I think, and the result was codec "PCM Audio", 16000 Hz, 1 channel at 256 kb/s. With your new notebook, however, I get codec "0x0003 (IEEE FLOAT)", 16000 Hz, 1 channel at 1024 kb/s - which leads to WAV files four times the size. For the small test set this is surely no issue, but when downloading the full audio data set I would prefer not to quadruple the file size, if possible. ;-) For reference, to check the WAV audio format used, I use this little tool here:

Again, thanks a lot for your efforts,
David
-
@ab-tools thank you for taking a look, and for the feedback, this is very helpful!

The clip size (

And you are correct about the WAV format, that was a mistake! 16-bit PCM is the desired format; I forgot to make the conversion prior to saving. Full 64-bit float is not necessary and does take up too much space. I have updated the notebook accordingly for the PR.

As for 'full-scale' vs. 'small-test' training, the trade-offs here are difficult. Full downloads of all of the datasets mentioned will take tens of hours to days depending on download speed, require potentially hundreds of GBs of hard-drive space, and, depending on hardware, could take several days to process. Thus, I didn't want to make that the default behavior. The full datasets are available to download from the exact same links, and require only very minor changes to the example downloading/processing code. Additionally, depending on the needs of the target deployment environment, some datasets may not even be necessary (e.g., if the user does not intend for the model to handle background music, downloading and using the FMA dataset can be skipped entirely). But I take your point that some users will want to use as much data as possible when training; perhaps I can make a separate (optional) download script for this purpose.
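For anyone who already generated clips in the float format, re-saving them as 16-bit PCM is straightforward, e.g. with the soundfile package (just a sketch, not the notebook's actual code; the path is a placeholder):

```python
import soundfile as sf

# Check the on-disk sample format ('DOUBLE'/'FLOAT' vs. 'PCM_16')
print(sf.info("positive_clip.wav").subtype)

# Re-save as 16-bit PCM; soundfile converts the float samples on write
data, sr = sf.read("positive_clip.wav")
sf.write("positive_clip.wav", data, sr, subtype="PCM_16")
```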
-
Perfect, thanks for quickly fixing the WAV format, @dscripka!

Regarding the "full-scale" dataset, my idea would be to create a common baseline for training: when your script provides a clear baseline - e.g. also which concrete negative audio data should be used for augmentation, because this worked well in your experience - others can train with exactly the same set as you did, we can collect our results together, and perhaps we can come up with an even better combination in the end. But this only works if we all start from the same set as a baseline. It's totally fine, of course, if you provide this "full-scale" training part in a separate script.

By the way, as I understand it there is no need to split between training and validation for the negative audio augmentation samples, so I would suggest simply downloading the FMA archive manually and extracting/converting everything, instead of using the "rudraml/fma" dataset, which splits things into "training" and "validation".

Thanks again
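PS: if someone does stay with the "rudraml/fma" route for convenience, the splits can simply be merged; the config name below is an assumption, so please double-check the dataset's hub page:

```python
from datasets import load_dataset, concatenate_datasets

# "small" is an assumed config name for the rudraml/fma dataset
fma = load_dataset("rudraml/fma", name="small")

# For augmentation-only negatives a train/validation split isn't needed,
# so just merge whatever splits the dataset ships with
all_clips = concatenate_datasets([fma[split] for split in fma.keys()])
print(len(all_clips))
```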
-
Hello,
as per the discussion here I've set `shuffle = False` in the data loader for training, in an attempt to get more reproducible results. But with all negative samples at the start and all positive samples at the end of each epoch, the results were really bad.
Therefore, I've implemented a custom randomization of the training data which is done only once before training to keep it reproducible:
This seems to work, but it looks like I still have a mistake in there, as prediction values are not between `0` and `1` as expected, but between `0.000` and `0.001` instead. Maybe you have an idea, @dscripka, what I'm doing wrong here?

Below is the full Python code I'm using. Let's start with loading the training data and generating the negative and positive feature sets in a first Python session:
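In outline, the first session does the following (simplified sketch only; the dummy arrays just stand in for the real openWakeWord feature sets, and the fixed seed is what makes the one-time shuffle reproducible):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed -> the same shuffle on every run

# Dummy arrays standing in for the real feature sets computed from the audio clips
# (example shape: n_clips x n_frames x n_feature_dims)
negative_features = rng.standard_normal((5000, 16, 96)).astype(np.float32)
positive_features = rng.standard_normal((500, 16, 96)).astype(np.float32)

# Shuffle each set once, deterministically, before writing it to disk
negative_features = negative_features[rng.permutation(len(negative_features))]
positive_features = positive_features[rng.permutation(len(positive_features))]

np.save("negative_features_random.npy", negative_features)
np.save("positive_features_random.npy", positive_features)
```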
The result of this is two NPY files, `negative_features_random.npy` and `positive_features_random.npy`, containing the feature sets accordingly.

In the script above the negative and positive samples are internally already in random order; now we just need to make sure they are also mixed during the actual training. I'm therefore preparing the `X` and `Y` training data source sets (as per the original training notebook) in randomized order like this:
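Conceptually this boils down to concatenating the two feature sets, attaching the labels, and shuffling the combined arrays once with a fixed seed (simplified sketch; the output file names are just placeholders):

```python
import numpy as np

negative_features = np.load("negative_features_random.npy")
positive_features = np.load("positive_features_random.npy")

X = np.concatenate([negative_features, positive_features])
Y = np.concatenate([np.zeros(len(negative_features)),  # label 0 = negative
                    np.ones(len(positive_features))])  # label 1 = positive

# One deterministic shuffle so every batch contains a mix of both classes
order = np.random.default_rng(seed=42).permutation(len(X))
X, Y = X[order], Y[order]

np.save("X_random.npy", X)
np.save("Y_random.npy", Y)
```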
Now we have the `X` and `Y` training feature sets prepared and can just load them for actual training:
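Loading the prepared arrays and feeding them to the data loader with `shuffle = False` then looks roughly like this (sketch; the model definition and training loop themselves follow the original training notebook and are omitted here):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.from_numpy(np.load("X_random.npy")).float()
Y = torch.from_numpy(np.load("Y_random.npy")).float()

# The order is already randomized on disk, so no shuffling in the loader
train_loader = DataLoader(TensorDataset(X, Y), batch_size=1024, shuffle=False)
```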
After training finished I'm plotting the graph, and as you can already see here, I need to use `0, 0.001` as the range for the y axis:

The result looks like this:
And just to compare, without my "manual randomization" and leaving `shuffle = True` in the data loader, the loss/recall plot looks like this:

So it seems that something is clearly wrong (like inverted?) with the training feature sets, but I can't really figure out what I'm doing wrong - must be something super simple, as so often. ;-)
Interestingly though, the model does still work "OK". Let's load the exported model again for a test:
And test again a single positive sample:
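In essence the test looks like this (sketch; the model path is a placeholder, and it assumes the shared openWakeWord feature-extraction models are already downloaded):

```python
import soundfile as sf
from openwakeword.model import Model

oww = Model(wakeword_models=["my_model.tflite"])  # placeholder path to the exported model

# 16 kHz, 16-bit mono test clip containing a single positive example
audio, sr = sf.read("test_positive.wav", dtype="int16")

scores = []
for i in range(0, len(audio) - 1280, 1280):      # 80 ms frames at 16 kHz
    prediction = oww.predict(audio[i:i + 1280])
    scores.append(list(prediction.values())[0])  # score of the single model

print(max(scores))  # would ideally approach 1.0 for a genuine detection
```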
Again you will notice that the y axis range has to be `0, 0.001` instead of `0, 1`.
It does recognize the single positive hit in the sample correctly:
And now we calculate the false accept rate:
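The calculation itself is essentially counting how often the per-frame score crosses a detection threshold while running the model over long negative-only audio (sketch with dummy score values, continuing the idea from the snippet above):

```python
# scores: per-frame model outputs as in the snippet above, but collected over
# long negative-only audio; dummy values here just to keep the sketch runnable
scores = [0.0002, 0.0003, 0.6, 0.7, 0.0002, 0.55, 0.0001]

threshold = 0.5
false_accepts, previously_above = 0, False
for s in scores:
    if s >= threshold and not previously_above:  # count each threshold crossing once
        false_accepts += 1
    previously_above = s >= threshold

print(false_accepts)  # -> 2 for the dummy values above
```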
The result is 12. That's not great, but I also have seen worse. The resulting graph is interesting though:
As you can see this is not only between 0 and 0.001 again, but there is also a constant "base" around 0.0002 for some reason.
For comparison, again the resulting graph when run without my "manual randomization" and with `shuffle = True`:

So it's clear that something is going wrong with my data randomization function above, which I could not figure out yet...
Maybe you have an idea or perhaps even seen something similar before?
Thanks a lot
Andreas