Replies: 7 comments 8 replies
-
Hi @dscripka – would love to help test this out. How can I help?
-
@dalehumby, attached is an example evaluation notebook and ONNX model for the prototype speaker identification model. It isn't a very streamlined process yet, but hopefully it is clear enough to follow. Let me know if you have questions about the notebook, and thanks for your help. A few comments:
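For anyone following along, the comparison the notebook performs boils down to something like the sketch below. This is a simplified illustration, not the actual notebook code: it assumes the attached ONNX model maps a clip's audio-embedding features to a fixed-size speaker vector, and the file name `speaker_model.onnx` and the feature-extraction step are placeholders.

```python
# Simplified sketch of the evaluation logic (assumptions noted above; the
# feature-extraction step that produces `features` is omitted here).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("speaker_model.onnx")  # placeholder file name
input_name = session.get_inputs()[0].name

def get_speaker_vector(features: np.ndarray) -> np.ndarray:
    """Run the speaker model on one clip's features and L2-normalize the result."""
    vec = session.run(None, {input_name: features[None, ...].astype(np.float32)})[0][0]
    return vec / np.linalg.norm(vec)

def average_similarity(test_features, reference_features_list):
    """Mean cosine similarity between a test clip and a set of reference clips."""
    test_vec = get_speaker_vector(test_features)
    ref_vecs = [get_speaker_vector(f) for f in reference_features_list]
    return float(np.mean([np.dot(test_vec, r) for r in ref_vecs]))
```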
-
Thanks, I've uploaded it to Google Colab and over the next few days will go through it, getting sample data, etc. When I imported it I got an import error. For negative data, https://github.com/dscripka/openWakeWord/blob/main/notebooks/training_models.ipynb shows how to get some, or should I just run the notebook from the "Model Evaluation" section, loading the attached ONNX model?
-
Hey, thanks for the updated notebook. I got it working... but I have some not-so-good news. I recorded myself saying "Hey Jarvis" 5 times using my Jabra 410 speakerphone in the kitchen (relatively quiet, but with room echo), from about 30 cm away.
I got 2 different TTS engines to say "Hey Jarvis" (2 negative samples total) and the scores are:
I'll try to get 2-5 samples of another person saying "Hey Jarvis" and see if that works as a negative sample. Do you have an email where I can send my sound files (if that helps you)?
-
Thanks for sharing the data @dalehumby. I've done some more testing and I think the strange performance was (partly) due to a few things:
After fixing these issues (converting to 16 kHz and padding the clips with a few seconds of silence before and after the "hey jarvis" wake word) I'm getting these average scores:

Dale compared to other Dale clips: 0.72

However, while these scores look reasonable, the results may not be consistent in practice unless there is a relatively large number of reference samples, as the average scores for the 2nd speaker you shared are worse:

Second speaker compared to other second speaker clips: 0.66

I'll continue experimenting with different modeling and training approaches to see if I can improve performance further.
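For anyone reproducing the fix above, the preprocessing amounts to something like the following sketch (file names and pad length are illustrative, not the exact notebook code):

```python
# Resample a clip to 16 kHz and pad it with silence on each side, as described above.
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000
PAD_SECONDS = 2

audio, sr = sf.read("hey_jarvis_raw.wav")   # placeholder file name
if audio.ndim > 1:                          # mix down to mono if needed
    audio = audio.mean(axis=1)
if sr != TARGET_SR:                         # resample to 16 kHz
    audio = resample_poly(audio, TARGET_SR, sr)

pad = np.zeros(TARGET_SR * PAD_SECONDS, dtype=audio.dtype)
padded = np.concatenate([pad, audio, pad])  # a few seconds of silence before and after
sf.write("hey_jarvis_16khz_padded.wav", padded, TARGET_SR)
```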
-
@dalehumby, I've been (slowly) doing some more testing, and have a new set of results. Thanks for sharing the additional data, that was very helpful. Attached is an updated model and evaluation notebook. The model is now a recurrent network that operates on the shared audio embeddings, which seems to work a bit better than the fully-connected network. Using this model, I now get better average distances between a random test clip and a set of reference clips for the "hey jarvis" wake word:

Speaker 1 (Dale) vs Speaker 1 (Dale) avg. score = 0.92

However, there are some caveats that I've noticed during my testing:
Another line of research I'd like to pursue is an online clustering system that would collect reference samples as the system is used; this would reduce the up-front effort but may require the user to set a fixed number of speakers, periodically correct samples, etc. What are your thoughts on this? Does that seem like it would become too complex for the average user, or require too much manual effort to be practical?
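To make the online clustering idea concrete, a very rough sketch of what the assignment step could look like is below; the threshold, labels, and structure are purely illustrative and not an actual implementation:

```python
# Each new wake word utterance is assigned to the closest known speaker centroid
# if the similarity is high enough, otherwise it opens a provisional new cluster
# that a user could later label or merge.
import numpy as np

class OnlineSpeakerClusters:
    def __init__(self, similarity_threshold: float = 0.75):
        self.threshold = similarity_threshold
        self.centroids = {}   # speaker label -> (unit-norm centroid vector, sample count)

    def assign(self, speaker_vec: np.ndarray) -> str:
        speaker_vec = speaker_vec / np.linalg.norm(speaker_vec)
        best_label, best_sim = None, -1.0
        for label, (centroid, _) in self.centroids.items():
            sim = float(np.dot(speaker_vec, centroid))
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is None or best_sim < self.threshold:
            # No close match: start a provisional cluster for a possibly new speaker
            best_label = f"unknown_{len(self.centroids)}"
            self.centroids[best_label] = (speaker_vec, 1)
        else:
            # Close match: fold the new sample into the running centroid
            centroid, n = self.centroids[best_label]
            new_centroid = (centroid * n + speaker_vec) / (n + 1)
            self.centroids[best_label] = (new_centroid / np.linalg.norm(new_centroid), n + 1)
        return best_label
```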
-
This is a great improvement. Thanks for continuing to look into this. I've loaded it into Colab and get the same results :) Would it be helpful if I could send you more samples? From me or speaker 2, or other phrases?
I think this is to be expected. Even Google (and Siri) doesn't match voices correctly, and e.g. speaker 2 is able to activate Siri on my phone. If a confidence score is sent along with the speaker ID then any downstream application can decide whether it wants to trust the speaker ID or not.
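As a hypothetical illustration of that point (the field names and threshold below are made up, not an existing payload format), the downstream decision could be as simple as:

```python
# A wake word event that carries both a speaker label and a confidence score;
# the consuming application chooses how much to trust it.
event = {"wakeword": "hey_jarvis", "speaker": "dale", "speaker_confidence": 0.66}

if event["speaker_confidence"] >= 0.8:
    user = event["speaker"]   # confident enough to personalize the response
else:
    user = None               # fall back to a non-personalized response
```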
A few things come to mind here:

In openWakeWord you extend the training set by taking base samples and mixing in noise, echo, etc. Is this a practical solution to extending the positive and negative training sets as well? (A rough sketch of this kind of augmentation follows below.)

I really like the idea of online training, although I am hesitant about defining a fixed number of speakers ahead of time. Specifically, what happens if there is a guest? Will the speaker ID system force the guest's voice into one of the existing clusters, and then pollute the sample set in future training?

I think a relatively short onboarding of new speakers + collecting more samples as it's used + allowing you to label the collected data would be a good user experience, especially if it's built into e.g. the Rhasspy web UI. For instance, for Rhasspy's Raven wake word you label a wake word, record 3 samples and click save. This is very fast. It would be nice if there was an option to continue adding samples, so if you wanted to you could add 6... 10, until you got bored. IIRC, Siri's onboarding asks you to say "Hey Siri" three times, and then "what's the weather today", probably used as a test sample.

A visualisation of the current accuracy could help users see where more data is needed, something like:

There could be a UI that showed wake words with low speaker ID confidence and allowed you to label them with the correct speaker (or 'unknown' if you don't know), and also to review/modify all wake word speaker ID labels to correct auto-labeling errors.
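The augmentation idea mentioned above could look roughly like this sketch; it only shows noise mixing at a chosen SNR and is not openWakeWord's actual augmentation pipeline, which also handles reverb, room impulse responses, etc.:

```python
# Mix a background-noise clip into a base sample at a requested SNR, so a small
# set of reference clips can be multiplied into a larger training set.
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to a clean clip at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clean.shape)                 # loop/trim noise to length
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. augmented = [mix_noise(clip, kitchen_noise, snr) for snr in (5, 10, 20)]
```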
-
Issue #21 describes the idea of integrating a speaker identification model into openWakeWord to determine who among a pre-enrolled set of users spoke a given wake word/phrase.
This is an interesting idea, and if it is possible to use the same shared audio embedding backbone as the wake word/phrase detection models, it could be implemented very efficiently. Some initial experiments are proving promising, with the following design choices/caveats:
Some assistance would be useful in evaluating these early prototypes, as real-world testing requires that the same microphone and acoustic environment is used for different speakers to accurately assess performance. Any volunteers would be greatly appreciated!
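As a purely conceptual sketch of the shared-backbone idea (all function names below are placeholders, not the actual openWakeWord API), the per-frame flow could look like:

```python
# Compute the shared audio embedding once per frame and feed it to both the wake
# word head and a small speaker identification head, so speaker ID adds little
# extra compute on top of work that is already being done.
import numpy as np

def match_speaker(vec, reference_vectors):
    """Placeholder: nearest pre-enrolled speaker by cosine similarity."""
    vec = vec / np.linalg.norm(vec)
    sims = {name: float(np.dot(vec, ref)) for name, ref in reference_vectors.items()}
    best = max(sims, key=sims.get)
    return best, sims[best]

def process_frame(audio_frame, embedding_model, wakeword_head, speaker_head,
                  reference_vectors, activation_threshold=0.5):
    features = embedding_model(audio_frame)      # shared embedding backbone (one pass)
    score = wakeword_head(features)              # existing wake word detection
    if score >= activation_threshold:
        speaker_vec = speaker_head(features)     # small extra head reusing the same features
        speaker, confidence = match_speaker(speaker_vec, reference_vectors)
        return {"wakeword_score": score, "speaker": speaker, "confidence": confidence}
    return {"wakeword_score": score}
```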