Replies: 7 comments 8 replies
-
Hi @dscripka – would love to help test this out. How can I help?
-
@dalehumby, attached is an example evaluation notebook and ONNX model for the prototype speaker identification model. It isn't a very streamlined process yet, but hopefully it is clear enough to follow. Let me know if you have questions about the notebook, and thanks for your help. A few comments:
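For anyone following along, the comparison the notebook performs boils down to something like the sketch below. This is a simplified illustration, not the actual notebook code: it assumes the attached ONNX model maps a clip's audio-embedding features to a fixed-size speaker vector, and the file name `speaker_model.onnx` and the feature-extraction step are placeholders.

```python
# Simplified sketch of the evaluation logic (assumptions noted above; the
# feature-extraction step that produces `features` is omitted here).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("speaker_model.onnx")  # placeholder file name
input_name = session.get_inputs()[0].name

def get_speaker_vector(features: np.ndarray) -> np.ndarray:
    """Run the speaker model on one clip's features and L2-normalize the result."""
    vec = session.run(None, {input_name: features[None, ...].astype(np.float32)})[0][0]
    return vec / np.linalg.norm(vec)

def average_similarity(test_features, reference_features_list):
    """Mean cosine similarity between a test clip and a set of reference clips."""
    test_vec = get_speaker_vector(test_features)
    ref_vecs = [get_speaker_vector(f) for f in reference_features_list]
    return float(np.mean([np.dot(test_vec, r) for r in ref_vecs]))
```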
-
Thanks, I've uploaded it to Google Colab and over the next few days will go through it, getting sample data, etc. When I imported it I got an import error. For negative data, https://github.com/dscripka/openWakeWord/blob/main/notebooks/training_models.ipynb shows how to get some, or should I just run the notebook from the "Model Evaluation" section, loading the attached ONNX model?
-
Hey, thanks for the updated notebook. I got it working... but I have some not-so-good news. I recorded myself saying "Hey Jarvis" 5 times using my Jabra 410 speakerphone in the kitchen (relatively quiet, but with room echo), from about 30 cm away.
I got 2 different TTS engines to say "Hey Jarvis" (2 negative samples total) and the scores are:
I'll try to get 2-5 samples of another person saying "Hey Jarvis" and see if that works as a negative sample. Do you have an email where I can send my sound files (if that helps you)?
-
Thanks for sharing the data @dalehumby. I've done some more testing and I think the strange performance was (partly) due to a few things:
After fixing these issues (converting to 16 kHz and padding the clips with a few seconds of silence before and after the "hey jarvis" wake word) I'm getting these average scores:

Dale compared to other Dale clips: 0.72

However, while these scores look reasonable, the results may not be consistent in practice unless there is a relatively large number of reference samples, as the average scores for the 2nd speaker you shared are worse:

Second speaker compared to other second speaker clips: 0.66

I'll continue experimenting with different modeling and training approaches to see if I can improve performance further.
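For anyone reproducing the fix above, the preprocessing amounts to something like the following sketch (file names and pad length are illustrative, not the exact notebook code):

```python
# Resample a clip to 16 kHz and pad it with silence on each side, as described above.
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 16000
PAD_SECONDS = 2

audio, sr = sf.read("hey_jarvis_raw.wav")   # placeholder file name
if audio.ndim > 1:                          # mix down to mono if needed
    audio = audio.mean(axis=1)
if sr != TARGET_SR:                         # resample to 16 kHz
    audio = resample_poly(audio, TARGET_SR, sr)

pad = np.zeros(TARGET_SR * PAD_SECONDS, dtype=audio.dtype)
padded = np.concatenate([pad, audio, pad])  # a few seconds of silence before and after
sf.write("hey_jarvis_16khz_padded.wav", padded, TARGET_SR)
```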
-
@dalehumby, I've been (slowly) doing some more testing, and have a new set of results. Thanks for sharing the additional data, that was very helpful. Attached is an updated model and evaluation notebook. The model is now a recurrent network that operates on the shared audio embeddings, which seems to work a bit better than the fully-connected network. Using this model, I now get better average distances between a random test clip and a set of reference clips for the "hey jarvis" wake word:

Speaker 1 (Dale) vs Speaker 1 (Dale) avg. score = 0.92

However, there are some caveats that I've noticed during my testing:
Another line of research I'd like to pursue is an online clustering system that would collect reference samples as the system is used; this would reduce the up-front effort but may require the user to set a fixed number of speakers, periodically correct samples, etc. What are your thoughts on this? Does that seem like it would become too complex for the average user, or require too much manual effort to be practical?
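To make the online clustering idea concrete, a very rough sketch of what the assignment step could look like is below; the threshold, labels, and structure are purely illustrative and not an actual implementation:

```python
# Each new wake word utterance is assigned to the closest known speaker centroid
# if the similarity is high enough, otherwise it opens a provisional new cluster
# that a user could later label or merge.
import numpy as np

class OnlineSpeakerClusters:
    def __init__(self, similarity_threshold: float = 0.75):
        self.threshold = similarity_threshold
        self.centroids = {}   # speaker label -> (unit-norm centroid vector, sample count)

    def assign(self, speaker_vec: np.ndarray) -> str:
        speaker_vec = speaker_vec / np.linalg.norm(speaker_vec)
        best_label, best_sim = None, -1.0
        for label, (centroid, _) in self.centroids.items():
            sim = float(np.dot(speaker_vec, centroid))
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is None or best_sim < self.threshold:
            # No close match: start a provisional cluster for a possibly new speaker
            best_label = f"unknown_{len(self.centroids)}"
            self.centroids[best_label] = (speaker_vec, 1)
        else:
            # Close match: fold the new sample into the running centroid
            centroid, n = self.centroids[best_label]
            new_centroid = (centroid * n + speaker_vec) / (n + 1)
            self.centroids[best_label] = (new_centroid / np.linalg.norm(new_centroid), n + 1)
        return best_label
```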
-
This is a great improvement. Thanks for continuing to look into this. I've loaded it into Colab and get the same results :) Would it be helpful if I could send you more samples? From me or speaker 2, or other phrases?
I think this is to be expected. Even Google (and Siri) doesn't match voices correctly, and e.g. speaker 2 is able to activate Siri on my phone. If a confidence score is sent along with the speaker ID then any downstream application can decide whether it wants to trust the speaker ID or not.
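As a hypothetical illustration of that point (the field names and threshold below are made up, not an existing payload format), the downstream decision could be as simple as:

```python
# A wake word event that carries both a speaker label and a confidence score;
# the consuming application chooses how much to trust it.
event = {"wakeword": "hey_jarvis", "speaker": "dale", "speaker_confidence": 0.66}

if event["speaker_confidence"] >= 0.8:
    user = event["speaker"]   # confident enough to personalize the response
else:
    user = None               # fall back to a non-personalized response
```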
A few things come to mind here:

In openWakeWord you extend the training set by taking base samples and mixing in noise, echo, etc. Is this a practical solution to extending the positive and negative training sets as well? (A rough sketch of this kind of augmentation follows below.)

I really like the idea of online training, although I am hesitant about defining a fixed number of speakers ahead of time. Specifically, what happens if there is a guest? Will the speaker ID system force the guest's voice into one of the existing clusters, and then pollute the sample set in future training?

I think a relatively short onboarding of new speakers + collecting more samples as it's used + allowing you to label the collected data would be a good user experience, especially if it's built into e.g. the Rhasspy web UI. For instance, for Rhasspy's Raven wake word you label a wake word, record 3 samples and click save. This is very fast. It would be nice if there was an option to continue adding samples, so if you wanted to you could add 6... 10, until you got bored. IIRC, Siri's onboarding asks you to say "Hey Siri" three times, and then "what's the weather today", probably used as a test sample.

A visualisation of the current accuracy could help users see where more data is needed, something like:

There could be a UI that showed wake words with low speaker ID confidence and allowed you to label them with the correct speaker (or 'unknown' if you don't know), and also to review/modify all wake word speaker ID labels to correct auto-labeling errors.
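The augmentation idea mentioned above could look roughly like this sketch; it only shows noise mixing at a chosen SNR and is not openWakeWord's actual augmentation pipeline, which also handles reverb, room impulse responses, etc.:

```python
# Mix a background-noise clip into a base sample at a requested SNR, so a small
# set of reference clips can be multiplied into a larger training set.
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to a clean clip at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clean.shape)                 # loop/trim noise to length
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. augmented = [mix_noise(clip, kitchen_noise, snr) for snr in (5, 10, 20)]
```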
-
Issue #21 describes the idea of integrating a speaker identification model into openWakeWord to determine who among a pre-enrolled set of users spoke a given wake word/phrase.
This is an interesting idea, and if it is possible to use the same shared audio embedding backbone as the wake word/phrase detection models, it could be implemented very efficiently. Some initial experiments are proving promising, with the following design choices/caveats:
Some assistance would be useful in evaluating these early prototypes, as real-world testing requires that the same microphone and acoustic environment is used for different speakers to accurately assess performance. Any volunteers would be greatly appreciated!
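As a purely conceptual sketch of the shared-backbone idea (all function names below are placeholders, not the actual openWakeWord API), the per-frame flow could look like:

```python
# Compute the shared audio embedding once per frame and feed it to both the wake
# word head and a small speaker identification head, so speaker ID adds little
# extra compute on top of work that is already being done.
import numpy as np

def match_speaker(vec, reference_vectors):
    """Placeholder: nearest pre-enrolled speaker by cosine similarity."""
    vec = vec / np.linalg.norm(vec)
    sims = {name: float(np.dot(vec, ref)) for name, ref in reference_vectors.items()}
    best = max(sims, key=sims.get)
    return best, sims[best]

def process_frame(audio_frame, embedding_model, wakeword_head, speaker_head,
                  reference_vectors, activation_threshold=0.5):
    features = embedding_model(audio_frame)      # shared embedding backbone (one pass)
    score = wakeword_head(features)              # existing wake word detection
    if score >= activation_threshold:
        speaker_vec = speaker_head(features)     # small extra head reusing the same features
        speaker, confidence = match_speaker(speaker_vec, reference_vectors)
        return {"wakeword_score": score, "speaker": speaker, "confidence": confidence}
    return {"wakeword_score": score}
```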