How is the output format of embedding_model used? #175

Open
elephantpanda opened this issue May 30, 2024 · 3 comments

@elephantpanda commented May 30, 2024

I am trying to implement the open model using the ONNX files provided.

The output of embedding_model.onnx is (1, 1, 1, 96).
Its input is (from here):
inputs = tf.keras.Input((76, 32, 1))  # melspectrogram shape when provided with 12400 samples at 16 kHz

However, the inputs to the different wake word models are shapes like (1, 16, 96) or (1, 22, 96).

How is this used?

Do we run the embedding model multiple times over the stream?

Then does (1, 16, 96) mean we take 16 consecutive chunks?

This is confusing me, since the audio sample rate is 16000 Hz but a chunk of 12400 samples is roughly 0.78 seconds; 16 such chunks would be far too long. Does it use a sliding window?

@dscripka (Owner) commented:

I agree that it can be a bit confusing. The full implementation of the processing of raw audio data can be found here. At a high level:

  1. Audio data is processed in 80 ms chunks (by default; this can be changed), and the melspectrogram is computed in a streaming fashion to save computation.
  2. Melspectrograms of a fixed size are passed to the embedding model, which takes the equivalent of 12400 samples (at 16 kHz) at a time. This melspectrogram window advances 80 ms at a time, so each new set of 12400 samples overlaps substantially with the previous one (so yes, it is a sliding window).
  3. The output of the embedding model is a vector of 96 features (corresponding to 80 ms). Depending on the size of the model (16, 22, etc.), these feature vectors are stacked in the time dimension until the required size is reached, and then the wakeword model predicts. The whole process then repeats, with the latest feature vector added to the sliding window (see the sketch after this list).
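
To make the flow above concrete, here is a minimal sketch using ONNX Runtime. It is not the openWakeWord implementation; the wakeword model file name, the 76x32 melspectrogram window, the 16-frame feature buffer, and the model input/output handling are assumptions based on this thread.

```python
import collections

import numpy as np
import onnxruntime as ort

CHUNK = 1280        # 80 ms of audio at 16 kHz
MEL_WINDOW = 76     # melspectrogram frames per embedding-model input (assumed from the Keras shape above)
N_FEATURES = 16     # stacked embeddings the wakeword model expects, e.g. input shape (1, 16, 96)

mel_sess = ort.InferenceSession("melspectrogram.onnx")   # raw audio -> mel frames
emb_sess = ort.InferenceSession("embedding_model.onnx")  # (1, 76, 32, 1) -> (1, 1, 1, 96)
ww_sess = ort.InferenceSession("wakeword_model.onnx")    # hypothetical file name

mel_frames = collections.deque(maxlen=MEL_WINDOW)   # rolling melspectrogram window
embeddings = collections.deque(maxlen=N_FEATURES)   # rolling 96-d feature vectors

def process_chunk(audio_chunk):
    """Consume one 80 ms chunk of 16 kHz audio; return a wakeword score once
    enough context has accumulated, otherwise None."""
    assert audio_chunk.shape[-1] == CHUNK

    # 1. Streaming melspectrogram: only the new 80 ms of audio is converted to mel
    #    frames, which are appended to the rolling 76-frame window.
    mel = mel_sess.run(None, {mel_sess.get_inputs()[0].name:
                              audio_chunk.astype(np.float32)[None, :]})[0]
    for frame in np.squeeze(mel).reshape(-1, 32):
        mel_frames.append(frame)
    if len(mel_frames) < MEL_WINDOW:
        return None

    # 2. Embedding model: the current 76x32 window (~0.78 s of audio) yields one
    #    96-dimensional feature vector; the window slides forward 80 ms per call.
    window = np.array(mel_frames, dtype=np.float32)[None, :, :, None]   # (1, 76, 32, 1)
    emb = emb_sess.run(None, {emb_sess.get_inputs()[0].name: window})[0]
    embeddings.append(emb.reshape(96))

    # 3. Wakeword model: once 16 consecutive embeddings are stacked in the time
    #    dimension, run the prediction on shape (1, 16, 96).
    if len(embeddings) < N_FEATURES:
        return None
    features = np.array(embeddings, dtype=np.float32)[None, :, :]       # (1, 16, 96)
    score = ww_sess.run(None, {ww_sess.get_inputs()[0].name: features})[0]
    return float(np.squeeze(score))
```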

@elephantpanda (Author) commented Jun 13, 2024

Thanks, I managed to figure it out in the end, and thanks for confirming.

BTW, two things I found:

(1) 12400 is not enough samples; I found that about 12600 samples are needed to get the correct output size from the ONNX model.
(2) It is also important to mention that after the spectrogram is calculated, it must be normalised by the transformation x -> x/10 + 2.

After I did those two things it all worked out great. Thanks.
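
For reference, a short sketch of those two tweaks; the constant, the function name, and the exact melspectrogram frontend are illustrative rather than taken from the repo:

```python
import numpy as np

MIN_SAMPLES = 12600   # ~12400 samples was not quite enough; ~12600 yields a full 76x32 melspectrogram

def normalize_mel(mel_db: np.ndarray) -> np.ndarray:
    # Scale/shift applied after the melspectrogram is computed, before the embedding model.
    return mel_db / 10.0 + 2.0
```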

Also, I was wondering how you calculate the spectrogram in a streaming fashion. I assume the spectrogram is most accurate in the centre part, but you wouldn't want to calculate it too often given that the windows overlap substantially, as that would be wasteful. Any tips on this?

@dscripka (Owner) commented:

I'm glad you were able to get it working. Point 2 is indeed important, and I agree it is a bit confusing. When I was converting the original Google model into separate components, I was never able to exactly reproduce what Google was doing for the melspectrogram, so that transformation equation was used to make the results at least similar.

As for the streaming melspectrogram, you are correct that there is some overlap between the previous and new input audio to ensure the results stay close enough to the non-streaming version, but I've found this is still more efficient than re-calculating the entire melspectrogram each time. The actual streaming implementation is fairly simple and is implemented here.
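
For anyone else reading, one rough sketch of that idea (not the actual openWakeWord code; the FFT/hop sizes and the `mel_fn` callable are assumptions): keep only the tail of the previous audio that the next STFT windows need, and compute mel frames for the new chunk alone.

```python
import numpy as np

CHUNK = 1280    # 80 ms at 16 kHz
N_FFT = 400     # 25 ms analysis window (assumed)
HOP = 160       # 10 ms hop (assumed)

class StreamingMel:
    def __init__(self, mel_fn):
        # mel_fn: any function mapping raw 16 kHz audio -> mel frames
        # (e.g. a librosa- or ONNX-based frontend)
        self.mel_fn = mel_fn
        self.tail = np.zeros(N_FFT - HOP, dtype=np.float32)  # overlap carried between chunks

    def push(self, chunk):
        # Prepend the saved overlap so the first STFT window of this call has full
        # left context, compute mel frames for the new audio only, then save the new
        # tail rather than re-running the melspectrogram over the whole history.
        audio = np.concatenate([self.tail, chunk.astype(np.float32)])
        frames = self.mel_fn(audio)            # only ~CHUNK / HOP new frames per call
        self.tail = audio[-(N_FFT - HOP):]
        return frames
```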
