How is the output format of embedding_model used? #175

Open
elephantpanda opened this issue May 30, 2024 · 3 comments

@elephantpanda commented May 30, 2024

I am trying to implement the open model using the ONNX files provided.

The output of embedding_model.onnx is (1, 1, 1, 96).
Its input is (from here):
inputs = tf.keras.Input((76, 32, 1))  # melspectrogram shape when provided with 12400 samples at 16 kHz

However, the inputs to the different wake word models are shapes like (1, 16, 96) or (1, 22, 96).

How is this used?

Do we run the embedding model multiple times over the stream?

Then does (1, 16, 96) mean we take 16 consecutive chunks?

This is confusing me, since the audio sample rate is 16000 Hz but a chunk of 12400 samples is roughly 0.78 seconds; 16 such chunks would be far too long. Does it use a sliding window?

@dscripka (Owner) commented:

I agree that it can be a bit confusing. The full implementation of the processing of raw audio data can be found here. At a high level:

  1. Audio data is processed in 80 ms chunks (by default; this can be changed), and the melspectrogram is computed in a streaming fashion to save computation.
  2. Melspectrograms of a fixed size are passed to the embedding model, which takes the equivalent of 12400 samples (at 16 kHz) at a time. This melspectrogram window advances 80 ms at a time, so each new set of 12400 samples overlaps substantially with the previous one (so yes, it is a sliding window).
  3. The output of the embedding model is a vector of 96 features (corresponding to 80 ms). Depending on the size of the model (16, 22, etc.), these feature vectors are stacked in the time dimension until the required size is reached, and then the wakeword model predicts. The whole process then repeats, with the latest feature vector added to the sliding window (see the sketch after this list).
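
To make the flow above concrete, here is a minimal sketch using ONNX Runtime. It is not the openWakeWord implementation; the wakeword model file name, the 76x32 melspectrogram window, the 16-frame feature buffer, and the model input/output handling are assumptions based on this thread.

```python
import collections

import numpy as np
import onnxruntime as ort

CHUNK = 1280        # 80 ms of audio at 16 kHz
MEL_WINDOW = 76     # melspectrogram frames per embedding-model input (assumed from the Keras shape above)
N_FEATURES = 16     # stacked embeddings the wakeword model expects, e.g. input shape (1, 16, 96)

mel_sess = ort.InferenceSession("melspectrogram.onnx")   # raw audio -> mel frames
emb_sess = ort.InferenceSession("embedding_model.onnx")  # (1, 76, 32, 1) -> (1, 1, 1, 96)
ww_sess = ort.InferenceSession("wakeword_model.onnx")    # hypothetical file name

mel_frames = collections.deque(maxlen=MEL_WINDOW)   # rolling melspectrogram window
embeddings = collections.deque(maxlen=N_FEATURES)   # rolling 96-d feature vectors

def process_chunk(audio_chunk):
    """Consume one 80 ms chunk of 16 kHz audio; return a wakeword score once
    enough context has accumulated, otherwise None."""
    assert audio_chunk.shape[-1] == CHUNK

    # 1. Streaming melspectrogram: only the new 80 ms of audio is converted to mel
    #    frames, which are appended to the rolling 76-frame window.
    mel = mel_sess.run(None, {mel_sess.get_inputs()[0].name:
                              audio_chunk.astype(np.float32)[None, :]})[0]
    for frame in np.squeeze(mel).reshape(-1, 32):
        mel_frames.append(frame)
    if len(mel_frames) < MEL_WINDOW:
        return None

    # 2. Embedding model: the current 76x32 window (~0.78 s of audio) yields one
    #    96-dimensional feature vector; the window slides forward 80 ms per call.
    window = np.array(mel_frames, dtype=np.float32)[None, :, :, None]   # (1, 76, 32, 1)
    emb = emb_sess.run(None, {emb_sess.get_inputs()[0].name: window})[0]
    embeddings.append(emb.reshape(96))

    # 3. Wakeword model: once 16 consecutive embeddings are stacked in the time
    #    dimension, run the prediction on shape (1, 16, 96).
    if len(embeddings) < N_FEATURES:
        return None
    features = np.array(embeddings, dtype=np.float32)[None, :, :]       # (1, 16, 96)
    score = ww_sess.run(None, {ww_sess.get_inputs()[0].name: features})[0]
    return float(np.squeeze(score))
```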

@elephantpanda (Author) commented Jun 13, 2024

Thanks, I managed to figure it out in the end, and thanks for confirming.

BTW, two things I found:

(1) 12400 is not enough samples; I found that about 12600 samples are needed to get the correct output size from the ONNX model.
(2) It is also important to mention that after the spectrogram is calculated, it must be normalised by the transformation x -> x/10 + 2.

After I did those two things it all worked out great. Thanks.
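
For reference, a short sketch of those two tweaks; the constant, the function name, and the exact melspectrogram frontend are illustrative rather than taken from the repo:

```python
import numpy as np

MIN_SAMPLES = 12600   # ~12400 samples was not quite enough; ~12600 yields a full 76x32 melspectrogram

def normalize_mel(mel_db: np.ndarray) -> np.ndarray:
    # Scale/shift applied after the melspectrogram is computed, before the embedding model.
    return mel_db / 10.0 + 2.0
```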

Also, I was wondering how you calculate the spectrogram in a streaming fashion. I assume the spectrogram is most accurate in the centre part, but you wouldn't want to calculate it too often given that the windows overlap substantially, as that would be wasteful. Any tips on this?

@dscripka (Owner) commented:

I'm glad you were able to get it working. Point 2 is indeed important, and I agree it is a bit confusing. When I was converting the original Google model into separate components, I was never able to exactly reproduce what Google was doing for the melspectrogram, so that transformation equation was used to make the results at least similar.

As for the streaming melspectrogram, you are correct that there is some overlap between the previous and new input audio to ensure the results stay close enough to the non-streaming version, but I've found this is still more efficient than re-calculating the entire melspectrogram each time. The actual streaming implementation is fairly simple and is implemented here.
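
For anyone else reading, one rough sketch of that idea (not the actual openWakeWord code; the FFT/hop sizes and the `mel_fn` callable are assumptions): keep only the tail of the previous audio that the next STFT windows need, and compute mel frames for the new chunk alone.

```python
import numpy as np

CHUNK = 1280    # 80 ms at 16 kHz
N_FFT = 400     # 25 ms analysis window (assumed)
HOP = 160       # 10 ms hop (assumed)

class StreamingMel:
    def __init__(self, mel_fn):
        # mel_fn: any function mapping raw 16 kHz audio -> mel frames
        # (e.g. a librosa- or ONNX-based frontend)
        self.mel_fn = mel_fn
        self.tail = np.zeros(N_FFT - HOP, dtype=np.float32)  # overlap carried between chunks

    def push(self, chunk):
        # Prepend the saved overlap so the first STFT window of this call has full
        # left context, compute mel frames for the new audio only, then save the new
        # tail rather than re-running the melspectrogram over the whole history.
        audio = np.concatenate([self.tail, chunk.astype(np.float32)])
        frames = self.mel_fn(audio)            # only ~CHUNK / HOP new frames per call
        self.tail = audio[-(N_FFT - HOP):]
        return frames
```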
