How is the output format of embedding_model used? #175
I agree that it can be a bit confusing. The full implementation of the processing of raw audio data can be found here. At a high level:
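As an illustration only, here is a minimal sketch of that flow with onnxruntime, pieced together from the shapes discussed in this thread (the model file names, tensor handling, and sliding-window step are assumptions, not the repository's exact code):

```python
import numpy as np
import onnxruntime as ort

# Assumed file names for the exported components.
mel = ort.InferenceSession("melspectrogram.onnx")
emb = ort.InferenceSession("embedding_model.onnx")
wake = ort.InferenceSession("wakeword_model.onnx")

def embed_chunk(chunk: np.ndarray) -> np.ndarray:
    """Turn one ~12400-sample (~0.775 s) chunk of 16 kHz audio into a 96-dim embedding."""
    mel_in = mel.get_inputs()[0].name
    spec = mel.run(None, {mel_in: chunk[None, :].astype(np.float32)})[0]
    # Shapes quoted in this issue: the embedding model expects a (76, 32, 1) melspectrogram
    # and returns (1, 1, 1, 96). Any extra scaling applied to the melspectrogram
    # (mentioned later in this thread) is omitted here.
    emb_in = emb.get_inputs()[0].name
    out = emb.run(None, {emb_in: spec.reshape(1, 76, 32, 1).astype(np.float32)})[0]
    return out.reshape(96)

def score_audio(audio: np.ndarray, chunk_size: int = 12400, step: int = 1280) -> float:
    """Slide a chunk-sized window over the stream (the step size is an assumed value),
    stack the last 16 embeddings into (1, 16, 96), and score them with a wake word model."""
    embeddings = [
        embed_chunk(audio[start:start + chunk_size])
        for start in range(0, len(audio) - chunk_size + 1, step)
    ]
    stacked = np.stack(embeddings[-16:])[None, ...]  # (1, 16, 96); assumes at least 16 chunks
    wake_in = wake.get_inputs()[0].name
    return float(wake.run(None, {wake_in: stacked.astype(np.float32)})[0].squeeze())
```

Looking up get_inputs()[0].name at run time avoids hard-coding tensor names from the exported graphs.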
Thanks, I managed to figure it out in the end, and thanks for confirming. BTW, two things I found: (1) 12400 is not enough samples; I found that about 12600 is enough to get the correct size output from the ONNX. After I did those two things it all worked out great. Thanks. Also, I was wondering how you calculate the spectrogram in a streaming fashion. I assume the spectrogram is most accurate in the centre part, but you wouldn't want to calculate it too often given that it overlaps substantially, as that would be wasteful computation. Any tips on this?
I'm glad you were able to get it working. Point 2 is indeed important, and I agree it is a bit confusing. When I was converting the original Google model into separate components, I was never able to reproduce exactly what Google was doing for the melspectrogram, so that transformation equation was used to make the results at least similar. As for the streaming melspectrogram, you are correct that there is some overlap between the previous and new input audio to ensure the results stay close enough to the non-streaming version, but I've found that it is still more efficient than re-calculating the entire melspectrogram each time. The actual streaming implementation is fairly simple and is implemented here.
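For anyone following along, a generic sketch of that streaming idea (this is not the linked implementation; the window and hop sizes are assumed to be the usual 25 ms / 10 ms at 16 kHz): keep the unconsumed tail of the previous chunk so that frames straddling a chunk boundary are computed on contiguous audio, rather than recomputing the whole melspectrogram every time.

```python
import numpy as np

class StreamingMelFrames:
    """Streaming frame extraction: carry over leftover samples between calls so
    the emitted frames match the non-streaming result exactly."""

    def __init__(self, mel_fn, window: int = 400, hop: int = 160):
        self.mel_fn = mel_fn  # maps a (window,) float array to one mel frame, e.g. shape (32,)
        self.window = window  # assumed 25 ms at 16 kHz
        self.hop = hop        # assumed 10 ms at 16 kHz
        self.buffer = np.zeros(0, dtype=np.float32)

    def add_audio(self, new_audio: np.ndarray) -> list:
        """Append new audio and return the mel frames it completes."""
        self.buffer = np.concatenate([self.buffer, new_audio.astype(np.float32)])
        frames = []
        start = 0
        while start + self.window <= len(self.buffer):
            frames.append(self.mel_fn(self.buffer[start:start + self.window]))
            start += self.hop
        # Keep only the unconsumed tail (always less than one window) for the next call.
        self.buffer = self.buffer[start:]
        return frames
```

Because only that small tail is retained between calls, no previously emitted frame is ever recomputed, which is the efficiency gain over redoing the full melspectrogram each time.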
I am trying to implement the open model using the onnx files provided.
The output of embedding_model.onnx is (1,1,1,96).
Its input is (from here):
inputs = tf.keras.Input((76, 32, 1)) # melspectrogram shape when provided with 12400 samples at 16 khz
However, the inputs for the different wake word models are shapes like (1,16,96) or (1,22,96).
How is this used?
Do we run the embedding model multiple times over the stream?
Then does (1,16,96) mean we take 16 consecutive chunks?
This is confusing me, since the audio sample rate is 16000 Hz but the chunk size is 12400 samples, which is 0.775 of a second; 16 consecutive chunks would be far too much audio. Does it use a sliding window?
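A quick sanity check on the numbers, assuming the common 25 ms window and 10 ms hop at 16 kHz (an assumption, not something stated in the code comment above):

```python
# Assumed STFT parameters at 16 kHz: 25 ms window (400 samples), 10 ms hop (160 samples).
samples, window, hop = 12400, 400, 160
frames = 1 + (samples - window) // hop
print(frames)  # 76, matching the (76, 32, 1) melspectrogram input above
```

Under that reading, the (1,16,96) wake word input would be 16 embeddings collected from a sliding window over the stream, each new window reusing most of the previous audio, which matches the streaming approach described in the earlier comment rather than 16 separate 0.775-second chunks.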