
How to achieve live transcription #19

Closed
csukuangfj opened this issue Oct 22, 2024 · 27 comments

Labels
enhancement New feature or request

Comments

@csukuangfj

The title of the paper https://arxiv.org/pdf/2410.15608 is

Moonshine: Speech Recognition for Live Transcription and Voice Commands

However, the model is a non-streaming model. Could you describe how to achieve live transcription?

The demo in this repo only decodes files; it would be nice if you could provide a demo for live transcription.

@info-wordcab

Definitely would be interesting to see a live transcription demo!

@evmaki
Contributor

evmaki commented Oct 22, 2024

Thanks for the request! We are working on a live transcription demo that we will add to the repo soon.

@evmaki evmaki self-assigned this Oct 22, 2024
@andimarafioti
Contributor

I know it's not exactly what you asked for, but I added the model to Hugging Face's speech-to-speech library, so it does live translation.

@csukuangfj csukuangfj changed the title How to acheive live transcription How to achieve live transcription Oct 22, 2024
@evmaki evmaki added the enhancement New feature or request label Oct 23, 2024
@evmaki evmaki removed their assignment Oct 23, 2024
@sleepingcat4

@evmaki I tried to run a benchmark and the results were very disappointing. I used the following setup:

  1. JAX backend
  2. 10 s splits
  3. the percussive component is used for transcription
  4. librosa to remove silence (trim function)
  5. the tiny variant of Moonshine

In the benchmark, only Moonshine was loaded and used to transcribe the audio segments; the rest were preprocessing steps. The result was 48 minutes to transcribe a video that is 1 hour 28 minutes long, on CPU.

I ran a benchmark using the faster-whisper tiny model on CPU with the following setup:

  1. CPU backend
  2. int8
  3. same preprocessing steps
  4. 10 s segments fed one at a time
  5. same function as for Moonshine, with only the model changed

It took under 8 minutes to transcribe that same 1 hour 28 minutes audio file. I even liked faster-whisper's tiny model transcription quality better.

Given these results, how are you promising live transcription?

@evmaki
Contributor

evmaki commented Oct 25, 2024

@sleepingcat4 thanks for benchmarking and sharing your results. The Keras implementation currently has some speed issues, which is what's causing this. We've added ONNX models that run much faster. I encourage people to try those out.

Re: live transcriptions. We'll soon be merging a demo (using the ONNX models) that shows live captioning in action. The branch is already public if you want to check it out. Demo script is located here.

PS: What CPU are you running on? And are you willing to share your script so we can reproduce your results?

@csukuangfj
Author

@evmaki

Thank you for sharing the demo!

By the way, for the following two lines:

speech = np.concatenate((speech, chunk))

print_captions(transcribe(speech))

they result in redundant computation, which is not efficient.
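
For illustration, here is a toy sketch (not from the demo) of how that cost adds up: because the whole accumulated buffer is passed to transcribe() after every new chunk, total compute grows roughly quadratically with utterance length. The chunk and utterance durations below are made-up values.

CHUNK_SECS = 0.5        # hypothetical chunk duration delivered by the audio callback
UTTERANCE_SECS = 10.0   # hypothetical utterance length before the VAD commits the cache

total_transcribed = 0.0
buffered = 0.0
while buffered < UTTERANCE_SECS:
    buffered += CHUNK_SECS
    total_transcribed += buffered   # the whole buffer is re-transcribed each time

print(f"{UTTERANCE_SECS:.0f} s of speech -> {total_transcribed:.0f} s of audio run through the model")
# Prints roughly 105 s for a 10 s utterance: with N chunks the model processes
# about (N + 1) / 2 times more audio than was actually captured.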

Is there a plan to release a streaming model?

@sleepingcat4

@evmaki below is the function I used to transcribe 10 s segments one at a time.

import os
import re
import time

import moonshine

def transcribe_folder(audio_folder, output_file, model='moonshine/tiny'):
    initial_time = time.time()

    transcription = ""
    audio_files = sorted(
        [f for f in os.listdir(audio_folder) if f.endswith('.wav')],
        key=lambda x: int(re.search(r'part(\d+)\.wav', x).group(1))
    )

    for i, file_name in enumerate(audio_files):
        file_path = os.path.join(audio_folder, file_name)
        transcript = moonshine.transcribe(file_path, model)
        transcript = " ".join(transcript)

        seg_st_time = i * 10
        seg_en_time = seg_st_time + 10

        start_h = seg_st_time // 3600
        start_m = (seg_st_time % 3600) // 60
        start_s = seg_st_time % 60

        end_h = seg_en_time // 3600
        end_m = (seg_en_time % 3600) // 60
        end_s = seg_en_time % 60

        transcription += f"{start_h:02}:{start_m:02}:{start_s:02}-{end_h:02}:{end_m:02}:{end_s:02}s: {transcript}\n"

    with open(output_file, 'w') as f:
        f.write(transcription)

    end_time = time.time()
    execution_time = (end_time - initial_time) / 60
    print(f"Execution Time: {execution_time:.2f} minutes")
    print(f"Transcription saved to: {output_file}")

transcribe_folder("Yj7ZDcHGtK", output_file="bench_script.txt")

@csukuangfj
Author

> I tried to run a benchmark and the results were very disappointing. […] Given these results, how are you promising live transcription?

@sleepingcat4

I suggest that you test it again with sherpa-onnx, which supports Moonshine models.

The following Colab notebook guides you step by step through how to do that:

https://github.com/k2-fsa/colab/blob/master/sherpa-onnx/RTF_comparison_betwen_whisper_and_moonshine.ipynb

@keveman
Contributor

keveman commented Oct 27, 2024

Thanks so much @csukuangfj for providing the notebook and for the comparison with Whisper. Just to summarize: Moonshine tiny can transcribe 335.2 seconds of audio in 19.6 seconds, whereas Whisper tiny.en needs 64.1 seconds. That's a 3.3x speedup, and the transcription quality looks identical. @sleepingcat4 I hope this helps alleviate some of your disappointment. The provided Keras implementation is indeed non-optimal; we released it as a reference with future enhancements in mind. Most deployments we had in mind (such as on SBCs) would never have Torch, TF, JAX, or other such large frameworks installed. ONNX, TFLite, or other homegrown runtimes are better suited for seeing the benefits of Moonshine.

@sleepingcat4

@csukuangfj thanks for the notebook. I was writing an audio transcription pipeline, so I needed benchmarks on long audio before getting started. The results in your notebook look interesting (my benchmark avoided the ONNX models), but I still don't think Moonshine is a good fit for building my dataset or pipeline.

@keveman Moonshine's transcription quality is on par with the OpenAI Whisper model, but it is not quite at the same level as the faster-whisper tiny int8 model. Faster-whisper can capture more nuanced information. If Moonshine can match the speed of tiny Whisper under the faster-whisper library, I would definitely integrate it into my pipeline.

But I will look into the ONNX models and a few other models as well, because at LAION AI we are planning to maintain an open-source repo to share our pipeline, benchmark models, and share the results with everyone. Meanwhile, thanks a lot for the responses and help.

@keveman
Contributor

keveman commented Oct 27, 2024

@sleepingcat4 faster-whisper and OpenAI Whisper are the same underlying models, so I am not sure what you mean by getting better quality with faster-whisper. In any case, we do have a PR out for running Moonshine with CTranslate2, which is what faster-whisper is based on.

@sleepingcat4

@keveman Yes, the models are the same, but faster-whisper allows int8 even for tiny Whisper, and I think CTranslate2 is doing the rest of the speed-up. I meant that Whisper tiny does a better transcription job than Moonshine. I went through both transcriptions for the benchmark I did, and Whisper tiny int8 was able to capture the last one or two lines that Moonshine didn't.

And for me those lines are important. I am also going to run a test with SenseVoice; maybe that can beat Whisper. But Moonshine right now is third on my list.

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4 thank you for checking the branch code! I'm authoring the live captions script. Comparing this script (which in its current form provides a console user interface that emulates streaming-style live captions with very frequent console refreshes) with the faster-whisper speech chunking method requires script changes. To my knowledge they both use the silero-vad iterator class. I'll come back to this issue thread with a summary of suggestions after checking.

  • edited: replaced "GUI" with "console user interface".

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4
The faster-whisper transcriber includes methods to transcribe segments of speech into text. It uses code adapted from silero-vad, see:
https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/vad.py#L13C1-L13C73.
I was mistaken in thinking faster-whisper uses the VADIterator class, though I do see similarities in faster-whisper's adapted code. The Moonshine repo does not currently have an equivalent speech segmentation method.

One suggestion for directly comparing the faster-whisper and Moonshine models on audio file examples is to adapt the faster-whisper code for use with Moonshine models, e.g. adapt code from
https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py.
We don't currently have a plan for this in our repo.

For our open-source live_captions.py demo script I chose to use the native silero-vad VADIterator class for speech chunking. This keeps the code simpler for our standalone demo script and uses the same VAD model as faster-whisper.

Here are suggestions for adapting the live_captions.py demo script to be more similar to faster-whisper's speech segmentation.

  1. Disable the refresh model inferences by commenting out this line and its enclosing if statement:

    print_captions(transcribe(speech))

    This blocks most of the streaming-style caption updates to the console (which we currently do want in our demo) and thus saves computation, as already noted by @csukuangfj above, e.g.:
    How to achieve live transcription #19 (comment)

  2. Set MAX_SPEECH_SECS = 30 to be the same as the Whisper model's context window.

  3. Copy over the relevant Silero VAD model options used in the faster-whisper code,
    e.g. from here
    https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/vad.py#L14
    to here:

    vad_iterator = VADIterator(

    I chose not to specify all of the VADIterator parameters in order to keep the code simpler for our standalone demo script. I think the most significant difference is that my current script uses min_silence_duration_ms=300 rather than the default value of 2000 milliseconds used in faster-whisper's vad.py. In our demo testing I've seen that the lower value of 300 milliseconds is more responsive for microphone-captured speech, with more frequent detection of speech pauses and more frequent commits to the cache. Interested people should try experimenting with this value and the other available parameters of the silero-vad VADIterator class (a sketch follows this list):
    https://github.com/snakers4/silero-vad/blob/e531cd3462189f275e0a231f0cefb21816147ed2/src/silero_vad/utils_vad.py#L394
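
As a concrete illustration of item 3, here is a minimal sketch. It assumes the silero-vad pip package, and the parameter values shown mirror faster-whisper's VAD defaults rather than the demo's current settings, so treat them as a starting point for experimentation rather than a recommendation.

from silero_vad import load_silero_vad, VADIterator

vad_model = load_silero_vad(onnx=True)

# Values below follow faster-whisper's VAD defaults for comparison;
# live_captions.py currently uses min_silence_duration_ms=300 instead.
vad_iterator = VADIterator(
    vad_model,
    sampling_rate=16000,
    threshold=0.5,
    min_silence_duration_ms=2000,
    speech_pad_ms=400,
)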

I hope this information helps and thank you for your comments.

@sleepingcat4

@guynich what do you mean by chunking (in this context)?

Even if I consider 10 s segment transcriptions for both Moonshine and faster-whisper, the faster-whisper tiny int8 model does a better job than Moonshine tiny.

Also, in my experiments Moonshine does worse if I don't use the percussive component and instead feed the original audio file. Even when the harmonic component is used (i.e. background noise is removed), Moonshine's transcriptions are slightly worse than Whisper tiny int8 (beam_size=5).

So while I am interested in the Moonshine model, it doesn't make sense to use it for production or for a dataset-generation pipeline. (I didn't use ONNX yet.)

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4

> @guynich what do you mean by chunking (in this context)?

By chunking I mean speech segmentation as done in the faster-whisper transcribe() method here:
https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/transcribe.py#L437
I believe this means that even shorter clips, like 10 seconds, may be divided into smaller segments depending on the talker's pauses.

Thanks for your comments.

@guynich
Contributor

guynich commented Oct 27, 2024

I've merged the live_captions.py demo script today with help from @keveman and @evmaki. I'm removing the redundant branch guy/live_captions now that this work is merged.

@curiositry

Thanks for this @guynich!

Do I need to do anything to adjust my system sample rate (48kHz) for this to work?

I'm getting:

(env_moonshine) $ python3 moonshine/moonshine/demo/live_captions.py
Loading Moonshine model 'moonshine/base' (using ONNX runtime) ...
Expression 'paInvalidSampleRate' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2048
Expression 'PaAlsaStreamComponent_InitialConfigure( &self->capture, inParams, self->primeBuffers, hwParamsCapture, &realSr )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2718
Expression 'PaAlsaStream_Configure( stream, inputParameters, outputParameters, sampleRate, framesPerBuffer, &inputLatency, &outputLatency, &hostBufferSizeMode )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2842
Traceback (most recent call last):
  File "/path/to/moonshine/moonshine/moonshine/demo/live_captions.py", line 126, in <module>
    stream = InputStream(
             ^^^^^^^^^^^^
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 1440, in __init__
    _StreamBase.__init__(self, kind='input', wrap_callback='array',
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 909, in __init__
    _check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 2796, in _check
    raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening InputStream: Invalid sample rate [PaErrorCode -9997]

Thanks!

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
It’s possible your hardware does not support a native sampling rate of 16000 Hz, or the drivers installed in your setup have some issue. Here are several suggestions you could try.

  1. Check that your Ubuntu installation has the latest PortAudio driver. See our demo/README file for more information.

  2. Check the file /etc/pulse/daemon.conf (e.g. with sudo nano /etc/pulse/daemon.conf) and see whether default-sample-rate has been set to 48000. If so, you could try 16000, though this might affect other applications on your system.

  3. A software workaround is to change the live_captions.py script to accept samples at a 48000 Hz rate and then downsample them to 16000 Hz in the main loop (a combined sketch follows the steps below).

a) Change samplerate=(SAMPLING_RATE * 3), here:

samplerate=SAMPLING_RATE,

b) Change blocksize=(CHUNK_SIZE * 3), here:

blocksize=CHUNK_SIZE,

c) Add a line to downsample the chunk samples from 48000 Hz to 16000 Hz using numpy index slicing, chunk = (chunk[0::3] + chunk[1::3] + chunk[2::3]) / 3, just before the line speech = np.concatenate((speech, chunk)):
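
Putting a), b), and c) together, here is a self-contained sketch of the idea. It is an illustration under assumptions (the CHUNK_SIZE value, a minimal callback, and a bounded loop are stand-ins), not the exact live_captions.py code.

import queue

import numpy as np
from sounddevice import InputStream

SAMPLING_RATE = 16000   # rate expected by the Moonshine models
CHUNK_SIZE = 512        # samples per 16 kHz chunk (illustrative value)
RATIO = 3               # 48000 Hz capture rate / 16000 Hz target rate

q = queue.Queue()

def callback(data, frames, time_info, status):
    # Minimal input callback: copy the samples and hand them to the main loop.
    q.put((data.copy().flatten(), status))

stream = InputStream(
    samplerate=SAMPLING_RATE * RATIO,   # a) capture at the hardware's native 48 kHz
    blocksize=CHUNK_SIZE * RATIO,       # b) keep the chunk duration unchanged
    channels=1,
    dtype=np.float32,
    callback=callback,
)

speech = np.empty(0, dtype=np.float32)
with stream:
    for _ in range(100):   # bounded loop so the sketch terminates
        chunk, status = q.get()
        # c) average groups of three samples to downsample 48 kHz -> 16 kHz
        chunk = (chunk[0::3] + chunk[1::3] + chunk[2::3]) / 3
        speech = np.concatenate((speech, chunk))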

@curiositry

curiositry commented Oct 28, 2024

@guynich I am so grateful for the speedy and helpful reply! That worked like a charm.

I have the latest version of portaudio19-dev (19.6.0-1.2build3), and /etc/pulse/daemon.conf doesn't exist (so I didn't create it), but the changes you suggested did the trick.

With RATIO = 3 I'm getting a lot of input overflow messages, which I'll look into tomorrow (it's probably just a dumb mistake or failure to RTFM on my part).

@HemanthSai7

HemanthSai7 commented Oct 28, 2024

With modifications, is it possible to use the encoder in a transducer fashion?

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
See input overflow documentation:
https://python-sounddevice.readthedocs.io/en/0.3.15/api/misc.html#sounddevice.CallbackFlags.input_overflow
The callback in live_captions.py is already minimal. The only thing you could move outside of the callback is the flatten() call.

q.put((data.copy().flatten(), status))

You could try removing the flatten() in the callback and adding a new line chunk = chunk.flatten() before your downsampling code line. If that doesn’t work, I’m thinking your hardware may not be suitable for the sounddevice package’s stream methods. Hope it works for you.
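
For example, here is a minimal sketch of that change, with the main-loop side simulated by a dummy buffer so it runs standalone; the names are illustrative rather than the exact demo code.

import queue

import numpy as np

q = queue.Queue()

def callback(data, frames, time_info, status):
    # Keep the audio callback as light as possible: copy and enqueue only.
    q.put((data.copy(), status))

# Simulate one callback invocation with a (frames, channels)-shaped buffer.
callback(np.zeros((1536, 1), dtype=np.float32), 1536, None, None)

# In the main loop, flatten after pulling the chunk off the queue.
chunk, status = q.get()
chunk = chunk.flatten()   # moved here from the callback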

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
Another thing you can try, though it’s doubtful it will help, is to check the latency field of the stream. Print stream.latency after stream.start, then try some different values as a new parameter, e.g. latency=<float_value_seconds>, in the instantiation of InputStream here:

stream = InputStream(
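
For instance, a hedged sketch of passing an explicit latency value and reading back what PortAudio chose; SAMPLING_RATE, CHUNK_SIZE, and the callback here are stand-ins for the demo's own.

import queue

import numpy as np
from sounddevice import InputStream

SAMPLING_RATE = 16000
CHUNK_SIZE = 512   # illustrative value

q = queue.Queue()

def callback(data, frames, time_info, status):
    q.put((data.copy().flatten(), status))

stream = InputStream(
    samplerate=SAMPLING_RATE,
    blocksize=CHUNK_SIZE,
    channels=1,
    dtype=np.float32,
    callback=callback,
    latency=0.2,           # seconds; try a few different values here
)
stream.start()
print(stream.latency)      # the input latency PortAudio actually negotiated
stream.stop()
stream.close()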

@curiositry

curiositry commented Nov 7, 2024

@guynich I tried all those approaches and more on my development machine (Ryzen 7 processor, plenty of RAM, Nvidia GPU, latest version of Linux Mint, PipeWire audio), and still got input overflow errors (interspersed with failures to find the audio device). Whisper.cpp's stream works fine, so my audio works in general for similar tasks, just not with Python sounddevice.

However, the live_captions.py script works great on the RasPi 5, with no audio config or resampling needed. And that's my target device, so getting it running on my laptop isn't crucial.

Thanks for all your help! I'm sure I'll have more dumb questions soon :)

@guynich
Contributor

guynich commented Nov 7, 2024

@curiositry
Thanks for sharing. We too had great results on the Raspberry Pi 5 running the live captions demo with the Moonshine ONNX implementations. In our setup the RPi 5 runs Debian OS with the latest PortAudio driver. We’re working on other examples with Moonshine and plan to share more. I agree there can be issues working with the Linux audio stack. Glad to hear your target platform is working with Moonshine!

@evmaki
Contributor

evmaki commented Nov 27, 2024

Closing this issue as the feature request has been implemented and the discussion seems to have concluded. Please open a discussion thread if you'd like to discuss this further!

@evmaki evmaki closed this as completed Nov 27, 2024