
How to achieve live transcription #19

Closed
csukuangfj opened this issue Oct 22, 2024 · 27 comments

Labels
enhancement New feature or request

Comments

@csukuangfj

The title of the paper https://arxiv.org/pdf/2410.15608 is

Moonshine: Speech Recognition for Live Transcription and Voice Commands

However, the model is a non-streaming model. Could you describe how to achieve live transcription?

The demo in this repo only decodes files; it would be nice if you could provide a demo for live transcription.

@info-wordcab

Definitely would be interesting to see a live transcription demo!

@evmaki
Contributor

evmaki commented Oct 22, 2024

Thanks for the request! We are working on a live transcription demo that we will add to the repo soon.

@evmaki evmaki self-assigned this Oct 22, 2024
@andimarafioti
Contributor

I know it's not exactly what you asked for, but I added the model to Hugging Face's speech-to-speech library, so it does live translation.

@csukuangfj csukuangfj changed the title How to acheive live transcription How to achieve live transcription Oct 22, 2024
@evmaki evmaki added the enhancement New feature or request label Oct 23, 2024
@evmaki evmaki removed their assignment Oct 23, 2024
@sleepingcat4

@evmaki I tried to run a benchmark and the results were very disappointing. I used the following setup:

  1. JAX backend
  2. 10 s splits
  3. the percussive component is used for transcription
  4. librosa to remove silence (trim function)
  5. the tiny variant of Moonshine

In the benchmark, only Moonshine was loaded and used to transcribe the audio segments; the rest were preprocessing steps. The result was 48 minutes to transcribe a video that is 1 hour 28 minutes long, on CPU.

I ran a benchmark using the faster-whisper tiny model on CPU with the following setup:

  1. CPU backend
  2. int8
  3. same preprocessing steps
  4. 10 s segments fed one at a time
  5. same function as for Moonshine, with only the model changed

It took under 8 minutes to transcribe that same 1 hour 28 minutes audio file. I even liked faster-whisper's tiny model transcription quality better.

Given these results, how are you promising live transcription?

@evmaki
Contributor

evmaki commented Oct 25, 2024

@sleepingcat4 thanks for benchmarking and sharing your results. The Keras implementation currently has some speed issues, which is what's causing this. We've added ONNX models that run much faster. I encourage people to try those out.

Re: live transcriptions. We'll soon be merging a demo (using the ONNX models) that shows live captioning in action. The branch is already public if you want to check it out. Demo script is located here.

PS: What CPU are you running on? And are you willing to share your script so we can reproduce your results?

@csukuangfj
Author

@evmaki

Thank you for sharing the demo!

By the way, for the following two lines:

speech = np.concatenate((speech, chunk))

print_captions(transcribe(speech))

they result in redundant computation, which is not efficient.
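
For illustration, here is a toy sketch (not from the demo) of how that cost adds up: because the whole accumulated buffer is passed to transcribe() after every new chunk, total compute grows roughly quadratically with utterance length. The chunk and utterance durations below are made-up values.

CHUNK_SECS = 0.5        # hypothetical chunk duration delivered by the audio callback
UTTERANCE_SECS = 10.0   # hypothetical utterance length before the VAD commits the cache

total_transcribed = 0.0
buffered = 0.0
while buffered < UTTERANCE_SECS:
    buffered += CHUNK_SECS
    total_transcribed += buffered   # the whole buffer is re-transcribed each time

print(f"{UTTERANCE_SECS:.0f} s of speech -> {total_transcribed:.0f} s of audio run through the model")
# Prints roughly 105 s for a 10 s utterance: with N chunks the model processes
# about (N + 1) / 2 times more audio than was actually captured.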

Is there a plan to release a streaming model?

@sleepingcat4

@evmaki below is the function I used to transcribe 10 s segments one at a time.

import os
import re
import time

import moonshine

def transcribe_folder(audio_folder, output_file, model='moonshine/tiny'):
    initial_time = time.time()

    transcription = ""
    audio_files = sorted(
        [f for f in os.listdir(audio_folder) if f.endswith('.wav')],
        key=lambda x: int(re.search(r'part(\d+)\.wav', x).group(1))
    )

    for i, file_name in enumerate(audio_files):
        file_path = os.path.join(audio_folder, file_name)
        transcript = moonshine.transcribe(file_path, model)
        transcript = " ".join(transcript)

        seg_st_time = i * 10
        seg_en_time = seg_st_time + 10

        start_h = seg_st_time // 3600
        start_m = (seg_st_time % 3600) // 60
        start_s = seg_st_time % 60

        end_h = seg_en_time // 3600
        end_m = (seg_en_time % 3600) // 60
        end_s = seg_en_time % 60

        transcription += f"{start_h:02}:{start_m:02}:{start_s:02}-{end_h:02}:{end_m:02}:{end_s:02}s: {transcript}\n"

    with open(output_file, 'w') as f:
        f.write(transcription)

    end_time = time.time()
    execution_time = (end_time - initial_time) / 60
    print(f"Execution Time: {execution_time:.2f} minutes")
    print(f"Transcription saved to: {output_file}")

transcribe_folder("Yj7ZDcHGtK", output_file="bench_script.txt")

@csukuangfj
Author

> I tried to run a benchmark and the results were very disappointing. […] Given these results, how are you promising live transcription?

@sleepingcat4

I suggest that you test it again with sherpa-onnx, which supports Moonshine models.

The following Colab notebook guides you step by step through how to do that:

https://github.com/k2-fsa/colab/blob/master/sherpa-onnx/RTF_comparison_betwen_whisper_and_moonshine.ipynb

@keveman
Contributor

keveman commented Oct 27, 2024

Thanks so much @csukuangfj for providing the notebook and for the comparison with Whisper. Just to summarize: Moonshine tiny can transcribe 335.2 seconds of audio in 19.6 seconds, whereas Whisper tiny.en needs 64.1 seconds. That's a 3.3x speedup, and the transcription quality looks identical. @sleepingcat4 I hope this helps alleviate some of your disappointment. The provided Keras implementation is indeed non-optimal; we released it as a reference with future enhancements in mind. Most deployments we had in mind (such as on SBCs) would never have Torch, TF, JAX, or other such large frameworks installed. ONNX, TFLite, or other homegrown runtimes are better suited for seeing the benefits of Moonshine.

@sleepingcat4

@csukuangfj thanks for the notebook. I was writing an audio transcription pipeline, so I needed benchmarks on long audio before getting started. The results in your notebook look interesting (my benchmark avoided the ONNX models), but I still don't think Moonshine is a good fit for building my dataset or pipeline.

@keveman Moonshine's transcription quality is on par with the OpenAI Whisper model, but it is not quite at the same level as the faster-whisper tiny int8 model. Faster-whisper can capture more nuanced information. If Moonshine can match the speed of tiny Whisper under the faster-whisper library, I would definitely integrate it into my pipeline.

But I will look into the ONNX models and a few other models as well, because at LAION AI we are planning to maintain an open-source repo to share our pipeline, benchmark models, and share the results with everyone. Meanwhile, thanks a lot for the responses and help.

@keveman
Contributor

keveman commented Oct 27, 2024

@sleepingcat4 faster-whisper and OpenAI Whisper are the same underlying models, so I am not sure what you mean by getting better quality with faster-whisper. In any case, we do have a PR out for running Moonshine with CTranslate2, which is what faster-whisper is based on.

@sleepingcat4

@keveman Yes, the models are the same, but faster-whisper allows int8 even for tiny Whisper, and I think CTranslate2 is doing the rest of the speed-up. I meant that Whisper tiny does a better transcription job than Moonshine. I went through both transcriptions for the benchmark I did, and Whisper tiny int8 was able to capture the last one or two lines that Moonshine didn't.

And for me those lines are important. I am also going to run a test with SenseVoice; maybe that can beat Whisper. But Moonshine right now is third on my list.

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4 thank you for checking the branch code! I'm authoring the live captions script. Comparing this script (which in its current form provides a console user interface that emulates streaming-style live captions with very frequent console refreshes) with the faster-whisper speech chunking method requires script changes. To my knowledge they both use the silero-vad iterator class. I'll come back to this issue thread with a summary of suggestions after checking.

  • edited: replaced "GUI" with "console user interface".

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4
The faster-whisper transcriber includes methods to transcribe segments of speech into text. It uses code adapted from silero-vad, see:
https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/vad.py#L13C1-L13C73.
I was mistaken in thinking faster-whisper uses the VADIterator class, though I do see similarities in faster-whisper's adapted code. The Moonshine repo does not currently have an equivalent speech segmentation method.

One suggestion for directly comparing the faster-whisper and Moonshine models on audio file examples is to adapt the faster-whisper code for use with Moonshine models, e.g. adapt code from
https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py.
We don't currently have a plan for this in our repo.

For our open-source live_captions.py demo script I chose to use the native silero-vad VADIterator class for speech chunking. This keeps the code simpler for our standalone demo script and uses the same VAD model as faster-whisper.

Here are suggestions for adapting the live_captions.py demo script to be more similar to faster-whisper's speech segmentation.

  1. Disable the refresh model inferences by commenting out this line and its enclosing if statement:

    print_captions(transcribe(speech))

    This blocks most of the streaming-style caption updates to the console (which we currently do want in our demo) and thus saves computation, as already noted by @csukuangfj above, e.g.:
    How to achieve live transcription #19 (comment)

  2. Set MAX_SPEECH_SECS = 30 to be the same as the Whisper model's context window.

  3. Copy over the relevant Silero VAD model options used in the faster-whisper code,
    e.g. from here
    https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/vad.py#L14
    to here:

    vad_iterator = VADIterator(

    I chose not to specify all of the VADIterator parameters in order to keep the code simpler for our standalone demo script. I think the most significant difference is that my current script uses min_silence_duration_ms=300 rather than the default value of 2000 milliseconds used in faster-whisper's vad.py. In our demo testing I've seen that the lower value of 300 milliseconds is more responsive for microphone-captured speech, with more frequent detection of speech pauses and more frequent commits to the cache. Interested people should try experimenting with this value and the other available parameters of the silero-vad VADIterator class (a sketch follows this list):
    https://github.com/snakers4/silero-vad/blob/e531cd3462189f275e0a231f0cefb21816147ed2/src/silero_vad/utils_vad.py#L394
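
As a concrete illustration of item 3, here is a minimal sketch. It assumes the silero-vad pip package, and the parameter values shown mirror faster-whisper's VAD defaults rather than the demo's current settings, so treat them as a starting point for experimentation rather than a recommendation.

from silero_vad import load_silero_vad, VADIterator

vad_model = load_silero_vad(onnx=True)

# Values below follow faster-whisper's VAD defaults for comparison;
# live_captions.py currently uses min_silence_duration_ms=300 instead.
vad_iterator = VADIterator(
    vad_model,
    sampling_rate=16000,
    threshold=0.5,
    min_silence_duration_ms=2000,
    speech_pad_ms=400,
)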

I hope this information helps and thank you for your comments.

@sleepingcat4

@guynich what do you mean by chunking (in this context)?

Even if I consider 10 s segment transcriptions for both Moonshine and faster-whisper, the faster-whisper tiny int8 model does a better job than Moonshine tiny.

Also, in my experiments Moonshine does worse if I don't use the percussive component and instead feed the original audio file. Even when the harmonic component is used (i.e. background noise is removed), Moonshine's transcriptions are slightly worse than Whisper tiny int8 (beam_size=5).

So while I am interested in the Moonshine model, it doesn't make sense to use it for production or for a dataset-generation pipeline. (I didn't use ONNX yet.)

@guynich
Contributor

guynich commented Oct 27, 2024

@sleepingcat4

> @guynich what do you mean by chunking (in this context)?

By chunking I mean speech segmentation as done in the faster-whisper transcribe() method here:
https://github.com/SYSTRAN/faster-whisper/blob/c2a1da1bd94e002c38487c91c2f6b50a048000cf/faster_whisper/transcribe.py#L437
I believe this means that even shorter clips, like 10 seconds, may be divided into smaller segments depending on the talker's pauses.

Thanks for your comments.

@guynich
Contributor

guynich commented Oct 27, 2024

I've merged the live_captions.py demo script today with help from @keveman and @evmaki. I'm removing the redundant branch guy/live_captions now that this work is merged.

@curiositry

Thanks for this @guynich!

Do I need to do anything to adjust my system sample rate (48kHz) for this to work?

I'm getting:

(env_moonshine) $ python3 moonshine/moonshine/demo/live_captions.py
Loading Moonshine model 'moonshine/base' (using ONNX runtime) ...
Expression 'paInvalidSampleRate' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2048
Expression 'PaAlsaStreamComponent_InitialConfigure( &self->capture, inParams, self->primeBuffers, hwParamsCapture, &realSr )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2718
Expression 'PaAlsaStream_Configure( stream, inputParameters, outputParameters, sampleRate, framesPerBuffer, &inputLatency, &outputLatency, &hostBufferSizeMode )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 2842
Traceback (most recent call last):
  File "/path/to/moonshine/moonshine/moonshine/demo/live_captions.py", line 126, in <module>
    stream = InputStream(
             ^^^^^^^^^^^^
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 1440, in __init__
    _StreamBase.__init__(self, kind='input', wrap_callback='array',
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 909, in __init__
    _check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
  File "/path/to/moonshine/env_moonshine/lib/python3.12/site-packages/sounddevice.py", line 2796, in _check
    raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening InputStream: Invalid sample rate [PaErrorCode -9997]

Thanks!

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
It’s possible your hardware does not support a native sampling rate of 16000 Hz, or the drivers installed in your setup have some issue. Here are several suggestions you could try.

  1. Check that your Ubuntu installation has the latest PortAudio driver. See our demo/README file for more information.

  2. Check the file /etc/pulse/daemon.conf (e.g. with sudo nano /etc/pulse/daemon.conf) and see whether default-sample-rate has been set to 48000. If so, you could try 16000, though this might affect other applications on your system.

  3. A software workaround is to change the live_captions.py script to accept samples at a 48000 Hz rate and then downsample them to 16000 Hz in the main loop (a combined sketch follows the steps below).

a) Change samplerate=(SAMPLING_RATE * 3), here:

samplerate=SAMPLING_RATE,

b) Change blocksize=(CHUNK_SIZE * 3), here:

blocksize=CHUNK_SIZE,

c) Add a line to downsample the chunk samples from 48000 Hz to 16000 Hz using numpy index slicing, chunk = (chunk[0::3] + chunk[1::3] + chunk[2::3]) / 3, just before the line speech = np.concatenate((speech, chunk)):
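
Putting a), b), and c) together, here is a self-contained sketch of the idea. It is an illustration under assumptions (the CHUNK_SIZE value, a minimal callback, and a bounded loop are stand-ins), not the exact live_captions.py code.

import queue

import numpy as np
from sounddevice import InputStream

SAMPLING_RATE = 16000   # rate expected by the Moonshine models
CHUNK_SIZE = 512        # samples per 16 kHz chunk (illustrative value)
RATIO = 3               # 48000 Hz capture rate / 16000 Hz target rate

q = queue.Queue()

def callback(data, frames, time_info, status):
    # Minimal input callback: copy the samples and hand them to the main loop.
    q.put((data.copy().flatten(), status))

stream = InputStream(
    samplerate=SAMPLING_RATE * RATIO,   # a) capture at the hardware's native 48 kHz
    blocksize=CHUNK_SIZE * RATIO,       # b) keep the chunk duration unchanged
    channels=1,
    dtype=np.float32,
    callback=callback,
)

speech = np.empty(0, dtype=np.float32)
with stream:
    for _ in range(100):   # bounded loop so the sketch terminates
        chunk, status = q.get()
        # c) average groups of three samples to downsample 48 kHz -> 16 kHz
        chunk = (chunk[0::3] + chunk[1::3] + chunk[2::3]) / 3
        speech = np.concatenate((speech, chunk))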

@curiositry

curiositry commented Oct 28, 2024

@guynich I am so grateful for the speedy and helpful reply! That worked like a charm.

I have the latest version of portaudio19-dev (19.6.0-1.2build3), and /etc/pulse/daemon.conf doesn't exist (so I didn't create it), but the changes you suggested did the trick.

With RATIO = 3 I'm getting a lot of input overflow messages, which I'll look into tomorrow (it's probably just a dumb mistake or failure to RTFM on my part).

@HemanthSai7

HemanthSai7 commented Oct 28, 2024

With modifications, is it possible to use the encoder in a transducer fashion?

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
See input overflow documentation:
https://python-sounddevice.readthedocs.io/en/0.3.15/api/misc.html#sounddevice.CallbackFlags.input_overflow
The callback in live_captions.py is already minimal. The only thing you could move outside of the callback is the flatten() call.

q.put((data.copy().flatten(), status))

You could try removing the flatten() in the callback and adding a new line chunk = chunk.flatten() before your downsampling code line. If that doesn’t work, I’m thinking your hardware may not be suitable for the sounddevice package’s stream methods. Hope it works for you.
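
For example, here is a minimal sketch of that change, with the main-loop side simulated by a dummy buffer so it runs standalone; the names are illustrative rather than the exact demo code.

import queue

import numpy as np

q = queue.Queue()

def callback(data, frames, time_info, status):
    # Keep the audio callback as light as possible: copy and enqueue only.
    q.put((data.copy(), status))

# Simulate one callback invocation with a (frames, channels)-shaped buffer.
callback(np.zeros((1536, 1), dtype=np.float32), 1536, None, None)

# In the main loop, flatten after pulling the chunk off the queue.
chunk, status = q.get()
chunk = chunk.flatten()   # moved here from the callback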

@guynich
Contributor

guynich commented Oct 28, 2024

@curiositry
Another thing you can try, though it’s doubtful it will help, is to check the latency field of the stream. Print stream.latency after stream.start, then try some different values as a new parameter, e.g. latency=<float_value_seconds>, in the instantiation of InputStream here:

stream = InputStream(
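
For instance, a hedged sketch of passing an explicit latency value and reading back what PortAudio chose; SAMPLING_RATE, CHUNK_SIZE, and the callback here are stand-ins for the demo's own.

import queue

import numpy as np
from sounddevice import InputStream

SAMPLING_RATE = 16000
CHUNK_SIZE = 512   # illustrative value

q = queue.Queue()

def callback(data, frames, time_info, status):
    q.put((data.copy().flatten(), status))

stream = InputStream(
    samplerate=SAMPLING_RATE,
    blocksize=CHUNK_SIZE,
    channels=1,
    dtype=np.float32,
    callback=callback,
    latency=0.2,           # seconds; try a few different values here
)
stream.start()
print(stream.latency)      # the input latency PortAudio actually negotiated
stream.stop()
stream.close()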

@curiositry

curiositry commented Nov 7, 2024

@guynich I tried all those approaches and more on my development machine (Ryzen 7 processor, plenty of RAM, Nvidia GPU, latest version of Linux Mint, PipeWire audio), and still got input overflow errors (interspersed with failures to find the audio device). Whisper.cpp's stream works fine, so my audio works in general for similar tasks, just not with Python sounddevice.

However, the live_captions.py script works great on the RasPi 5, with no audio config or resampling needed. And that's my target device, so getting it running on my laptop isn't crucial.

Thanks for all your help! I'm sure I'll have more dumb questions soon :)

@guynich
Contributor

guynich commented Nov 7, 2024

@curiositry
Thanks for sharing. We too had great results on the Raspberry Pi 5 running the live captions demo with the Moonshine ONNX implementations. In our setup the RPi 5 runs Debian OS with the latest PortAudio driver. We’re working on other examples with Moonshine and plan to share more. I agree there can be issues working with the Linux audio stack. Glad to hear your target platform is working with Moonshine!

@evmaki
Contributor

evmaki commented Nov 27, 2024

Closing this issue as the feature request has been implemented and the discussion seems to have concluded. Please open a discussion thread if you'd like to discuss this further!

@evmaki evmaki closed this as completed Nov 27, 2024