I've been using WhisperX for diarization + ASR. An issue I keep running into is that the segmentation model (pyannote/segmentation) performs worse than the diarization model (pyannote/diarization): some segments end up missing or misplaced (e.g. a stretch that should have been split into two segments isn't).
The diarization model also produces segments, and I've found those to be much more accurate for diarizing quick back-and-forth dialogue ("Right", "Umm", etc.). My guess is that the segmentation model was trained on more segmentation data, which is why the authors chose it. But for my use case it seems that relying only on the diarization model would improve results, and that would mean WhisperX isn't needed at all; I should just run diarization -> whisper directly, right?
Curious if anyone else has had a similar experience.
update: I used pyannote 3.1 to split the audio into speaker segments, then ran faster-whisper on the chunks; on public-domain data this gave noticeably better results.
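For anyone wanting to try the same thing, here is a minimal sketch of that "diarize first, then transcribe each chunk" flow. The `slice_turns` helper is something I made up for illustration; the pyannote and faster-whisper calls are shown as comments since they need a model download and (for pyannote) a Hugging Face token.

```python
import numpy as np

def slice_turns(audio: np.ndarray, sample_rate: int, turns):
    """Cut a mono waveform into per-speaker chunks.

    `turns` is a list of (start_sec, end_sec, speaker) tuples, e.g. built
    from the output of pyannote's diarization pipeline via itertracks().
    """
    chunks = []
    for start, end, speaker in turns:
        lo = int(start * sample_rate)
        hi = int(end * sample_rate)
        chunks.append((speaker, audio[lo:hi]))
    return chunks

# Diarization with pyannote.audio 3.1 (needs a Hugging Face token):
#   from pyannote.audio import Pipeline
#   pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
#                                       use_auth_token="HF_TOKEN")
#   diarization = pipeline("audio.wav")
#   turns = [(t.start, t.end, spk)
#            for t, _, spk in diarization.itertracks(yield_label=True)]
#
# Then transcribe each chunk with faster-whisper (transcribe() accepts a
# numpy waveform as well as a file path):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3")
#   for speaker, chunk in slice_turns(audio, 16000, turns):
#       segments, _ = model.transcribe(chunk)
#       print(speaker, " ".join(s.text for s in segments))

if __name__ == "__main__":
    # Smoke-test the slicing on 10 s of silence at 16 kHz.
    sr = 16000
    audio = np.zeros(sr * 10, dtype=np.float32)
    turns = [(0.5, 2.0, "SPEAKER_00"), (2.0, 3.5, "SPEAKER_01")]
    print([(spk, len(c)) for spk, c in slice_turns(audio, sr, turns)])
```

One caveat with this approach: very short chunks give Whisper little acoustic context, so merging adjacent turns from the same speaker before transcribing can help.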
It's possible. I guess the main advantage of segmentation models is that they're fast. That said, the performance drop could be due to a bug or something; I feel like VAD is a pretty well-solved problem, so it shouldn't produce errors like that.