I've been using WhisperX for diarization + ASR. An issue I keep running into is that the segmentation model (pyannote/segmentation) performs worse than the diarization model (pyannote/diarization): some segments end up missing or misplaced (e.g. a stretch that should have been split into two segments isn't).
The diarization model also produces segments, and I've found those to be much more accurate for diarizing quick back-and-forth dialogue ("Right", "Umm", etc.). My guess is that the segmentation model was trained on more segmentation data, which is why the authors chose it. But for my use case it seems that relying only on the diarization model would improve results, and that would mean WhisperX isn't needed at all; I should just run diarization -> whisper directly, right?
Curious if anyone else has had a similar experience.
update: I used pyannote 3.1 to split the audio into speaker segments, then ran faster-whisper on the chunks; on public-domain data this gave noticeably better results.
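For anyone wanting to try the same thing, here is a minimal sketch of that "diarize first, then transcribe each chunk" flow. The `slice_turns` helper is something I made up for illustration; the pyannote and faster-whisper calls are shown as comments since they need a model download and (for pyannote) a Hugging Face token.

```python
import numpy as np

def slice_turns(audio: np.ndarray, sample_rate: int, turns):
    """Cut a mono waveform into per-speaker chunks.

    `turns` is a list of (start_sec, end_sec, speaker) tuples, e.g. built
    from the output of pyannote's diarization pipeline via itertracks().
    """
    chunks = []
    for start, end, speaker in turns:
        lo = int(start * sample_rate)
        hi = int(end * sample_rate)
        chunks.append((speaker, audio[lo:hi]))
    return chunks

# Diarization with pyannote.audio 3.1 (needs a Hugging Face token):
#   from pyannote.audio import Pipeline
#   pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
#                                       use_auth_token="HF_TOKEN")
#   diarization = pipeline("audio.wav")
#   turns = [(t.start, t.end, spk)
#            for t, _, spk in diarization.itertracks(yield_label=True)]
#
# Then transcribe each chunk with faster-whisper (transcribe() accepts a
# numpy waveform as well as a file path):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v3")
#   for speaker, chunk in slice_turns(audio, 16000, turns):
#       segments, _ = model.transcribe(chunk)
#       print(speaker, " ".join(s.text for s in segments))

if __name__ == "__main__":
    # Smoke-test the slicing on 10 s of silence at 16 kHz.
    sr = 16000
    audio = np.zeros(sr * 10, dtype=np.float32)
    turns = [(0.5, 2.0, "SPEAKER_00"), (2.0, 3.5, "SPEAKER_01")]
    print([(spk, len(c)) for spk, c in slice_turns(audio, sr, turns)])
```

One caveat with this approach: very short chunks give Whisper little acoustic context, so merging adjacent turns from the same speaker before transcribing can help.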
It's possible. I guess the main advantage of segmentation models is that they're fast. That said, the performance drop could be due to a bug or something; I feel like VAD is a pretty well-solved problem, so it shouldn't produce errors like that.