Voice dataset: changing approach

I now see that I used all the right tools in the wrong order. Time for the new order to emerge.

This page analyzes what has been done until now: why it was wrong, and what the better way is.

What was wrong?

Download all of the audio
Generate subs with whisper
Slice to smaller clips using timestamps made by whisper
Split vocal from accompaniment
Split the whisper-sliced vocal clips into final phrases
Filter out poor quality
Compile stats and publish

This way we carry bad audio with us until the very end:
silence, the voices of NPCs' voice actors, background music.
It takes up space, it takes time, it takes my electricity, and at the same time it is completely useless.

New order

Download all of the audio
Slice everything into 8-minute clips
Split vocal from accompaniment
Define (and split) phrases inside vocal clips (VAD)
Filter out poor quality
Generate subs with whisper
Compile stats and publish
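
As a rough illustration of the second step, here is a minimal sketch of how the 8-minute slice boundaries could be computed before handing them to whatever tool does the actual cutting. The function name clip_boundaries and its parameters are hypothetical, and the last clip is simply allowed to be shorter than 8 minutes, which is an assumption rather than something decided above.

```python
def clip_boundaries(total_seconds, clip_seconds=480.0):
    """Return (start, end) pairs in seconds that cover the whole recording.

    clip_seconds defaults to 480 s (8 minutes). The final clip may be
    shorter, because the remainder is kept instead of being dropped.
    """
    if total_seconds <= 0 or clip_seconds <= 0:
        return []
    bounds = []
    start = 0.0
    while start < total_seconds:
        # Never run past the end of the recording.
        end = min(start + clip_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # Hypothetical 1000-second recording.
    for start, end in clip_boundaries(1000.0):
        print(f"{start:7.1f} -> {end:7.1f}")
```

Fixed-length slicing needs no transcription at all, which is the point of the new order: whisper only runs at the end, on clips that survived filtering.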

But what defines a phrase?

Voice activity detection - implementation
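
To make the idea concrete, here is a minimal sketch of one possible approach, assuming a plain energy threshold stands in for a real VAD model. Under that assumption a phrase is a run of loud frames that ends once enough consecutive quiet frames follow it. All names (frame_rms, detect_phrases) and all default values (30 ms frames, 0.02 threshold, 300 ms of silence) are hypothetical and would need tuning. Samples are assumed to be floats in the range -1.0 to 1.0, taken from the already separated vocal track.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame of frame_len samples."""
    rms = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def detect_phrases(samples, sample_rate, frame_ms=30,
                   threshold=0.02, min_silence_ms=300):
    """Return (start_sec, end_sec) spans of voiced audio.

    A frame counts as voiced when its RMS is at or above threshold.
    A phrase is closed after min_silence_ms of consecutive quiet frames.
    The defaults are illustrative assumptions, not tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    min_silence_frames = max(1, min_silence_ms // frame_ms)

    phrases = []
    start = None        # first voiced frame of the phrase being built
    last_voiced = None  # most recent voiced frame in that phrase
    silence = 0         # consecutive quiet frames since last_voiced
    for idx, energy in enumerate(frame_rms(samples, frame_len)):
        if energy >= threshold:
            if start is None:
                start = idx
            last_voiced = idx
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                phrases.append((start, last_voiced + 1))
                start = None
                silence = 0
    if start is not None:
        # The audio ended while a phrase was still open.
        phrases.append((start, last_voiced + 1))

    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in phrases]
```

Pauses shorter than min_silence_ms stay inside a phrase, so that one parameter effectively answers "what defines a phrase" in this sketch. A dedicated VAD library would replace the energy test but could keep the same grouping logic.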