Voice dataset: vocal extraction

The recordings contain background sound and music, but I am only interested in her voice.

I spent a few long hours researching what to do.
Here are some links on the topic, if you want to "waste" your time as I did.

Spleeter is a solution

The "Bad" column was written before I learned to use Spleeter better. I am leaving this table for historical reasons.

Good	Bad
splits vocals and accompaniment from an audio file	converts output files to wav (correction) with sr=44 kHz
fast enough for me to use	output naming is weird: it creates a directory named after the original file and puts vocals.wav & accompaniment.wav in it (you can alter that)
no configuration required, works out of the box	no configuration possible; I would like to change how some things work
it opened my eyes to new problems	when converted to wav, 25G blew up to over 500G
???	I am not sure how it happened, but I screwed up the naming, and filenames stopped matching the proper text in *metadata.tsv*. It is not Spleeter's fault (should not be), but still.
flexible command line options to fix most of the bad	on my machine it struggles with audio longer than 8 minutes

How I run it

The claim that Spleeter forces you to have unlimited disk space by converting files to wav was exaggerated.
You can pass another codec to the -c flag to avoid death by asphyxiation.
Spleeter can take multiple input files, but the total command-line length is capped by the system's `ARG_MAX` limit.
I give it batches of 3000 files.
Example of use

spleeter separate -o output -c ogg -f {filename}_{instrument}.{codec} \
"8_minute_slices/Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion？ [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus_00.opus" \
...
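
The batching could be sketched in Go roughly as follows. This is only an illustration, not the code from the repo: `batches` and `spleeterArgs` are hypothetical helpers, and the glob pattern, the `output` directory and the batch size of 3000 are assumptions taken from the example command above. It splits the list of clips into chunks small enough to stay under the argument-length limit and starts one `spleeter separate` run per chunk.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// batches splits files into consecutive slices of at most size elements.
func batches(files []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(files); start += size {
		end := start + size
		if end > len(files) {
			end = len(files)
		}
		out = append(out, files[start:end])
	}
	return out
}

// spleeterArgs builds the same argument list as the example command,
// followed by the input files of one batch.
func spleeterArgs(outDir string, files []string) []string {
	args := []string{"separate", "-o", outDir, "-c", "ogg",
		"-f", "{filename}_{instrument}.{codec}"}
	return append(args, files...)
}

func main() {
	// Assumed input location; adjust to wherever the clips live.
	files, err := filepath.Glob("8_minute_slices/*.opus")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, batch := range batches(files, 3000) {
		cmd := exec.Command("spleeter", spleeterArgs("output", batch)...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "batch %d failed: %v\n", i, err)
			os.Exit(1)
		}
	}
}
```

Because `exec.Command` passes arguments directly, without a shell, the brackets, quotes and spaces in the file names need no escaping here.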

Demonstration of spleeter

original.ogg

accompaniment.wav

vocals.wav

Everything takes a lot of space! Even file names!

It is my understanding that ext4 (fat? | ntfs?) reserves space for names at directory creation (4KB) and extends that space if necessary. The point is: when there are over 400k file names like
Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion？ [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus
in the same directory, it causes simple commands like

ls | wc -l

to run for a few seconds, or even minutes if we are talking about my poor usb drive.

I am not yet clear on the solution, but most likely I will get some short hashing method and shorten all the names with it. For example

printf "Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion？ [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus" | xxhsum | cut -d ' ' -f1
b9205a9cdf641557

I changed my mind about xxhsum: in Go, xxHash needs an external library, while md5 is in the standard library. I am going with md5.

spleeter separate -o output -c ogg -f {filename}_{instrument}.{codec} \
8_minutes_slices/2f56da703d60103bf9cf5848e1b3415a_01.opus \
8_minutes_slices/da18f80180b3fac3eadba34ae21502b6_01.opus \
8_minutes_slices/0a67183fe5a554e8af52b9e243d441e8_02.opus \
8_minutes_slices/07ad09a301080ca18890f7525d43b56a_01.opus \
8_minutes_slices/aef5fc3c8ec7bc6134e2a2cc20ceeccd_02.opus \
...
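
A minimal sketch of how such md5-based names could be produced in Go, using only the standard library. `shortName` is a hypothetical helper, not the function from the repo; the `_01`, `_02` suffixes in the listing above are not part of the hash and would be appended separately.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// shortName returns the md5 digest of name as 32 lowercase hex characters.
// md5 is used here only to get short, stable names, not for security.
func shortName(name string) string {
	sum := md5.Sum([]byte(name))
	return hex.EncodeToString(sum[:])
}

func main() {
	long := "Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion？ [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus"
	fmt.Println(shortName(long) + ".opus")
}
```

Since the hash cannot be reversed, the mapping from the hashed name back to the original name has to be stored somewhere, otherwise the link to the text in metadata.tsv is lost.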

How I split all audio into 8-minute clips

The most relevant function looks like this:

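// cutOnEqualParts splits filepath into consecutive parts of length segment
// (an ffmpeg duration, e.g. "480" or "00:08:00"), written as
// outname_00.opus, outname_01.opus, and so on.
// The ffmpeg package is assumed to be github.com/u2takey/ffmpeg-go.
func cutOnEqualParts(filepath, outname, segment string) error {
	return ffmpeg.Input(filepath).
		Output(outname+"_%02d.opus",
			ffmpeg.KwArgs{
				"c":                "copy",    // copy streams without re-encoding
				"map":              0,         // take all streams of the first input
				"segment_time":     segment,   // target length of each part
				"f":                "segment", // use the segment muxer
				"reset_timestamps": 1,         // each part starts at timestamp 0
			}).
		OverWriteOutput().ErrorToStdOut().Run()
}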

Check the Repo for the rest.

A lot of clips have silence or non-Her-voice sound

This is different from background sound playing at the same time as Her voice. It can be solved by trimming or slicing.
A good example of solving the silence problem (VAD, voice activity detection)
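
To illustrate the idea of trimming silence, here is a very rough Go sketch. It is a plain energy threshold, not a real VAD and not the method from the example mentioned above: it cuts decoded samples into frames and keeps the runs of frames whose RMS level reaches a threshold. `Segment`, `voicedSegments`, the frame length and the threshold are all made up for illustration, and the samples are assumed to be already decoded to floats in the range -1..1.

```go
package main

import (
	"fmt"
	"math"
)

// Segment is a half-open range [Start, End) of sample indices.
type Segment struct{ Start, End int }

// voicedSegments splits samples into frames of frameLen samples and returns
// the runs of consecutive frames whose RMS level is at or above threshold.
func voicedSegments(samples []float64, frameLen int, threshold float64) []Segment {
	if frameLen <= 0 {
		return nil
	}
	var segs []Segment
	start := -1 // start of the current loud run, -1 while inside silence
	for i := 0; i < len(samples); i += frameLen {
		end := i + frameLen
		if end > len(samples) {
			end = len(samples)
		}
		var sum float64
		for _, s := range samples[i:end] {
			sum += s * s
		}
		rms := math.Sqrt(sum / float64(end-i))
		switch {
		case rms >= threshold && start < 0:
			start = i // a loud run begins
		case rms < threshold && start >= 0:
			segs = append(segs, Segment{start, i}) // the loud run ends
			start = -1
		}
	}
	if start >= 0 {
		segs = append(segs, Segment{start, len(samples)})
	}
	return segs
}

func main() {
	// 4 silent samples, 4 loud samples, 4 silent samples.
	samples := []float64{0, 0, 0, 0, 0.5, -0.5, 0.5, -0.5, 0, 0, 0, 0}
	fmt.Println(voicedSegments(samples, 4, 0.1)) // prints [{4 8}]
}
```

A threshold like this treats any loud sound as "voice", so it only removes silence; music or other speakers would still pass through.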

Harder cases contain multiple voices, only one of which is Hers (as opposed to when She imitates or makes different voices to add character and immersion). That is more a case for an NN (or a GMM) that knows Nessi's voice features and simply recognizes Her voice.