Slice them!

The idea is simple. Consider an example from a random .vtt file:

00:00.000 --> 00:05.640
Hallöchen ihr Lieben zu unserem neuen Let's Play Projekt, wir spielen ein neues RPG-Maker-Horrorgame,

Before the actual text there is always a timestamp line that says when the text should be shown.
Search for ` --> ` and split on it to get the times, then take the next line as the text for those times.
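
The split described above can be sketched in Go roughly like this. It is a minimal sketch, not the code from the repo: the `Cue` type and the `parseCues` name are made up for illustration, and it assumes each cue has exactly one line of text, as in the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Cue is a hypothetical type for this sketch: one timestamp pair plus its text.
type Cue struct {
	Start, End, Text string
}

// parseCues scans VTT content line by line. A line containing " --> " is
// split into start and end times; the line right after it is taken as the text.
func parseCues(vtt string) []Cue {
	var cues []Cue
	sc := bufio.NewScanner(strings.NewReader(vtt))
	for sc.Scan() {
		start, rest, found := strings.Cut(sc.Text(), " --> ")
		if !found {
			continue // header, blank line, or anything else that is not a timestamp line
		}
		// Cue settings such as "align:start" may follow the end time; keep only the time.
		end := rest
		if fields := strings.Fields(rest); len(fields) > 0 {
			end = fields[0]
		}
		if !sc.Scan() {
			break // timestamp line without a following text line
		}
		cues = append(cues, Cue{
			Start: strings.TrimSpace(start),
			End:   end,
			Text:  strings.TrimSpace(sc.Text()),
		})
	}
	return cues
}

func main() {
	sample := "WEBVTT\n\n00:00.000 --> 00:05.640\nHallöchen ihr Lieben zu unserem neuen Let's Play Projekt,\n"
	for _, c := range parseCues(sample) {
		fmt.Printf("%s .. %s: %s\n", c.Start, c.End, c.Text)
	}
}
```

Cues whose text spans several lines would need the inner read to continue until the next blank line; the sketch deliberately stays with the "next line" rule from the text.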

Dumb approach

My first thought was to use GNU utils and write a shell script. The script would assemble ffmpeg commands and then run them through eval. It was not smart. Storing the audio files on an external USB drive did not help either. In the end I had 413k commands to run (cherry-picked version) at ~3 s per command, so this route would take about two weeks to finish:

#!/bin/bash
#dumb implementation (bash, not sh: the loop at the end uses a here-string)

set -e

filepath=${1:-}
audioname=$(printf '%s' "$filepath" | sed 's/\(.*\)\.vtt$/\1/')
filename=$(basename "$audioname")

outdir=$(dirname "$filepath")/utterances
mkdir -p "$outdir"


# given vtt what is original ext?
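# paths are double-quoted inside the generated commands so spaces survive eval;
# -nostdin keeps ffmpeg from swallowing the rest of the command list on stdin
ffmpeg_commands=$(grep -e "-->" "$filepath" | awk -F " " \
    -v aname="$audioname" \
    -v bname="$filename" \
    -v outdir="$outdir" \
    '{print "ffmpeg -nostdin -i \"" aname "\" -ss " $1 " -to " $3 \
    " -metadata \"text_source=" aname ".vtt\"" \
    " -ar 22050 \"" outdir "/" bname "_" $1 "_" $3 ".wav\""}')

while read -r cline;
do
    echo "$cline";
    eval "$cline";
done <<< "$ffmpeg_commands"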

There must be a better way -> There is a better way

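What makes an approach dumb? Why is this one dumb? It starts a fresh ffmpeg process for every clip and runs them strictly one after another, so at ~3 s per command nothing overlaps and the total just adds up.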
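
The two-week estimate for the shell route is just the command count times the per-command time, since the loop runs one ffmpeg at a time. A quick check in Go, using the 413k and ~3 s figures from above (the function name is made up for this sketch):

```go
package main

import (
	"fmt"
	"time"
)

// sequentialTotal returns the wall time for n commands run one after another,
// assuming every command takes the same perCommand duration.
func sequentialTotal(n int, perCommand time.Duration) time.Duration {
	return time.Duration(n) * perCommand
}

func main() {
	total := sequentialTotal(413_000, 3*time.Second)
	fmt.Printf("%.1f days\n", total.Hours()/24) // prints "14.3 days"
}
```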

I do not want to write multithreaded code in Python, especially in a world where Threado is dead.
Goroutines, on the other hand, are like green threads. No idea what that means, though. Probably that they are good for the environment. Or there is a possible alternative explanation.

The Go solution lives in its own repo and rests on this pillar code:

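// Utterance is one subtitle cue: its time range, its text, and where the clip goes.
type Utterance struct {
	LeftTime  string // cue start, as written in the .vtt
	RightTime string // cue end, as written in the .vtt
	Text      string
	OutPath   string // path of the clip to produce
	FD        *FileData
}

// FileData describes the source files an utterance is cut from.
type FileData struct {
	VttPath   string
	AudioPath string
	AudioBase string
}

// cutoutClipAndTranscode cuts the LeftTime..RightTime range out of the source
// audio and writes it to OutPath, recording the .vtt path as metadata.
// Needs "fmt" and an ffmpeg wrapper imported as ffmpeg; the calls look like
// github.com/u2takey/ffmpeg-go, but that is an assumption.
func cutoutClipAndTranscode(ut *Utterance) error {
	err := ffmpeg.Input(ut.FD.AudioPath,
		ffmpeg.KwArgs{
			"ss": ut.LeftTime,
			"to": ut.RightTime,
		},
	).Output(ut.OutPath, ffmpeg.KwArgs{
		"metadata": fmt.Sprintf(`source="%s"`, ut.FD.VttPath),
	}).OverWriteOutput().ErrorToStdOut().Run()
	return err
}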
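
How the paths in these structs relate can be guessed from the shell script and from the clip names the program produces: the audio path is the .vtt path without ".vtt", and a clip is named after the audio file plus both timestamps. A minimal sketch of that naming, with hypothetical helper names and example paths (the repo may build its paths differently):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// audioPathFor strips the trailing ".vtt", mirroring the sed call in the
// shell script: "episode.opus.vtt" becomes "episode.opus".
func audioPathFor(vttPath string) string {
	return strings.TrimSuffix(vttPath, ".vtt")
}

// clipPath builds <outDir>/<audio base>_<left>_<right><audio extension>.
func clipPath(outDir, audioPath, left, right string) string {
	base := filepath.Base(audioPath)
	name := base + "_" + left + "_" + right + filepath.Ext(base)
	return filepath.Join(outDir, name)
}

func main() {
	// Example paths are invented for the sketch.
	audio := audioPathFor("/data/episode.opus.vtt")
	fmt.Println(clipPath("/data/utterances", audio, "00:13.000", "00:21.000"))
}
```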

ffmpeg is heavy on the CPU. My CPU still copes with 100 goroutines, but is that the optimal number?
The frog in the well does not know the answer. And I do not know either.
The command looks like

prep-dataset -d "/path/to/dir/with/audio/and/vtt"
If there are no errors, it produces the audio clips and a metadata.tsv file.
How fast is it? I did not measure, but it took a few hours to produce all the clips.
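
On the question of how many goroutines to run: one way to experiment is a fixed-size worker pool, where the number of concurrent ffmpeg calls is a single parameter that can be tuned and timed. The sketch below is not the code from the repo. `runPool` and `errDemo` are made-up names, and `runtime.NumCPU()` is only an assumed starting point, not a measured optimum.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// errDemo is a placeholder error for the demo job below.
var errDemo = errors.New("demo failure")

// runPool runs job(i) for every i in [0, n) on at most `workers` goroutines
// and returns how many jobs returned an error.
func runPool(n, workers int, job func(i int) error) int {
	if workers < 1 {
		workers = 1
	}
	indexes := make(chan int)
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range indexes {
				if err := job(i); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	for i := 0; i < n; i++ {
		indexes <- i // blocks until some worker is free
	}
	close(indexes)
	wg.Wait()
	return failed
}

func main() {
	// In the real program the job would call cutoutClipAndTranscode on utterance i.
	failed := runPool(10, runtime.NumCPU(), func(i int) error {
		if i%5 == 0 {
			return errDemo
		}
		return nil
	})
	fmt.Println("failed:", failed) // prints "failed: 2"
}
```

Timing a fixed batch of clips with a few different `workers` values would give an actual answer for a given machine and drive.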

The outdir now holds all the audio clips.
The Go program also produces metadata.tsv, which contains
Filename\tText lines
(a comma delimiter is unusable here: the text and even the filenames can contain commas or other symbols)
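
Writing such a file takes only a few lines. The sketch below shows one possible way, not the code from the repo: the function names and the sample rows are made up, and replacing stray tabs and line breaks with spaces is an assumption about how to keep each row on one line.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// tsvLine builds one "Filename\tText" row. A tab or line break inside a field
// would break the format, so each one is replaced with a space.
func tsvLine(filename, text string) string {
	clean := strings.NewReplacer("\t", " ", "\r", " ", "\n", " ")
	return clean.Replace(filename) + "\t" + clean.Replace(text) + "\n"
}

// writeMetadata writes one row per (filename, text) pair to w.
func writeMetadata(w io.Writer, rows [][2]string) error {
	for _, r := range rows {
		if _, err := io.WriteString(w, tsvLine(r[0], r[1])); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	rows := [][2]string{
		{"clip.opus_00:13.000_00:21.000.opus", "Was? Was? Was ist?"},
		{"clip.opus_01:51.400_01:53.200.opus", "Oh, tut mir leid."},
	}
	if err := writeMetadata(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Commas pass through untouched, which is the point of using a tab as the delimiter.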

head metadata.tsv
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #78 - Hammer-Diskussionen [lOVwt1rzPDQ].opus_17:41.000_17:49.000.opus    Dann lasst uns mit etwas beginnen, das sogar Miss Sonya verstehen kann. Die Waffe.
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #9 - Comedy-Duo [vrbr-ZQ-wHk].opus_18:38.800_18:47.640.opus      einmal ich kenne und der sich in eurer Gruppe befindet nur ein Witz was sagst du denn da dieser
#2 - Lila Super-Sturm [kpmMEgREvfY].opus_12:28.640_12:31.640.opus       Ja, fast. Bischen daneben.
Stardew Valley [Let's Play ⧸ Deutsch] #69 - Fischlieferant [1zSdIz-off4].opus_09:19.680_09:20.680.opus  vorbei.
The Maid of Fairewell Heights [Let's Play ⧸ Deutsch] #3 - Magische Jinn-Katze [Zhj-hSWH0q8].opus_01:51.400_01:53.200.opus       Oh, tut mir leid.
The Mystery Files of Detective Inaba No. 1 (1⧸2) [Let's Stream ⧸ Deutsch] [9FU2ozJhWlE].opus_01:44:54.720_01:45:02.720.opus     Mr. Inaba, ich habe Dr. Harima gefunden und habe ihm gesagt, er soll schnell kommen,
Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion? [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus       Was? Was? Was ist? Verflucht, ich bekomme es nicht hin. Dann wollen wir mal die Schrauben lösen.
Corpse Party: Book of Shadows [Deutsch ⧸ Let's Play] #7 - Unser Schicksal? [Ch. Seal] [JpzdiHEjhzY].opus_05:15.000_05:25.000.opus     Nein, Moment, daran erinnere ich mich auch. Wenn ich versuche sie hochzuheben, dann werde ich nur bewirken, dass ich sie noch stärker hinunterziehe und ihr Genick breche.
Danganronpa [Deutsch ⧸ Let's Play] #90 - Leiche [GwjWIOp8W0E].opus_12:09.400_12:11.400.opus     Neun Lichter?
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #112 - Noch mehr tolle Gäste! [9FvHFXgUiHI].opus_07:32.000_07:33.000.opus        Natürlich, das ist schließlich mein Zweck.

Conclusion and next page

The next page is about voice (vocals) extraction from the sliced audio clips.