Slice them!
The idea is simple. Consider an example from a random .vtt file:
00:00.000 --> 00:05.640
Hallöchen ihr Lieben zu unserem neuen Let's Play Projekt, wir spielen ein neues RPG-Maker-Horrorgame,
Before the actual text there is always a timestamp line that says when that text should be shown. Search for ` --> ` and split on it to get the start and end times, then take the next line as the text for those times.
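A minimal sketch of that split-and-take-next-line idea in Go. The `Cue` type and the `parseCues` function are hypothetical names used only for illustration; they are not taken from the repo. The sketch assumes one text line per timestamp line and ignores any extra settings that may follow the end time.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Cue is a hypothetical name for one timestamped piece of text.
type Cue struct {
	Start string
	End   string
	Text  string
}

// parseCues scans VTT content line by line. A line containing " --> " is
// split into start and end times; the line right after it is taken as the text.
func parseCues(vtt string) []Cue {
	var cues []Cue
	sc := bufio.NewScanner(strings.NewReader(vtt))
	for sc.Scan() {
		start, end, ok := strings.Cut(sc.Text(), " --> ")
		if !ok {
			continue // header, blank line, or stray text line
		}
		cue := Cue{Start: strings.TrimSpace(start), End: strings.TrimSpace(end)}
		if sc.Scan() {
			cue.Text = strings.TrimSpace(sc.Text())
		}
		cues = append(cues, cue)
	}
	return cues
}

func main() {
	sample := "WEBVTT\n\n00:00.000 --> 00:05.640\nHallöchen ihr Lieben zu unserem neuen Let's Play Projekt,\n"
	for _, c := range parseCues(sample) {
		fmt.Printf("%s | %s | %s\n", c.Start, c.End, c.Text)
	}
}
```

Lines without ` --> ` are simply skipped, so the file header and blank separators need no special handling.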
Dumb approach
My first thought was to use GNU utils and write a shell script. The script would build the ffmpeg commands and then run them through eval. It was not smart. The fact that I stored the audio files on an external USB drive did not help either. In the end I had 413k commands to run (cherry-picked version) at ~3 s per command: 413,000 × 3 s ≈ 1.24 million seconds, so this route takes about two weeks to finish.
#!/bin/bash
# dumb implementation (bash, not sh: the here-string at the bottom is a bashism)

set -e

filepath=${1:-}
audioname=$(printf '%s' "$filepath" | sed 's/\(.*\)\.vtt$/\1/')
filename=$(basename "$audioname")
outdir=$(dirname "$filepath")/utterances
mkdir -p "$outdir"

# given vtt what is original ext?
# NOTE: paths with spaces or quotes break once the line goes through eval
ffmpeg_commands=$(grep -e "-->" "$filepath" | awk -F " " \
    -v aname="$audioname" \
    -v bname="$filename" \
    -v outdir="$outdir" \
    '{print "ffmpeg -i "aname" -ss " $1 " -to " $3 \
    " -metadata text_source=" aname ".vtt" \
    " -ar 22050 " outdir "/" bname"_" $1"_"$3 ".wav"}')

while read -r cline; do
    echo "$cline"
    # stdin from /dev/null so ffmpeg does not swallow the remaining commands
    eval "$cline" < /dev/null
done <<< "$ffmpeg_commands"
There must be a better way -> There is a better way
What makes the first approach dumb?
- Each command reads the file from disk instead of from RAM, even if that file was already in memory (loaded by the previous command);
- Everything runs in a single thread;
- There is extra work such as conversion to wav and resampling 44 kHz -> 22 kHz. I suspect wav to be the inferior format, even though it gets a lot of love from AI people (if you can call them people). More research needed.
Goroutines, on the other hand, are like green threads. No idea what that means, though. Probably that they are good for the environment. Or possible alternative explanation.
Link to the repo with the Go solution
with this pillar code:
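import (
	"fmt"

	// Assumption: the ffmpeg package is github.com/u2takey/ffmpeg-go,
	// whose API matches the calls used below.
	ffmpeg "github.com/u2takey/ffmpeg-go"
)

// Utterance is one timestamped clip to cut out of an audio file.
type Utterance struct {
	LeftTime  string // start timestamp from the .vtt file
	RightTime string // end timestamp from the .vtt file
	Text      string
	OutPath   string
	FD        *FileData
}

// FileData describes the source files an Utterance belongs to.
type FileData struct {
	VttPath   string
	AudioPath string
	AudioBase string
}

// cutoutClipAndTranscode cuts the [LeftTime, RightTime] span out of the
// source audio and writes it to OutPath, tagging it with the .vtt path.
func cutoutClipAndTranscode(ut *Utterance) error {
	err := ffmpeg.Input(ut.FD.AudioPath,
		ffmpeg.KwArgs{
			"ss": ut.LeftTime,
			"to": ut.RightTime,
		},
	).Output(ut.OutPath,
		ffmpeg.KwArgs{
			"metadata": fmt.Sprintf(`source="%s"`, ut.FD.VttPath),
		}).OverWriteOutput().ErrorToStdOut().Run()
	return err
}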
ffmpeg is heavy on the CPU. My CPU still copes with 100 goroutines, but is that the optimal number?
The frog in the well does not know the answer. And I also do not know.
The command looks like
prep-dataset -d "/path/to/dir/with/audio/and/vtt"
If there are no errors, it produces the audio clips and the metadata.tsv file.
How fast is it? I did not measure, but it took a few hours to produce all the clips.
The outdir now holds all the audio clips.
The Go program also produces metadata.tsv, which contains
Filename\tText lines
(a comma delimiter is unusable in this case; the text and even the filenames can contain commas or other symbols)
head metadata.tsv
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #78 - Hammer-Diskussionen [lOVwt1rzPDQ].opus_17:41.000_17:49.000.opus	Dann lasst uns mit etwas beginnen, das sogar Miss Sonya verstehen kann. Die Waffe.
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #9 - Comedy-Duo [vrbr-ZQ-wHk].opus_18:38.800_18:47.640.opus	einmal ich kenne und der sich in eurer Gruppe befindet nur ein Witz was sagst du denn da dieser
#2 - Lila Super-Sturm [kpmMEgREvfY].opus_12:28.640_12:31.640.opus	Ja, fast. Bischen daneben.
Stardew Valley [Let's Play ⧸ Deutsch] #69 - Fischlieferant [1zSdIz-off4].opus_09:19.680_09:20.680.opus	vorbei.
The Maid of Fairewell Heights [Let's Play ⧸ Deutsch] #3 - Magische Jinn-Katze [Zhj-hSWH0q8].opus_01:51.400_01:53.200.opus	Oh, tut mir leid.
The Mystery Files of Detective Inaba No. 1 (1⧸2) [Let's Stream ⧸ Deutsch] [9FU2ozJhWlE].opus_01:44:54.720_01:45:02.720.opus	Mr. Inaba, ich habe Dr. Harima gefunden und habe ihm gesagt, er soll schnell kommen,
Let's Play Awakening [Rpg Maker Horror ⧸ Demo] 8 - Ups, eine schiefgelaufene Rettungsaktion? [wX9ugDgY-iM].opus_00:13.000_00:21.000.opus	Was? Was? Was ist? Verflucht, ich bekomme es nicht hin. Dann wollen wir mal die Schrauben lösen.
Corpse Party: Book of Shadows [Deutsch ⧸ Let's Play] #7 - Unser Schicksal? [Ch. Seal] [JpzdiHEjhzY].opus_05:15.000_05:25.000.opus	Nein, Moment, daran erinnere ich mich auch. Wenn ich versuche sie hochzuheben, dann werde ich nur bewirken, dass ich sie noch stärker hinunterziehe und ihr Genick breche.
Danganronpa [Deutsch ⧸ Let's Play] #90 - Leiche [GwjWIOp8W0E].opus_12:09.400_12:11.400.opus	Neun Lichter?
Danganronpa 2: Goodbye Despair [Deutsch ⧸ Let's Play] #112 - Noch mehr tolle Gäste! [9FvHFXgUiHI].opus_07:32.000_07:33.000.opus	Natürlich, das ist schließlich mein Zweck.
Conclusion and next page
- generating and executing separate ffmpeg commands from the shell, with I/O on a USB drive, was dumb;
- my Go solution was much better;
- so I finally have my clips and the metadata.tsv file.