Make subtitles for each audio file using whisper

Surely you already know about openai/whisper;
if not, go read about it.
In short: it is a tool that transcribes speech from audio.
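
For illustration, transcription with plain openai/whisper is only a few lines (a minimal sketch; "audio.m4a" is just a placeholder path):

import whisper  # pip install openai-whisper

# Load one of the stock model sizes: tiny, base, small, medium, large.
model = whisper.load_model("medium")

# Transcribe a file; the result is a dict with the full text and timed segments.
result = model.transcribe("audio.m4a")
print(result["text"])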

No matter how good your GPU is, it takes forever to generate subs for 90 GB of audio.
That's where faster-whisper enters the stage: it can be up to 4x faster than the original whisper without losing accuracy (though it sometimes introduces new bugs).

This is not a handholding/spoonfeeding article; it is more about sharing my experience than providing a tutorial. Besides, in these dynamic times, the more specific you get, the less valuable your advice becomes once something changes.

Transcription script (mostly adapted from the original)

from faster_whisper import WhisperModel

model_path = "/home/grail/projects/third_party/faster-whisper/whisper-medium-ct2"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")

def format_timestamp(seconds, always_include_hours=False, decimal_marker="."):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"

def write_result(file, segments):
    print("WEBVTT\n", file=file)
    for segment in segments:
        # `segments` is a lazy generator: each cue is written (and flushed)
        # as soon as it is transcribed, and echoed to stdout so you can
        # watch the progress.
        cue = (
            f"{format_timestamp(segment.start)} --> {format_timestamp(segment.end)}\n"
            f"{segment.text.strip().replace('-->', '->')}\n"
        )
        print(cue, file=file, flush=True)
        print(cue)

if __name__ == "__main__":
    import sys
    import os.path

    filepath = sys.argv[1]
    vtt_out = filepath + ".vtt"
    # Skip files that already have subtitles from a previous run,
    # so the batch loop below can be restarted safely.
    if os.path.isfile(vtt_out):
        print(vtt_out, "already exists")
        sys.exit(0)
    # Language is hardcoded to German here; change or drop the argument
    # to let the model auto-detect it. The second return value is the
    # transcription info, which we do not need.
    segments, _ = model.transcribe(filepath, beam_size=5, language="de")
    with open(vtt_out, "w") as lf:
        write_result(lf, segments)
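
A note on model_path: the script expects a Whisper checkpoint already converted to CTranslate2 format (hence the -ct2 suffix in the directory name). At the time of writing, the converter shipped with CTranslate2 (a faster-whisper dependency) could produce such a directory with a command along these lines; check the faster-whisper README for the current invocation:

ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 \
    --copy_files tokenizer.json --quantization float16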

The script takes the path to an audio file (for files downloaded from YouTube that is mostly opus and m4a)
and, if nothing goes wrong, creates an output file with the same name as the original (extension included) plus .vtt.
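
If everything works, the generated file looks roughly like this (timestamps and text below are made up for illustration):

WEBVTT

00:00.000 --> 00:04.260
Hallo und willkommen zu diesem Video.

00:04.260 --> 00:09.800
Heute sprechen wir über Untertitel.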

vtt_out = filepath + ".vtt"
Later we will depend on this naming convention: take a .vtt file, strip the .vtt extension, and you get back the path to the original audio file.
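
Given that convention, recovering the audio path is one call (a sketch; str.removesuffix needs Python 3.9+, and the filename is a made-up example):

vtt_path = "talk.m4a.vtt"
audio_path = vtt_path.removesuffix(".vtt")  # -> "talk.m4a"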

Now, with that script, all you need to do is run something like

for f in /path/to/audios/*; do python fw_transcribe.py "$f"; done

and wait a few days (and nights). It should take less than a week.

Conclusion and next page

The next page is about slicing the original audio into phrases and building a metadata file.