Better solution out of the box: whisper
If your GPU allows it, you may consider using OpenAI's whisper to convert the audio to text.
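I did so with the medium model:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |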
A few things to keep in mind:
- a file that is too big (more than ~1 GB) can freeze my machine (likely memory related)
- the video stream is not needed
- code that depends on CUDA requires a proper (NVIDIA) GPU
pacman -Q | grep cuda
cuda 11.8.0-1
cuda-tools 11.8.0-1
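For the large-file problem mentioned above, one possible workaround is to cut the audio into shorter pieces before transcribing. The sketch below only builds and runs an ffmpeg command that uses its segment muxer. The 30-minute chunk length, the output pattern and the function names are assumptions chosen for illustration, not values tested here, and a cut may land in the middle of a word.

```python
import subprocess


def build_split_cmd(src, chunk_seconds=1800, pattern="chunk_%03d.ogg"):
    """Build an ffmpeg command that cuts `src` into fixed-length pieces.

    chunk_seconds and pattern are illustrative defaults (assumptions).
    """
    return [
        "ffmpeg",
        "-i", src,                            # source audio file
        "-f", "segment",                      # ffmpeg's segment muxer
        "-segment_time", str(chunk_seconds),  # target length of each piece
        "-c", "copy",                         # copy the stream, no re-encoding
        pattern,                              # numbered output files
    ]


def split_audio(src, chunk_seconds=1800, pattern="chunk_%03d.ogg"):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_split_cmd(src, chunk_seconds, pattern), check=True)
```

Calling split_audio("v13_1.ogg") would write chunk_000.ogg, chunk_001.ogg and so on into the current directory, and each piece can then be passed to whisper separately.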
Getting the audio
ffmpeg -i v13_1.webm -vn -acodec copy v13_1.ogg
- -i: source file
- -vn: no video
- -acodec copy: copy audio (instead of processing/converting)
- v13_1.ogg: output file
Running whisper should produce three new files (.vtt, .srt, .txt) in the directory it was called from:
whisper --model medium --language de v13_1.ogg
du -h v13_1.ogg*
416M v13_1.ogg
644K v13_1.ogg.srt
440K v13_1.ogg.txt
612K v13_1.ogg.vtt
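The same transcription can also be driven from Python instead of the command line. The sketch below assumes the openai-whisper package is installed and uses its load_model and transcribe calls. The helper names (transcribe_file, srt_timestamp, segments_to_srt) are made up for this example, and the SRT formatting is a simplified stand-in, not whisper's own writer.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn a list of segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_file(path, model_name="medium", language="de"):
    """Transcribe one audio file; needs the openai-whisper package and ffmpeg."""
    import whisper  # imported here so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    # result["text"] is the full transcript, result["segments"] has timings
    return result["text"], segments_to_srt(result["segments"])
```

With this, transcribe_file("v13_1.ogg") would return the plain text and an SRT string that can be written to disk wherever needed.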
Now it is time to compare the quality of the result:
./quality_check.py overlord_audiobooks/vol_13/v13_1.ogg.txt 2>/dev/null
0.8536507196550318:68644:overlord_audiobooks/vol_13/v13_1.ogg.txt
- image recognition: 0.813456818978614
- audio recognition: 0.8536507196550318
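quality_check.py itself is not shown here. Purely as an illustration of how a score between 0 and 1 could be produced, the sketch below computes the share of words that appear in a known vocabulary. This is a guess at the general idea, not the actual script; known_word_ratio and the tiny vocabulary are hypothetical.

```python
import re


def known_word_ratio(text, vocabulary):
    """Return (share of words found in vocabulary, number of words).

    Hypothetical metric for illustration; not the real quality_check.py.
    """
    # runs of letters only, lowercased, so digits and punctuation are ignored
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0, 0
    known = sum(1 for word in words if word in vocabulary)
    return known / len(words), len(words)
```

For example, known_word_ratio("Der Hund und Neyyaa", {"der", "hund", "und"}) counts three known words out of four, so a misspelled name lowers the score.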
Name problem
Reading the whisper result, it becomes clear that whisper gets confused about how to spell fantasy names.
Some of the variations of Neia (Baraha):
- Neyyaa
- Nehiaa
- Neyar
- Nehia
- Nehiawas
- Nerea
- Neha
- Neia (oh wow, the correct one is also used)
- Nehiyah
- Ne'ya (-_-)
- Neyawas
- Nelia
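One simple way to clean this up afterwards is to replace the known misspellings with the intended name. The sketch below uses the variants from the list above. The function name normalize_names and the mapping are assumptions for this example. Fused forms such as Nehiawas are deliberately left out, because it is not obvious how they should be split.

```python
import re

# Variants seen in the transcript, mapped to the intended spelling.
# Fused forms (Nehiawas, Neyawas) are not handled here on purpose.
NAME_VARIANTS = {
    "Neia": ["Neyyaa", "Nehiaa", "Neyar", "Nehia", "Nerea",
             "Neha", "Nehiyah", "Ne'ya", "Nelia"],
}


def normalize_names(text, variants=NAME_VARIANTS):
    """Replace each known misspelling with its canonical name."""
    for canonical, wrong in variants.items():
        # longest first, so a short variant never wins over a longer one
        ordered = sorted(wrong, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
        text = re.sub(pattern, canonical, text)
    return text
```

The word boundaries keep the replacement from touching longer words, so "Nehiawas" stays as it is and can be reviewed by hand.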