I want the best female German voice I know of (NesfateLP) to voice over everything German-related.
A few definitions:
- NesfateLP is a YouTube channel (mostly about video games);
- Nessi is the owner of the female (German) voice;
List of reasons why she is the best (imo) candidate for voice cloning:
- Superior voice (clear, likable, native German speaker);
- The games she plays are text-heavy, so she reads / translates a lot;
- Especially valuable are the indie games without music or voice-overs, where her voice is the main thing you hear;
There are a number of TTS projects that can be used for voice cloning
- TTS projects on GitHub:
- https://github.com/CorentinJ/Real-Time-Voice-Cloning
- https://github.com/coqui-ai/tts
- https://github.com/neonbjb/tortoise-tts
New methods and magical formulas get created all the time.
The data itself is a constant.
So the most important thing is to get the data first.
Secure it before it gets deleted or locked down.
The dataset should consist of multiple audio files with utterances from the target person,
plus a metadata text file that pairs those filenames with the phrases / utterances they contain.
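To make the pairing concrete, here is a minimal sketch of writing and reading such a metadata file. It assumes an LJSpeech-like layout with one clip per line and `|` as the separator, simplified to two fields (clip id and text). The function names and the clip ids like `nessi_000001` are made up for illustration.

```python
def write_metadata(rows, path):
    """Write (clip_id, text) pairs as 'clip_id|text' lines, one clip per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for clip_id, text in rows:
            # '|' is the field separator, so it must not survive inside the text.
            # Whitespace runs (including newlines) are collapsed to single spaces.
            clean = " ".join(text.replace("|", " ").split())
            f.write(f"{clip_id}|{clean}\n")


def read_metadata(path):
    """Read the file back into a dict: clip_id -> text."""
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            clip_id, text = line.split("|", 1)
            pairs[clip_id] = text
    return pairs
```

With this, `write_metadata([("nessi_000001", "Hallo zusammen")], "metadata.csv")` produces one line per clip, and the clip id is expected to match the audio filename without its extension.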
Dataset example: LJ-Speech-Dataset
From the LJ Speech Dataset site:
Plan in steps: making the dataset
- get audio
- make subtitles (with timestamps)
- slice audio files into phrase clips using timestamps
- compile metadata file
- filter out noisy / bad audio clips
- sanity check (bad grammar, too short/long)
- update metadata, compile statistics
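The slicing and sanity-check steps can be sketched roughly like this. It is only an illustration under a few assumptions: subtitles use SRT-style timestamps (`HH:MM:SS,mmm`), clips are cut with ffmpeg, and the duration limits (1 to 15 seconds) and the 22050 Hz mono output are placeholder values I picked, not tested settings. All function names are hypothetical.

```python
import re

# SRT-style timestamp, e.g. "00:01:02,500" (a '.' before the milliseconds is accepted too).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' into seconds as a float."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def keep_clip(start, end, text, min_s=1.0, max_s=15.0):
    """Sanity check: drop clips that are too short, too long, or have no text."""
    duration = end - start
    return min_s <= duration <= max_s and bool(text.strip())


def ffmpeg_cut_args(src, dst, start, end, sample_rate=22050):
    """Build the ffmpeg argument list that cuts [start, end] out of src as mono audio."""
    return [
        "ffmpeg", "-i", src,
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
        "-ac", "1", "-ar", str(sample_rate),
        dst,
    ]
```

The argument list can be handed to `subprocess.run(args, check=True)` once ffmpeg is installed. Building the list separately makes it easy to print and inspect the commands before running thousands of them.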
Hardware requirements (assumed)
At the time of writing, I would say the requirements for following this dataset prep are:
- about 150G of disk space, preferably on an SSD;
- a CUDA-capable (NVIDIA) GPU with at least 4G of VRAM;
- I assume it is possible to get by with 6G of RAM;
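A quick way to check the disk space part before starting is sketched below. It only uses the Python standard library and treats "150G" as 150 GiB, which is my assumption. The function names are made up, and the GPU and RAM requirements are not checked here.

```python
import shutil

GIB = 1024 ** 3  # bytes in one GiB


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path=".", required_gib=150):
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib
```

Run it against the directory where the audio will be stored, for example `has_enough_space("/path/to/dataset")`, since the free space can differ between drives.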
Get the data
On the first attempt I started cherry-picking her playlists that seemed better suited for training.
That was not smart.
It took a lot of time and in the end was pointless
(there is no good reason to track which playlist an utterance comes from).
But since the work was done, here are the playlist names from the cherry-picked version:
$ ls
amayado_bus_stop black_forest danganronpa end_roll hello_charlotte libretta nomnomnami_games peret_em_heru sally_face the_liar_princess_and_the_blind_prince undertale
angels_of_death blight_dream danganronpa_2 fairy_in_a_jar hospital_heat lisa omori pocket_mirror shoujo_kidan the_maid_of_farewell_heights underworld_capital_incident
arias_story camp_sunshine dangan_v3 fantasy_maidens_odd_hideout imaginary_friends mad_father oneshots pony_island sin the_mystery_files wait
awakening cat_in_the_box darkside_detective fran_bow insanity_remake midnight_puppeteer otaku_adventure purgatory stardew_valley to_the_moon wizard_of_the_white_box
bad_dream_coma coffin_of_ashes densha_soul from_next_door just_ignore_them midnight_train papers_please rakuen stitched traumerei wolf_and_rabbit
bevels_painting colors detention ghost_school kindergarten mother paranoiac red_book stray_cat trick_and_treat
black_dream corpse_party dreamfarer hansel leftway needy_girl_overdose paranormal_syndrome_2 re_turn stygian turtlehead_unmasked
You see? That is annoying.
So a far more breathtaking way is to download all of the (NesfateLP) channel audio at once:
$ mkdir nesfatelp_audio
$ cd nesfatelp_audio
$ yt-dlp --embed-metadata -i -x -f bestaudio/best "https://www.youtube.com/channel/UCTxpPY8qj8dx6gsSf0TXCjA"
By the end of the download I expect you to have about 90G of her audio.
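To see how close you are to that number, here is a small sketch that adds up the downloaded files per extension. Nothing in it is specific to yt-dlp; it just walks a folder. The function names are hypothetical, and which extensions show up (for example `.opus` or `.m4a`) depends on what the site served.

```python
from collections import Counter
from pathlib import Path


def size_by_suffix(folder):
    """Total size in bytes of all files under `folder`, grouped by lowercased extension."""
    totals = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            totals[path.suffix.lower()] += path.stat().st_size
    return totals


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 1024 ** 3
```

For example, `to_gib(sum(size_by_suffix("nesfatelp_audio").values()))` gives the overall size, and the per-extension totals make leftover partial downloads easy to spot.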
Conclusion and next page
- you learned about the owner of a superior voice
- you have downloaded the audio from her channel