Voice dataset: reasoning

I want the best female German voice that I know of (NesfateLP) to voice over everything German-related.

NesfateLP is a YouTube channel (mostly about video games);
Nessi is the owner of the female (German) voice;

List of reasons why she is the best (imo) candidate for voice cloning

Superior voice (clear, likable, native German);
The games she plays are text-heavy, so she reads and translates a lot;
Many of them are indie games without music or voice-overs, which is valuable because her voice is not mixed with other audio;

There are a number of TTS projects that can be used for voice cloning:

https://github.com/CorentinJ/Real-Time-Voice-Cloning
https://github.com/coqui-ai/tts
https://github.com/neonbjb/tortoise-tts

New methods and magic formulas get created all the time.
The data itself is a constant.
So the most important thing is to get the data first. Secure it before it gets deleted or locked down.
The dataset should consist of multiple audio files with utterances from the target person, plus a metadata text file that pairs each filename with the phrase / utterance it contains.
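
To make that layout a bit more concrete, here is a minimal Python sketch. It assumes an LJSpeech-like convention: a wavs/ directory with one clip per utterance and a metadata.csv with one "clip_id|text" line per clip. The real LJ Speech metadata has a third, normalized-text field that I leave out here. The function names and the sample sentences in the usage note are made up for illustration.

```python
from pathlib import Path


def write_metadata(rows, path):
    """Write (clip_id, text) pairs as 'clip_id|text' lines (simplified layout)."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for clip_id, text in rows:
            # The pipe is the field separator, so it must not appear in the text.
            if "|" in text or "\n" in text:
                raise ValueError(f"text for {clip_id} contains '|' or a newline")
            f.write(f"{clip_id}|{text}\n")


def read_metadata(path):
    """Read the file back into a list of (clip_id, text) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            clip_id, text = line.split("|", 1)
            pairs.append((clip_id, text))
    return pairs


def clip_path(dataset_dir, clip_id):
    """Assumed location of the audio clip that belongs to a metadata line."""
    return Path(dataset_dir) / "wavs" / f"{clip_id}.wav"
```

For example, write_metadata([("clip_0001", "Hallo zusammen!")], "metadata.csv") produces a file that read_metadata("metadata.csv") turns back into the same list.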

Dataset example LJ-Speech-Dataset

From the LJ Speech Dataset site:

Plan in steps: making dataset

get audio
make subtitles (with timestamps)
slice audio files into phrase clips using the timestamps
compile metadata file
filter out noisy / bad audio clips
sanity check (bad grammar, too short/long)
update metadata, compile statistics
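
The slicing, metadata and sanity-check steps above can be sketched roughly like this in Python. Everything here is an assumption for illustration: the Cue type, the plan_clips name, the 1 to 12 second limits and the 22050 Hz mono output (the format LJ Speech uses). The function only builds ffmpeg argument lists and metadata lines; it does not run anything.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # subtitle start, in seconds
    end: float    # subtitle end, in seconds
    text: str     # phrase spoken in that interval


def plan_clips(source, cues, min_s=1.0, max_s=12.0):
    """Return (ffmpeg_args, metadata_line) for every cue that passes the checks."""
    plan = []
    stem = source.rsplit(".", 1)[0]  # file name without its extension
    for index, cue in enumerate(cues):
        duration = cue.end - cue.start
        text = " ".join(cue.text.split())  # collapse newlines and repeated spaces
        if not text or not (min_s <= duration <= max_s):
            continue  # sanity check: drop empty, too short or too long clips
        clip_id = f"{stem}_{index:05d}"
        args = [
            "ffmpeg", "-y", "-i", source,
            "-ss", f"{cue.start:.3f}", "-to", f"{cue.end:.3f}",
            "-ac", "1", "-ar", "22050",  # mono, 22050 Hz (assumed target format)
            f"{clip_id}.wav",
        ]
        plan.append((args, f"{clip_id}|{text}"))
    return plan
```

Each args list can later be passed to subprocess.run, and the metadata lines can be written out once the clips have been checked.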

Hardware requirements assumption

At the time of writing, I would say the requirements for following this dataset prep are:

about 150G of disk space, preferably on an SSD;
a CUDA-capable (NVIDIA) GPU with at least 4G of VRAM;
I assume it is possible to survive with 6G of RAM;
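
If you want to check the disk part before starting, a small Python sketch like the following could do it. The function name and the default of 150 (decimal) gigabytes are my assumptions taken from the list above; GPU and RAM are not checked here.

```python
import shutil


def has_enough_space(path=".", required_gb=150):
    """True if the filesystem holding 'path' has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1000**3
```

Calling has_enough_space("/path/to/dataset") before downloading tells you whether the drive you picked is big enough.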

Get the data

On the first attempt I cherry-picked her playlists that seemed better suited for training.
That was not smart.
It took a lot of time and in the end was pointless
(there is no good reason to track which playlist an utterance comes from).
But since the work was done, here are the playlist names from the cherry-picked version:

$ ls
amayado_bus_stop  black_forest     danganronpa         end_roll                     hello_charlotte    libretta             nomnomnami_games       peret_em_heru  sally_face      the_liar_princess_and_the_blind_prince  undertale
angels_of_death   blight_dream     danganronpa_2       fairy_in_a_jar               hospital_heat      lisa                 omori                  pocket_mirror  shoujo_kidan    the_maid_of_farewell_heights            underworld_capital_incident
arias_story       camp_sunshine    dangan_v3           fantasy_maidens_odd_hideout  imaginary_friends  mad_father           oneshots               pony_island    sin             the_mystery_files                       wait
awakening         cat_in_the_box   darkside_detective  fran_bow                     insanity_remake    midnight_puppeteer   otaku_adventure        purgatory      stardew_valley  to_the_moon                             wizard_of_the_white_box
bad_dream_coma    coffin_of_ashes  densha_soul         from_next_door               just_ignore_them   midnight_train       papers_please          rakuen         stitched        traumerei                               wolf_and_rabbit
bevels_painting   colors           detention           ghost_school                 kindergarten       mother               paranoiac              red_book       stray_cat       trick_and_treat
black_dream       corpse_party     dreamfarer          hansel                       leftway            needy_girl_overdose  paranormal_syndrome_2  re_turn        stygian         turtlehead_unmasked

You see? That is annoying.
So a far less painful way is to download all of the (NesfateLP) channel's audio at once:

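# keep all of the audio in one directory
mkdir nesfatelp_audio
cd nesfatelp_audio
# -x extracts audio only, -i keeps going when a video fails, --embed-metadata stores video info in the file
yt-dlp --embed-metadata -i -x -f bestaudio/best "https://www.youtube.com/channel/UCTxpPY8qj8dx6gsSf0TXCjA"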
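
A download of this size will probably get interrupted at some point. As a hedged sketch, the same command can be wrapped in Python with yt-dlp's --download-archive option, so that a rerun skips what is already on disk. The archive file name and the function names are my own choices, not part of the original command.

```python
import subprocess

CHANNEL_URL = "https://www.youtube.com/channel/UCTxpPY8qj8dx6gsSf0TXCjA"


def build_command(url=CHANNEL_URL, archive="downloaded.txt"):
    """Same flags as the command above, plus a download archive (an addition)."""
    return [
        "yt-dlp",
        "--embed-metadata",
        "-i",                            # continue when a single video fails
        "-x",                            # extract audio only
        "-f", "bestaudio/best",
        "--download-archive", archive,   # remember finished videos between runs
        url,
    ]


def download(url=CHANNEL_URL):
    """Run yt-dlp in the current directory and return its exit code."""
    return subprocess.run(build_command(url)).returncode
```

Run download() from inside nesfatelp_audio; if it stops halfway, running it again should continue with the videos that are still missing.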

Once the download finishes, I expect you to have about 90G of her audio.
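
To compare what you actually got with that estimate, a short Python sketch can add up the file sizes. The summarize name is hypothetical, and gigabytes are counted as 1000**3 bytes here.

```python
from pathlib import Path


def summarize(directory):
    """Return (number_of_files, total_size_in_gb) for everything under directory."""
    sizes = [p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()]
    return len(sizes), sum(sizes) / 1000**3
```

For example, count, gigabytes = summarize("nesfatelp_audio") gives the number of downloaded files and their combined size.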

Conclusion and next page

you learned about the owner of the superior voice
you have downloaded the audio from her channel

The next page is about getting the text for her audio utterances.