I have a task that I'm sure has been done before, but I can't find a simple solution. Given a FASTA file with multiple sequences per species/individual name, I want to randomly sample one sequence per species/individual. The number of sequences per species/individual varies.
Ideally, the sampled sequences will be written to a new file with the same name as the input FASTA file (which corresponds to a gene name). I have only limited bash experience, so help would be greatly appreciated!
If an answer below was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Explanation: use a hash of arrays. Each species collects its sequences in an array; then iterate over the species and, for each one, pick a random index within that array's length to select a single sequence.
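For anyone reading without the script at hand, the same idea can be sketched in shell with awk. This is not the original rand.pl, just a minimal illustration. It assumes the species/individual name is the first whitespace-delimited word of each header line (adjust the `substr($1, 2)` line if your headers differ). The function name `sample_one_per_species` and the optional seed argument are made up for this example.

```shell
# Sketch: print one randomly chosen record per species from a FASTA file.
# ASSUMPTION: species name = first word of the header, e.g. ">sp1 copyA" -> "sp1".
# Usage: sample_one_per_species input.fasta [seed]
sample_one_per_species() {
    awk -v seed="$2" '
        BEGIN { if (seed != "") srand(seed); else srand() }  # fixed seed is optional
        /^>/ {
            name = substr($1, 2)                      # drop the leading ">"
            if (!(name in count)) order[++nspecies] = name
            idx = ++count[name]                       # records seen for this species
            key = name SUBSEP idx
            record[key] = $0                          # start a new record with its header
            next
        }
        key != "" && NF { record[key] = record[key] "\n" $0 }  # append sequence lines
        END {
            for (i = 1; i <= nspecies; i++) {
                name = order[i]
                pick = int(rand() * count[name]) + 1  # random index in 1..count
                print record[name SUBSEP pick]
            }
        }
    ' "$1"
}
```

Called as `sample_one_per_species gene1.fasta > gene1_sampled.fasta`, it keeps multi-line sequences intact and prints species in the order they first appear. Without a seed, awk seeds from the clock, so two runs within the same second can give identical picks.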
Thank you very much! This Perl option works well, and I think it will be the most efficient for many files. Just a follow-up: how can I get this to work as a loop, since I have hundreds of FASTA files? Sorry if that wasn't clear in the original post. I'm having issues with this:
for file in *.fasta; do (perl rand.pl<"$file")>"$file_new"; done
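A note on the loop above: the shell reads `$file_new` as a variable named `file_new` (underscores are valid in variable names), which is unset, so the redirection target is empty. Writing `"${file}_new"` or `"${file%.fasta}_new.fasta"` avoids that. If the output should keep the same file name as the input, it has to go to a different directory, because redirecting onto the input file would truncate it before it is read. A rough sketch, where the function name `sample_all` and the directory name `sampled` are only placeholders:

```shell
# Sketch: run a sampling command on every .fasta file in in_dir and write each
# result to out_dir under the same file name. Names here are hypothetical.
# Usage: sample_all in_dir out_dir command [args...]
sample_all() {
    in_dir=$1
    out_dir=$2
    shift 2                              # the remaining words are the command to run
    mkdir -p "$out_dir"
    for file in "$in_dir"/*.fasta; do
        [ -e "$file" ] || continue       # skip the literal pattern if nothing matches
        "$@" < "$file" > "$out_dir/$(basename "$file")"
    done
}

# Example, assuming rand.pl reads FASTA on stdin and writes to stdout:
#   sample_all . sampled perl rand.pl
```

Keeping the output in its own directory also means a second run will not pick up the sampled files as input.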
Thank you, but I'm not sure this does what I want. First, I have standard FASTA files, not .tsv files. Also, how does this randomly select one sequence per sample name and paste it into a new file? I apologize if I just don't understand.