I have millions of fasta files each containing many DNA sequences.
I want to batch remove/delete reads (DNA sequences) that are unique within each fasta file. In other words, within each file I only want to keep reads whose sequence occurs at least two times in that file.
Does anybody know of a command-line solution or a program/script that does this?
Many thanks!
seqkit common is what you are looking for.
This is an excellent solution for finding common sequences between two or more files (I tried it). However, I want to extract sequences that occur at least two times within ONE multi-fasta file (or, even better, to remove the unique sequences from that file). Finally, this needs to be done for millions of fasta files.
However, seqkit might still be the right choice via the rmdup command: with -s (--by-seq) it deduplicates by sequence rather than by name, and -d FILE (--dup-seqs-file) writes the duplicated records to a separate file. See: seqkit rmdup
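If you would rather avoid an extra dependency, the same "keep only reads whose sequence occurs at least twice" filter can be sketched with plain awk in a two-pass read of the file. This is a sketch under two assumptions: the file name reads.fa is hypothetical, and each sequence sits on a single line (multi-line fasta can be linearized first, e.g. with seqkit seq -w 0).

```shell
# demo input (hypothetical file name); assumes one sequence per line
printf '>a\nAAAA\n>b\nCCCC\n>c\nAAAA\n' > reads.fa

# pass 1 (NR==FNR) counts each sequence; pass 2 prints header + sequence
# only when that sequence was seen at least twice
awk 'NR==FNR { if (!/^>/) n[$0]++; next }
     /^>/    { h = $0; next }
     n[$0] > 1 { print h; print }' reads.fa reads.fa > dups_only.fa

cat dups_only.fa   # >a / AAAA / >c / AAAA
```

Note the file is read twice (reads.fa appears twice on the awk command line); for typical read files this is still fast because the counting pass only touches sequence lines.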
Use a for loop in bash over your files.
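Putting the pieces together, a loop over a directory of files could look like the sketch below. The directory names (fa, dups_only) and file contents are hypothetical; it assumes one sequence per line and uses an awk filter that keeps only reads occurring at least twice per file. With seqkit installed, the awk step could be swapped for seqkit rmdup -s -d (check that the contents of the -d file match exactly what you need).

```shell
# demo setup: a directory of small fasta files (hypothetical names/paths)
mkdir -p fa dups_only
printf '>a\nAAAA\n>b\nCCCC\n>c\nAAAA\n' > fa/x.fasta
printf '>d\nGGGG\n' > fa/y.fasta

# for every file, keep only reads whose sequence occurs >= 2 times
# in that same file (assumes one sequence per line)
for f in fa/*.fasta; do
  name=${f##*/}
  awk 'NR==FNR { if (!/^>/) n[$0]++; next }
       /^>/    { h = $0; next }
       n[$0] > 1 { print h; print }' "$f" "$f" > "dups_only/$name"
done
```

For millions of files, a plain bash loop is serial; something like find fa -name '*.fasta' piped to xargs -P (or GNU parallel) can spread the per-file work across cores, since each file is processed independently.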