Several sequences in my fasta file share the same name. I simply want to keep one sequence per name and remove the others with that name (the sequences themselves aren't unique). I'd think this could be done with a simple bash command, but I can't find a solution. I first tried to count the non-unique sequence names, but even that didn't work:
grep ">" fasta.file | uniq -c
1 >Sample1
1 >Sample2
1 >Sample3
1 >Sample1
Any suggestions for a simple bash script or other? Here's my sample fasta:
So in this case, I want to keep the first Sample1 and remove the second. I have very limited scripting/bioinformatics experience, so I greatly appreciate the help.
It looks like the awk step essentially converts from standard fasta format into a format of one entry per line. From there it's possible to use sort to only keep unique lines based on the fasta ID. The final step converts it back to normal fasta format (remove tabs, replace with new lines).
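For anyone who wants to see those three steps spelled out, here is a minimal sketch. It is not Pierre's exact code. The function name dedup_by_name and the demo sequences are made up for illustration (only the sample names come from the question), and it assumes the header lines contain no tab characters.

```shell
#!/usr/bin/env bash
# Hypothetical demo input: names are from the question, sequences are invented.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Sample1
ACGTACGT
ACGT
>Sample2
GGGGCCCC
>Sample3
TTTTAAAA
>Sample1
CCCCGGGG
EOF

# dedup_by_name is an illustrative helper, not part of any tool.
dedup_by_name() {
    # 1) Linearize: one record per line, as ">name<TAB>sequence".
    #    Wrapped sequence lines are joined into a single line.
    awk '/^>/ { printf("%s%s\t", (n++ ? "\n" : ""), $0); next }
         { printf("%s", $0) }
         END { if (n) printf("\n") }' "$1" |
    # 2) Keep one line per name: compare only the first tab-separated field.
    #    Records come out sorted by name, and which duplicate survives
    #    should not be relied on.
    sort -t "$(printf '\t')" -k1,1 -u |
    # 3) Back to fasta: turn the tab into a newline again.
    tr '\t' '\n'
}

deduped=$(dedup_by_name "$fasta")
printf '%s\n' "$deduped"
rm -f "$fasta"
```

Note that the output is no longer in the original record order and the sequences are unwrapped onto one line each, which is still valid fasta.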
If you are running the code on a single line, as you wrote it, then remove the \ characters: they are line continuations that tell the shell to ignore the line breaks in Pierre's multi-line version. I recommend keeping the structure just as Pierre wrote it and saving it in a text file.
Try nano rm_dup_fasta.sh, then copy Pierre's code into the file.
chmod 755 rm_dup_fasta.sh will make the file executable.
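To illustrate the point about the \ characters, here is a small sketch. The pipeline inside the script is only a placeholder (it is not Pierre's code), and the script is written to a temporary directory instead of being typed into nano, purely so the example runs on its own.

```shell
#!/usr/bin/env bash
# Create the script in a scratch directory (stand-in for editing it in nano).
workdir=$(mktemp -d)
script="$workdir/rm_dup_fasta.sh"

cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Placeholder pipeline: a trailing backslash joins the next line
# to the current command, so this is one command over three lines.
printf 'b\na\nb\n' | \
    sort | \
    uniq
EOF

chmod 755 "$script"      # make the file executable
multi=$("$script")       # run the multi-line version from the file

# The same command on a single line, with the backslashes removed.
single=$(printf 'b\na\nb\n' | sort | uniq)

printf 'multi-line:\n%s\nsingle-line:\n%s\n' "$multi" "$single"
rm -rf "$workdir"
```

Both forms give the same result. Leaving a \ in the middle of a single line is what causes trouble, because it then escapes the following character instead of a line break.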
+1 for linking that very useful GitHub page. Never thought of linearizing a fasta before. Do you know if that awk line works for all fasta files, whether or not there are new lines within the sequences?
By default, seqkit rmdup deduplicates by sequence ID; the -n option compares full names instead. Either way, the comparison is case-sensitive. At the moment, deduplicating by name in a case-insensitive way doesn't work, although case-insensitive deduplication by sequence does.
If the sequences themselves aren't unique, how do you know which sequence you want to keep for those with duplicated names?
I could quickly throw together a python script that will do what you're asking, but maybe someone has a quicker solution.
The short answer is it doesn't matter which one I keep.
By the way, your grep | uniq -c did not work because uniq only collapses identical lines that are adjacent, so you need to sort before piping to uniq, i.e. grep | sort | uniq -c.
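A minimal sketch of the difference, using a temporary file with the four header names from the question (the sequences are invented placeholders):

```shell
#!/usr/bin/env bash
# Hypothetical demo file: names from the question, sequences made up.
fasta=$(mktemp)
printf '%s\n' '>Sample1' 'ACGT' '>Sample2' 'GGCC' \
              '>Sample3' 'TTAA' '>Sample1' 'CCGG' > "$fasta"

# Without sort: the two >Sample1 lines are not adjacent,
# so uniq counts each of them separately.
unsorted=$(grep '^>' "$fasta" | uniq -c)

# With sort: identical names end up next to each other and are counted together.
sorted=$(grep '^>' "$fasta" | sort | uniq -c)

# uniq -d prints only the names that occur more than once.
dups=$(grep '^>' "$fasta" | sort | uniq -d)

printf 'unsorted:\n%s\n\nsorted:\n%s\n\nduplicated names:\n%s\n' \
       "$unsorted" "$sorted" "$dups"
rm -f "$fasta"
```

Anchoring the pattern as '^>' restricts the match to header lines, and uniq -d is a quick way to list just the duplicated names.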