Remove extra sequences without unique names from fasta
6.4 years ago
nmkn ▴ 10

I have hundreds of large fasta files (with almost a hundred sequences in each). In some files, I have duplicate sequence NAMES but the sequences themselves are NOT duplicates. I have found other similar posts, but they want to remove duplicate sequences: How To Remove The Same Sequences In The Fasta Files?, Remove Duplicates In Fasta (Protein Seq.)

I simply want to keep one sequence for each duplicated name and remove the others that share that name (even though the sequences themselves differ). I'd think this could be a simple bash command, but I can't find a solution. I first thought to try counting the number of non-unique sequence names, but even that didn't work:

grep ">" fasta.file | uniq -c

  1 >Sample1
  1 >Sample2
  1 >Sample3
  1 >Sample1

Any suggestions for a simple bash script or other? Here's my sample fasta:

>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg

So in this case, I want to keep the first Sample1 and remove the second. I have very limited scripting/bioinformatics experience, so I greatly appreciate the help.

fasta uniq duplicates sequence bash

If the sequences themselves aren't unique, how do you know which sequence you want to keep for those with duplicated names?

I could quickly throw together a python script that will do what you're asking, but maybe someone has a quicker solution.


The short answer is it doesn't matter which one I keep.


By the way, the reason your grep | uniq -c did not work is that uniq only collapses adjacent identical lines; you need to sort before piping to uniq. So:

grep "^>" file.fa | sort | uniq -c
6.4 years ago

linearize + sort with unique flag + restore fasta:

 cat in.fasta in.fasta |\
 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
 sort -t $'\t' -k1,1 -u |\
 tr "\t" "\n"

Just to clarify the steps:

  1. (cat) I concatenate all hundreds of my fasta files into ONE fasta file?
  2. (awk) this is the linearize step

How are the next two steps working? I tried providing the input file but it errors:

sort -t $'\t' -k1,1 -u |\ < fasta.file_linearize.out

It looks like the awk step essentially converts from standard fasta format into a format of one entry per line. From there it's possible to use sort to only keep unique lines based on the fasta ID. The final step converts it back to normal fasta format (remove tabs, replace with new lines).
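
To make that concrete, on the sample fasta from the question the awk step produces one tab-separated line per record (header, then sequence):

>Sample1	tctctccttt
>Sample2	tctctcattt
>Sample3	tctctccttt
>Sample1	tctctccttg

The sort -t $'\t' -k1,1 -u step then keeps a single line per header (which of the two Sample1 lines survives is not guaranteed), and tr "\t" "\n" splits each remaining line back into a header line and a sequence line.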


Thanks for the explanation. I can get the awk step to work using only one fasta file, but when I concatenate two files and use them as input it doesn't work:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\ < sample_file > sample_linear.out

Also, is the above sort line correct to provide the input file?


If you are executing the code on a single line as you wrote it, then remove the \ characters; they are there to escape the newlines in the multi-line form Pierre wrote. I recommend structuring it in a text file just as Pierre did.

Try nano rm_dup_fasta.sh, then copy Pierre's code into the file.

chmod 755 rm_dup_fasta.sh will make the file executable.

Then ./rm_dup_fasta.sh to run the bash script.
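
For example, the script could contain something along these lines; note that this version takes the input fasta as an argument (the "$1" argument and the usage comment are my additions, not part of Pierre's original answer):

#!/bin/bash
# rm_dup_fasta.sh : keep one record per fasta header name
# usage: ./rm_dup_fasta.sh in.fasta > in.dedup.fasta
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$1" |\
sort -t $'\t' -k1,1 -u |\
tr "\t" "\n"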


I concatenate all hundreds of my fasta files into ONE fasta file?

The cat in.fasta in.fasta was just me testing that running my pipeline twice on the same fasta file will output only one ID.

How are the next two steps working?

http://man7.org/linux/man-pages/man1/sort.1.html

I tried providing the input file but it errors:

https://stackoverflow.com/a/3871336/58082

sort -t $'\t' -k1,1 -u |\ < fasta.file_linearize.out

this is not valid bash syntax.
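
To feed sort the file, pass it as an argument or use a plain redirection, with no trailing |\, for example:

sort -t $'\t' -k1,1 -u fasta.file_linearize.out
sort -t $'\t' -k1,1 -u < fasta.file_linearize.out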


+1 for linking that very useful GitHub page. Never thought of linearizing a fasta before. Do you know if that awk line works for all fasta files, whether or not there are new lines within the sequences?

6.4 years ago

With seqkit. Input:

 $ cat test.fa 
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg

output:

$ seqkit rmdup test.fa --quiet
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt

Download seqkit from here

With a case-insensitive awk one-liner, modified from one of the answers here (upvote the original). Input fasta (a second test.fa, with duplicated names in mixed case):

>Sample1
tctctccttt
>Sample1
atgc
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg
>sample1
ttgc
>sample2
atgc
>sample4
ctga
>Sample2
tct

Output:

$ awk '/^>/ {if (!a[tolower($0)]++) {print;getline;print}}' test.fa 
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>sample4
ctga

If you want a random sequence picked each time from a fasta with duplicated IDs (case-insensitive):

$ seqkit fx2tab test.fa  | datamash  -sig 1 rand 2 | seqkit tab2fx

This seqkit command worked and was easy! If you add it as an answer I can mark it as "accepted".


By default seqkit rmdup dedups by name (the -n option), but this is case sensitive. Deduping by name currently doesn't work if you want it to be case insensitive; however, case-insensitive deduping by sequence does work.
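
For what it's worth, the case-insensitive dedup by sequence mentioned above would look something like the following (check seqkit rmdup --help for the exact flags; as I understand them, -s selects dedup by sequence and -i ignores case):

seqkit rmdup -s -i test.fa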
