How to remove duplicate fasta sequences from a file
5.4 years ago
Kumar ▴ 170

Hi, I have a total of 250 files containing scaffolds in FASTA format. However, several scaffolds are duplicate sequences that appear in different files under different headers. I therefore want to compare these files and produce a single file of unique scaffold sequences. Please see the following example files:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG

File 2:

>NODE_250_length_56_cov_292 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_251_length_56_cov_157 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_252_length_56_cov_86 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

OUTPUT:

File 1:

>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
alignment sequence Assembly • 1.9k views
5.4 years ago
h.mon 35k

Concatenate the fastas and use Dedupe or CD-HIT.


I tried the following command:

cd-hit-454 -i /home/kumarm/CD-HIT/phy_D-scaffold-marge.fasta -o /home/kumarm/CD-HIT/454_reads_95 -c 0.99 -M 0 -T 7

error: Fatal Error: in diag_test_aapn_est, MAX_DIAG reached Program halted !!

5.4 years ago

Use seqkit:

$ cat file1.fa file2.fa | seqkit rmdup -s -o out.fa
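
If installing a tool is not an option, the same idea (keep the first record seen for each distinct sequence, ignoring headers) can be sketched in plain Python. This is only a minimal illustration, not how seqkit works internally: the function names `read_fasta`, `dedupe_by_sequence` and `write_fasta` are made up for this example, the comparison is an exact, case-sensitive string match, and reverse complements are not treated as duplicates.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file.

    Multi-line sequences are joined into one string; blank lines are skipped.
    """
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_by_sequence(paths):
    """Return (header, sequence) records with unique sequences.

    Files are read in the order given; the first header seen for a
    sequence is the one that is kept (hypothetical helper, exact match only).
    """
    seen = set()
    unique = []
    for path in paths:
        for header, seq in read_fasta(path):
            if seq not in seen:
                seen.add(seq)
                unique.append((header, seq))
    return unique


def write_fasta(records, out_path):
    """Write (header, sequence) records, one sequence per line."""
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")
```

For the two example files in the question, `write_fasta(dedupe_by_sequence(["file1.fa", "file2.fa"]), "out.fa")` should keep the four records from File 1 plus NODE_253 from File 2. All sequences are held in memory, so for very large assemblies a dedicated tool is likely the better choice.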

Could you please let me know how to install seqkit?


I strongly recommend using bioconda.

The first part of this tutorial by me might be useful for you.
