5.4 years ago
Kumar
Hi, I have a total of 250 files containing scaffolds in FASTA format. Several scaffolds appear in more than one file as identical sequences under different headers. I want to compare these files and produce a single file containing only the unique scaffold sequences. Please see the following example files:
File 1:
>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
File 2:
>NODE_250_length_56_cov_292 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_251_length_56_cov_157 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_252_length_56_cov_86 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Expected output (one file containing only the unique sequences):
>NODE_265_length_56_cov_170 [gcode=11] [organism=Escherichia species] [strain=strain]
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>NODE_266_length_56_cov_121 [gcode=11] [organism=Escherichia species] [strain=strain]
GGTTCATCGATAGGAATTTAAATCCCCAAAAGACTAAAAAAGCATCACAAAACGGA
>NODE_267_length_56_cov_67 [gcode=11] [organism=Escherichia species] [strain=strain]
ATTATTTTTGTGGAGCCGGAGGAAACAAACCAGACGGTTCAGATGAGGCGCTTACG
>NODE_268_length_56_cov_43 [gcode=11] [organism=Escherichia species] [strain=strain]
TCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
>NODE_253_length_56_cov_29 [gcode=11] [organism=Escherichia species] [strain=strain]
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I tried the following command:
cd-hit-454 -i /home/kumarm/CD-HIT/phy_D-scaffold-marge.fasta -o /home/kumarm/CD-HIT/454_reads_95 -c 0.99 -M 0 -T 7
It fails with the following error:
Fatal Error: in diag_test_aapn_est, MAX_DIAG reached
Program halted !!
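If only exact duplicates need to be removed, as in the example files, the expected output can be reproduced without clustering by keeping the first record seen for each distinct sequence. Below is a minimal Python sketch of that idea, not a replacement for cd-hit: it assumes duplicates are character-for-character identical (ignoring case and line wrapping) and does not detect reverse complements or near-identical scaffolds. The function names and the `write_unique` helper are hypothetical, introduced here only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header line including '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield header, seq


def records_from_files(paths: Iterable[str]) -> Iterator[Record]:
    """Stream records from several FASTA files, in the order given."""
    for path in paths:
        with open(path) as handle:
            yield from read_fasta(handle)


def write_unique(paths: Iterable[str], out_path: str) -> int:
    """Write unique records to out_path (one sequence per line); return count."""
    count = 0
    with open(out_path, "w") as out:
        for header, seq in unique_records(records_from_files(paths)):
            out.write(f"{header}\n{seq}\n")
            count += 1
    return count


# Small in-memory demonstration with shortened stand-ins for the example files.
file1 = ">NODE_265\nCCCC\n>NODE_266\nGGTT\n>NODE_268\nTCAG\n"
file2 = ">NODE_250\nCCCC\n>NODE_251\nGGTT\n>NODE_253\nAAAA\n"
combined = file1.splitlines() + file2.splitlines()
for header, seq in unique_records(read_fasta(combined)):
    print(header, seq)
```

With real data, something like `write_unique(sorted(glob.glob("scaffolds/*.fasta")), "unique.fasta")` would process all 250 files in one pass; the directory and output names here are placeholders. Memory use grows with the number of distinct sequences, since each one is held in the `seen` set.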