Convert a text file to a FASTA file
8.2 years ago
sukesh1411 ▴ 30

Hi

This might be a very simple question, but I have not been able to convert a text file containing sequences in the format below into a .fasta file.

>gi|4|emb|X17276.1| Giant Panda satellite 1 DNA
GATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT
GGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTC
ACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTC
CGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCT
GACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGAT
GGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTA
TTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG
>gi|7|emb|X51700.1| Bos taurus mRNA for bone Gla protein
GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGCCCTGCTGGCCCTGGCCACACTCTGCCTCGC
TGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGG
TGGTGAAGAGACTCAGGCGCTACCTGGACCACTGGCTGGGAGCCCCAGCCCCCTACCCAGATCCGCTGGAGCCCAAGAGG
GAGGTGTGTGAGCTCAACCCTGACTGTGACGAGCTAGCTGACCACATCGGCTTCCAGGAAGCCTATCGGCGCTTCTACGG
CCCAGTCTAGAGCTTGCAGCCCTGCCCACCTGGCTGGCAGCCCCCAGCTCTGGCTTCTCTCCAGGACCCCTCCCCTCCCC
GTCATCCCCGCTGCTCTAGAATAAACTCCAGAAGAGG
blast • 17k views

You need to add ">" before each identifier, in this case "gi|4|emb|X17276.1| Giant Panda satellite 1 DNA" and "gi|7|emb|X51700.1| Bos taurus mRNA for bone Gla protein". The header and the sequence should be on separate lines:

>gi|4|emb|X17276.1| Giant Panda satellite 1 DNA
GATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCTGGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTCACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTCCGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCTGACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGATGGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTATTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG
>gi|7|emb|X51700.1| Bos taurus mRNA for bone Gla protein
GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGCCCTGCTGGCCCTGGCCACACTCTGCCTCGCTGGCCGGGCAGATGCAAAGCCTGGTGATGCAGAGTCGGGCAAAGGCGCAGCCTTCGTGTCCAAGCAGGAGGGCAGCGAGGTGGTGAAGAGACTCAGGCGCTACCTGGACCACTGGCTGGGAGCCCCAGCCCCCTACCCAGATCCGCTGGAGCCCAAGAGGGAGGTGTGTGAGCTCAACCCTGACTGTGACGAGCTAGCTGACCACATCGGCTTCCAGGAAGCCTATCGGCGCTTCTACGGCCCAGTCTAGAGCTTGCAGCCCTGCCCACCTGGCTGGCAGCCCCCAGCTCTGGCTTCTCTCCAGGACCCCTCCCCTCCCCGTCATCCCCGCTGCTCTAGAATAAACTCCAGAAGAGG
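If the sequence lines in your file are wrapped, as in the question, an awk sketch along these lines joins them so that each record is one header line followed by one sequence line. The sample records and the file names (wrapped.txt, unwrapped.fasta) are made up for illustration. Wrapped FASTA is valid too, so this step is optional.

```shell
# Hypothetical sample input with wrapped sequence lines.
workdir=$(mktemp -d)
cat > "$workdir/wrapped.txt" <<'EOF'
>seq1 example record
GATCCTCC
CCAGGCCC
>seq2 another record
GTCCACGC
AGC
EOF

awk '
  /^>/ {                        # header line starts a new record
    if (seq != "") print seq    # flush the sequence of the previous record
    print                       # print the header unchanged
    seq = ""
    next
  }
  { seq = seq $0 }              # append each wrapped sequence line
  END { if (seq != "") print seq }
' "$workdir/wrapped.txt" > "$workdir/unwrapped.fasta"

cat "$workdir/unwrapped.fasta"
```

Each header in the output is followed by exactly one sequence line.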
awk '{ if ( $0 ~ /^gi/ ) {gsub(" ","_",$0); print ">"$0 } else { print } }' in.txt > out.fasta

Hi

Thank you. The ">" is already there in the file. Can I still use the above command? Will it just replace the existing ">" symbol, or will it add another one to the header?


If it's already there, then it's already a FASTA file. What's the problem, then? You can just rename your file from .txt to .fasta.

The above command will not work on your file, because your header lines start with ">gi" rather than "gi". If you want to replace spaces with "_", use:

awk '{ if ( $0 ~ /^>gi/ ) {gsub(" ","_",$0); print $0 } else { print } }' in.txt > out.fasta
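If awk is not convenient, the same header clean-up can be sketched with sed. This assumes every header line starts with ">", and it rewrites all such lines, not only the ones starting with ">gi". The one-record sample file is created only to make the example self-contained.

```shell
# Hypothetical one-record input file.
workdir=$(mktemp -d)
printf '%s\n' '>gi|4|emb|X17276.1| Giant Panda satellite 1 DNA' 'GATCCTCC' > "$workdir/in.txt"

# On header lines only (those starting with ">"), replace every space with "_".
# Sequence lines are passed through unchanged.
sed '/^>/ s/ /_/g' "$workdir/in.txt" > "$workdir/out.fasta"

cat "$workdir/out.fasta"
```

sed processes the file line by line, so memory use should not grow with file size.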

Thank you. I ran the above command on the complete nucleotide database, but it stops abruptly with a segmentation fault (core dumped). Is there any other way I can do this?


The '>' gets auto-formatted on Biostars, so the OP probably did include it in the post.

The file you posted already looks like FASTA. How did you try to convert it, and what makes you think it didn't work?


Hi sir

I need to generate an index for the BLAST nt file so that I can run a BLAST search. I used the makeblastdb command to generate the index, but it failed with an error about duplicate seqIds. To remove the duplicate sequences from the nt file I tried uclust, but I could not run it because the nt file is a text file rather than a FASTA file.


By "nt", do you mean a generic nucleotide file, or the NCBI nt database?

If it is the former, run this:

grep "^>" your_file.fa | sort | uniq -c

That should show you which IDs are duplicated: any header with a count greater than 1 appears more than once. Edit the file to remove those duplicates.
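The command above compares whole header lines. As a rough sketch, the variant below compares only the sequence ID (the first word of the header), lists just the duplicated IDs, and then keeps the first record for each ID. The sample records and the output names (dup_ids.txt, dedup.fa) are made up for illustration. The second awk command holds every ID in memory, which may be a problem for a very large file.

```shell
# Hypothetical sample file in which the ID "id1" appears twice.
workdir=$(mktemp -d)
cat > "$workdir/your_file.fa" <<'EOF'
>id1 first record
ACGT
>id2 second record
GGCC
>id1 same ID again
TTAA
EOF

# List only the IDs (first word of each header) that occur more than once.
grep '^>' "$workdir/your_file.fa" | awk '{ print $1 }' | sort | uniq -d > "$workdir/dup_ids.txt"
cat "$workdir/dup_ids.txt"

# Keep the first record seen for each ID; skip later records with the same ID.
# "keep" is updated on header lines and then applies to the sequence lines below.
awk '/^>/ { keep = !seen[$1]++ } keep' "$workdir/your_file.fa" > "$workdir/dedup.fa"
cat "$workdir/dedup.fa"
```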


It is the NCBI nt file, and it's a big file. If I can split it into at least two files, I think I can remove the duplicates.
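For reference, splitting a FASTA file in two on record boundaries could be sketched as below. The input name (nt.txt), the output names (part1.fasta, part2.fasta), and the sample records are hypothetical. Splitting alone will not reveal duplicates that end up in different parts.

```shell
# Hypothetical sample file with four records.
workdir=$(mktemp -d)
cat > "$workdir/nt.txt" <<'EOF'
>id1
ACGT
>id2
GGCC
>id3
TTAA
>id4
CCGG
EOF

# Count the records, then send the first half to part1 and the rest to part2.
total=$(grep -c '^>' "$workdir/nt.txt")
half=$(( (total + 1) / 2 ))

awk -v half="$half" -v prefix="$workdir/part" '
  /^>/ { n++ }                                  # n = index of the current record
  {
    out = prefix (n <= half ? 1 : 2) ".fasta"   # choose the output file by record index
    print > out                                 # a header and its sequence lines stay together
  }
' "$workdir/nt.txt"

grep -c '^>' "$workdir/part1.fasta" "$workdir/part2.fasta"
```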


Why are you creating your own indexes when you can download the pre-formatted database directly from NCBI?


I downloaded all 39 nt files and extracted them. When I run the blastn command below:

blastn -query contigs.fa -db ntdb -outfmt 6 > known_sequences.blastx.nt.hits.txt

I get the error below:

BLAST Database error: No alias or index file found for nucleotide database [ntdb/] in search path

How can I solve this?


You can't make up your own BLAST database name. Use -db nt (with the full path if needed).
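As a sketch, you can check that the name you pass to -db really exists before running the search. The helper blast_db_exists and the directory /path/to/blastdb are hypothetical. The check assumes that a multi-volume nucleotide database is addressed through an alias file (nt.nal) and a single-volume one through an index file (nt.nin), which matches the "No alias or index file found" error above.

```shell
# Hypothetical helper: succeeds if directory $1 holds a nucleotide database named $2.
blast_db_exists() {
  [ -e "$1/$2.nal" ] || [ -e "$1/$2.nin" ]
}

# Placeholder path; point this at the directory where the nt files were extracted.
db_dir="/path/to/blastdb"

if blast_db_exists "$db_dir" nt && command -v blastn >/dev/null 2>&1; then
  # The database name is the file prefix "nt", given here with its full path.
  blastn -query contigs.fa -db "$db_dir/nt" -outfmt 6 > known_sequences.blastx.nt.hits.txt
else
  echo "nt database not found in $db_dir, or blastn is not on PATH" >&2
fi
```

With a made-up name such as ntdb the check fails, because no ntdb.nal or ntdb.nin file exists in the directory.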


Thank you :) It worked.
