Convert sequence file to fasta format using python
7.3 years ago
MAPK ★ 2.1k

Hi, I am new to Python and want to see how this can be done in Python (I can do this in R). I have a text file, myfile.txt, with one column and thousands of rows, as shown below. I want to convert it to FASTA format (result.fasta), as shown below. How can I do this in Python?

myfile.txt

ATGTGTGGTTTTCCCCC
ATTGGCGGGGTTTTTCAGGGG
ATGGGGGGGCCCCCCCCAAAAAA
TTGGTGGGGGGGGGGGGAA

result.fasta

>1
ATGTGTGGTTTTCCCCC
>2
ATTGGCGGGGTTTTTCAGGGG
>3
ATGGGGGGGCCCCCCCCAAAAAA
>4
TTGGTGGGGGGGGGGGGAA
7.3 years ago

Create a new script called ConvertFASTA.py:

import sys

#File input
fileInput = open(sys.argv[1], "r")

#File output
fileOutput = open(sys.argv[2], "w")

#Seq count
count = 1

#Loop through each line in the input file
print("Converting to FASTA...")
for strLine in fileInput:

    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #Output the header
    fileOutput.write(">" + str(count) + "\n")
    fileOutput.write(strLine + "\n")

    count = count + 1
print ("Done.")

#Close the input and output file
fileInput.close()
fileOutput.close()

Then, run it with:

python ConvertFASTA.py myfile.txt result.fasta

For the OP: Kevin did a good job showing the logic and each step. This will run in the same fashion.

#!/usr/bin/env python
import sys
n = 0
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            n += 1
            out.write('>' + str(n) + '\n' + line.strip() + '\n')

I have a similar query, but I want the names of the sequences to be used after the '>' symbol. For example:

Zebrafish ESLLRFGLRSDLDFR
Fugu ETVLSVGLSAETEIS
Chicken RALLAWGYSSDT

and I want:

result.fasta

>Zebrafish
ESLLRFGLRSDLDFR
>Fugu
ETVLSVGLSAETEIS
>Chicken
RALLAWGYSSDT

Can I get some guidelines?


If you search for this type of information first, you often will not need to wait for an answer. Here is one solution.
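
As a minimal sketch of one way to do this (not necessarily the approach referred to above), assuming each input line holds a name and a sequence separated by whitespace, as in the example. The function name names_to_fasta is hypothetical, and skipping malformed lines is an assumption rather than something stated in the thread:

```python
def names_to_fasta(lines):
    """Yield one FASTA record per line of the form '<name> <sequence>'."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Assumption: blank or malformed lines are skipped, not reported
            continue
        name, seq = fields[0], fields[1]
        yield ">%s\n%s\n" % (name, seq)
```

It could then be used along the lines of out.writelines(names_to_fasta(f)), where f is the opened input file and out is the opened result.fasta.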


I am really thankful for your support. However, I receive an error,

    fileInput = open(sys.argv[1], "r")
IndexError: list index out of range

when I try this solution (code) on Windows; I do not use Linux. As far as I know, argv is a built-in array on Linux, so I guess it does not work when I run the script in Python IDLE (3.6.4).
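
For what it is worth, sys.argv is part of Python's standard library and is available on Windows as well. The IndexError means the script was started without the two file names, which is what happens when it is run from inside IDLE instead of from a command prompt. A minimal sketch of a fallback, assuming a hypothetical helper get_paths and using the file names from the original question as defaults:

```python
def get_paths(argv, default_in="myfile.txt", default_out="result.fasta"):
    """Return (input_path, output_path) taken from argv, or the defaults."""
    if len(argv) >= 3:
        # Both file names were given on the command line
        return argv[1], argv[2]
    # Assumption: fall back to fixed names when no arguments are supplied,
    # for example when the script is run from within IDLE
    return default_in, default_out
```

In the script, the two open() calls would then use the values returned by get_paths(sys.argv) in place of sys.argv[1] and sys.argv[2].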

7.3 years ago

Something a bit simpler that should work in Python 2 and 3:

#!/usr/bin/env python

import sys

c = 1
for l in sys.stdin:
    sys.stdout.write(">%d\n%s\n" % (c, l.rstrip("\n")))
    c += 1

Usage:

$ convert.py < in.txt > out.fa

I think you want sys.stdout.write


You're right, thanks. Fixed!


Or even simpler:

import sys

for c, l in enumerate(sys.stdin, start=1):
    sys.stdout.write(">%d\n%s\n" % (c, l.rstrip("\n")))
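
If the input may contain empty lines (for example a trailing blank line at the end of the file), the same idea can be wrapped in a small generator. This is only a sketch: the name number_sequences is hypothetical, and skipping empty lines without numbering them is an assumption, not something the original question asks for:

```python
def number_sequences(lines, start=1):
    """Yield one numbered FASTA record per non-empty input line."""
    n = start
    for line in lines:
        seq = line.strip()
        if not seq:
            # Assumption: empty lines are ignored and do not consume a number
            continue
        yield ">%d\n%s\n" % (n, seq)
        n += 1
```

With import sys in place, sys.stdout.writelines(number_sequences(sys.stdin)) would give the same usage as above.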
7.3 years ago
st.ph.n ★ 2.7k
#!/usr/bin/env python

n = 0
with open('myfile.txt', 'r') as f:
    for line in f:
        n += 1
        print('>' + str(n) + '\n' + line.strip())

Thanks, but it does not increment the FASTA identifier (>1, >2, >3, ...). All sequences are named >1.


See edit, wrote it too quickly :)


To be fair to all posters you should accept all answers that work.


I agree, is that feature not available? It would help users to see various ways of doing it. Apologies, I was composing my solution whilst the other guy had posted!


Thanks to everyone, all answers accepted! I wasn't aware of this feature. I thought it was similar to Stack Overflow, where you have the option to accept only one answer.


One of the friendly features of Biostars. More than one way of doing things, and all instructive for those new to Python.

7.3 years ago

FOUR! 🏌⛳️

$ python -c "import sys; [sys.stdout.write('>'+str(i)+'\n'+seq) for i, seq in enumerate(sys.stdin, 1)]"