Remove character from Fasta IDs -- python
5.1 years ago

Hi,

Sorry if this question seems obvious; I am pretty new to Python and need some help completing this code. I have a FASTA file that looks like this:

>gene_name1|other
AAAAAAAATTTTTA
>gene_name2|other
TTTTTGGGGGAAA
>|gene_name3
TTTTTTTCCCCCCC
>|gene_name4
AAAAAATTTTTTTCC
....

Ideally I want to remove the | only when it appears at the beginning of the ID, not anywhere else. I wrote some Python code that does that, but I cannot get the output into a file. I can copy and paste the printed output and arrange it myself, but I would like to find a solution in Python.

My code so far:

from Bio import SeqIO
original_file = "sequences.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original_file, 'fasta')

for record in records:
    if record.id[0] == "|":
        print(">", record.id[1:], "\n",
              record.seq)
    else:
        print(">", record.id, "\n", record.seq)

result:

>gene_name1|other
AAAAAAAATTTTTA
>gene_name2|other
TTTTTGGGGGAAA
>gene_name3
TTTTTTTCCCCCCC
>gene_name4
AAAAAATTTTTTTCC

Can anyone help me correct the code and write the output to a FASTA file? Thanks!


Or, without using Python: sed 's/^>|/>/' input.fa


Love it! But I am trying to learn python. Still super appreciated.


You can use a list to collect all the corrected lines first, then write them into a file.


No need to store the reads; that will only use extra memory.
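
For reference, a streaming version in plain Python (no Biopython) could look like the sketch below. It mirrors the sed one-liner: only header lines that begin with `>|` are changed, and each line is written as soon as it is read, so nothing accumulates in memory. The function name `strip_leading_pipe` is a placeholder chosen for illustration, not something from the thread.

```python
def strip_leading_pipe(in_path, out_path):
    """Copy a FASTA file, dropping a '|' that directly follows '>' in a header."""
    with open(in_path) as original, open(out_path, "w") as corrected:
        for line in original:
            if line.startswith(">|"):
                # Keep the '>' and skip only the first '|'.
                line = ">" + line[2:]
            # Sequence lines and untouched headers pass through unchanged.
            corrected.write(line)
```

With the file names from the question, this would be called as `strip_leading_pipe("sequences.fasta", "corrected.fasta")`. A `|` elsewhere in the header, as in `>gene_name1|other`, is left alone.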

5.1 years ago
Eric Lim ★ 2.2k

Since you're already using SeqIO to parse the incoming FASTA, the easiest and quickest way is to reuse record and replace print with SeqIO.write. Everything else stays the same.

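for record in records:
  if record.id.startswith('|'):
    # Drop the leading '|' from the ID.
    record.id = record.id[1:]
    # Clear the description, which still holds the old '|...' header text;
    # otherwise SeqIO.write would append it after the new ID.
    record.description = ''
  # Write every record, changed or not, to the open output handle.
  SeqIO.write(record, corrected, 'fasta')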

You should still follow the link that @JC posted to learn about general file reading and writing in Python outside the Biopython ecosystem.


Thanks a lot, that works perfectly. I added my else statement too. In case it helps others, here is the final version:

    for record in records:
        if record.id.startswith('|'):
            record.id = record.id[1:]
            record.description = ''
        else:
            print(record.id)
        SeqIO.write(record, corrected, 'fasta')
5.1 years ago
Corentin ▴ 610

You are almost there. Instead of printing the results to the screen with "print()", you need to use the file handle you created, "corrected", to write to the file.

For this you can use the "write()" method, as described in the Python documentation (https://docs.python.org/3/tutorial/inputoutput.html):

f.write(string) writes the contents of string to the file, returning the number of characters written.

Here "f" is the file handle (the variable you create with "as x").
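
As a small illustration of the difference (a sketch only, with `io.StringIO` standing in for a real file and made-up record values): `print()` with comma-separated arguments inserts a space between them, so the header would come out as `> gene_name3`. Building the string yourself and passing it to `write()`, or using `print(..., sep="", file=f)`, avoids that.

```python
import io

record_id = "gene_name3"      # hypothetical values for illustration
sequence = "TTTTTTTCCCCCCC"

# Option 1: build the text and hand it to write(); newlines are explicit.
f = io.StringIO()             # in-memory stand-in for open("corrected.fasta", "w")
n_written = f.write(f">{record_id}\n{sequence}\n")
written = f.getvalue()

# Option 2: print() into the handle; sep="" suppresses the default spaces.
g = io.StringIO()
print(">", record_id, "\n", sequence, sep="", file=g)
printed = g.getvalue()

# For comparison: the same print() call with the default separator,
# which is the shape used in the question's code.
h = io.StringIO()
print(">", record_id, "\n", sequence, file=h)
with_default_sep = h.getvalue()
```

Here `written` and `printed` hold the same two-line FASTA record, while `with_default_sep` contains extra spaces after `>` and around the newline.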

However, in your current code you are not using "with open()" correctly:

  • First, you never use the "original" file handle; you only use the filename, because you pass the variable "original_file" as the argument to the parse() method.
  • Second, when using "with open()" you should write the code that deals with the files inside the indented block, for example:

    with open("test.txt", "w") as f:
        f.write("this will be written in test.txt")
    

That is because the end of the indented "with open()" block automatically closes the file handle, so you can no longer read from or write to the file through it.
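
A quick way to see this for yourself, as a sketch using a throwaway temporary file (the path and variable names are only illustrative): the handle reports itself as closed once the block ends, and a later write raises a ValueError.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")  # throwaway location

with open(path, "w") as f:
    f.write("this will be written in test.txt")
    open_inside = not f.closed   # True: the handle is usable here

closed_after = f.closed          # True: leaving the with block closed it

try:
    f.write("too late")          # writing after the block has ended
    error = None
except ValueError as exc:        # raised for I/O on a closed file
    error = exc

with open(path) as f:
    contents = f.read()          # only the first write reached the file
```

This is why the loop over the records has to sit inside the "with" block if it writes to "corrected".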
