Remove spaces from fasta file in python
1
0
Entering edit mode
3.5 years ago
anasjamshed ▴ 140

I have downloaded Severe acute respiratory syndrome coronavirus 2 isolate FASTA sequence but when I tried to get the exact length and C and G count I got the wrong answer every time.

Script:

genome = open("C://Users//USER//Desktop/Corona.fasta")
dna = genome.read()
dna1 = dna.rstrip("\n")
print(len(dna1))

The exact length of the sequence is 29903 without a header 30018 but I don't understand why I am not getting 30018 although I am using strip function as well

Plz help me

Python • 3.5k views
ADD COMMENT
1
Entering edit mode

Please use biopython or other fasta parsing libraries, for reading fasta files and subsequent operations. Post example file for better understanding the query.

ADD REPLY
0
Entering edit mode

I have already solved it by using biopython but I want to do it without using any library?

ADD REPLY
0
Entering edit mode
seq=[]
with open ("test.fa","r") as w:
    for i in w:
        if i.startswith (">"):
            _head=i.strip("\n>")
            _seq= next(w, '').strip('').replace(" ","")
            gc_count=sum(map(_seq.upper().count,["G","C"]))
            gc_perc=gc_count/len(_seq)
            _temp_seq=[_head, len(_seq),gc_count,gc_perc]
            seq.append(_temp_seq)

[print (i) for i in seq] 

Check this code. This prints fasta header, length of the sequence, Number of G and C, % GC. It strips all the spaces in the seqence. This code expects the sequence in a single line in a fasta record

ADD REPLY
0
Entering edit mode

what is with?

ADD REPLY
0
Entering edit mode
['gi|1798174254|ref|NC_045512.2| Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', 71, 25, 35.2112676056338]

It is also producing wrong length and GC content

ADD REPLY
0
Entering edit mode

You have missed this part of my post: This code expects the sequence in a single line in a fasta record. From your post, I can see that fasta record used is multiline record.

Following is the comparison of results from fairly known tool and the script and I have used this sequence for test:

$ python test.py 
['NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', 29904, 11355, 0.379715088282504]

$ seqkit fx2tab -n -g seq.fa 
NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome 37.97

GC content from the script : 0.379715088282504 and GC content from the tool: 37.97. Code doesn't multiply GC content by 100.

ADD REPLY
0
Entering edit mode

which tool you are using?

ADD REPLY
0
Entering edit mode

It says quite clearly in the command - seqkit.

ADD REPLY
1
Entering edit mode
3.5 years ago

There are numerous examples of substring counting that can be found with a bit of googling, e.g. here.

Please post a link to the sequence/file in question if you want specific help with your issue. My guess is that it could be a platform-specific issue (try dna.rstrip("\n\r"), which should remove all newline characters. Alternatively, just plan dna.rstrip() should also remove all newline and whitespace characters (spaces, tabs).

You are reading the whole file into a single string, from which only the last newline character is being stripped. It is easier to replace all instances of the newline character since it is going to be interspersed throughout your read.

dna1 = dna.replace('\n', '')
ADD COMMENT
0
Entering edit mode

rstrip just removing 2 spaces

ADD REPLY
0
Entering edit mode

Oh, you are reading in the entire file at once, so my answer will indeed only remove the last newline character. I will edit my answer.

ADD REPLY
0
Entering edit mode

how I remove it?

ADD REPLY
0
Entering edit mode

how can I remove the header?

ADD REPLY
1
Entering edit mode

You can just read the file line by line while trimming and skip the header:

dna = []
with open("C://Users//USER//Desktop/Corona.fasta") as genome:
    # Skip header.
    genome.readline()

    # Iterate through each line, strip newline, and add to your dna list.
    for line in genome:
        line = line.strip()
        dna.append(line)

# Concatenate all elements of the list to a string with no space between them.
dna1 = "".join(dna)
print(len(dna1))
ADD REPLY

Login before adding your answer.

Traffic: 1155 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6