I have downloaded the Severe acute respiratory syndrome coronavirus 2 isolate FASTA sequence, but when I tried to get the exact length and the C and G counts, I got the wrong answer every time.
Script:
genome = open("C://Users//USER//Desktop/Corona.fasta")
dna = genome.read()
dna1 = dna.rstrip("\n")
print(len(dna1))
The exact length of the sequence is 29903, without a header 30018, but I don't understand why I am not getting 30018 even though I am using the strip function as well.
Please help me.
Please use Biopython or another FASTA parsing library for reading FASTA files and for subsequent operations. Post an example file so that the query is easier to understand.
I have already solved it by using Biopython, but I want to do it without using any library.
Check this code. It prints the FASTA header, the length of the sequence, the number of G and C, and the % GC. It strips all the spaces in the sequence. This code expects the sequence to be on a single line in a FASTA record.
What is "with"?
It is also producing the wrong length and GC content.
You have missed this part of my post: "This code expects the sequence in a single line in a fasta record." From your post, I can see that the FASTA record used is a multi-line record.
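For a multi-line record, a minimal sketch without any library might look like the following. It is not the code from the earlier post. It assumes the file holds a single FASTA record, and the names parse_single_fasta and gc_stats, as well as the file name in the comment, are only illustrative. The point is that the header line and every line break have to be dropped before counting, which rstrip("\n") on the whole file does not do (it only removes newlines at the very end).

```python
def parse_single_fasta(text):
    """Return (header, sequence) from the text of a single-record FASTA file."""
    header = ""
    seq_parts = []
    for line in text.splitlines():
        line = line.strip()          # drop spaces and stray "\r" from Windows line endings
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # header line, not part of the sequence
        else:
            seq_parts.append(line)   # one piece of the multi-line sequence
    return header, "".join(seq_parts)


def gc_stats(seq):
    """Return (G count, C count, GC fraction between 0 and 1)."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    gc_fraction = (g + c) / len(seq) if seq else 0.0
    return g, c, gc_fraction


# Small in-memory demo; with a real file one would do something like
#     with open("Corona.fasta") as handle:   # hypothetical path; "with" closes the file automatically
#         header, seq = parse_single_fasta(handle.read())
demo = ">demo record\nACGT\nGGCC\n"
header, seq = parse_single_fasta(demo)
print(header, len(seq), gc_stats(seq))
```

Joining the pieces once at the end avoids repeated string concatenation, which matters little for a 30 kb genome but is the usual habit for longer sequences.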
Following is a comparison of results from a fairly well-known tool and the script; I have used this sequence for the test:
GC content from the script: 0.379715088282504; GC content from the tool: 37.97. The code doesn't multiply the GC content by 100.
Which tool are you using?
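To make the two numbers directly comparable, the fraction can be turned into a percentage and rounded to two decimals. A small sketch, where gc_percent is just an illustrative helper name and not part of the script or of the tool:

```python
def gc_percent(gc_fraction, digits=2):
    """Format a GC fraction (0..1) as a percentage string with fixed decimals."""
    return f"{gc_fraction * 100:.{digits}f}"


print(gc_percent(0.379715088282504))  # prints 37.97
```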
It says quite clearly in the command - seqkit.