There are obviously quite a few more lines of sequence, but this is how the file is formatted. It looks like GenBank, but it is in a .txt file, and I am not sure how to do that conversion either.
I was able to write code that reads every line of the text file into a list, but I can't just use a function that removes the first 8 lines, because the code needs to work on other documents that may differ from this one. So how do I search for just the nucleotide sequences?
Edit: I didn't realize the formatting reverted. The file I am using is *.txt
import re

with open("dna_sequence.txt") as fh:
    # keep only the lines made up entirely of A, G, C and T
    mydna = [line for line in fh if re.match(r'^[AGCT]+$', line)]
#print(mydna)

complement = ""
for i in mydna:
    for x in i:
        if x == 'A':
            complement += 'T'
        elif x == 'G':
            complement += 'C'
        elif x == 'C':
            complement += 'G'
        elif x == 'T':
            complement += 'A'
print(complement)
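The character-by-character loop above can also be written with a translation table. The following is only a sketch of that alternative, not a replacement for the answer: the function name `complement_of_lines` is made up for illustration, and keeping `N` as `N` is an assumption (the loop above silently drops anything that is not A, C, G or T).

```python
import re

# Translation table for complementing. Mapping N to N is an assumption;
# the loop-based version drops every character other than A, C, G and T.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# A sequence line consists of nucleotide letters and nothing else.
SEQ_LINE = re.compile(r"[ACGTN]+")


def complement_of_lines(lines):
    """Join all nucleotide-only lines and return their complement."""
    seq = "".join(
        line.strip() for line in lines if SEQ_LINE.fullmatch(line.strip())
    )
    return seq.translate(COMPLEMENT)


if __name__ == "__main__":
    example = ["ACCESSION: U49845\n", "GATCCTCC\n", "ATATACAA\n"]
    print(complement_of_lines(example))  # prints CTAGGAGGTATATGTT
```

With a real file, `complement_of_lines(open("dna_sequence.txt"))` should behave the same way, since a file object yields its lines one at a time.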
The logic behind this is that the header lines have fields delimited by ":", so splitting them gives at least 2 fields. The sequence lines contain no ":", so splitting them gives exactly 1 field.
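import sys

with open(sys.argv[1], "r") as genefile:
    for line in genefile:
        fields = line.split(":")
        # header line carrying the accession: emit it as a FASTA header
        if fields[0] == "ACCESSION":
            print(">" + fields[1].strip())
        # no ":" in the line, so it is a sequence line
        if len(fields) == 1:
            print(fields[0].strip())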
I am sure there are more efficient and concise ways of doing this, but this gets you the desired result.
#!/usr/bin/env python3

# list to capture results
result = []
with open("dna.txt") as fh:
    for line in fh:
        if line.startswith(("LOCUS", "ORGANISM", "AUTHORS", "TITLE", "JOURNAL", "PUBMED", "SOURCE")):
            # skip lines starting with these
            continue
        elif line.startswith("ACCESSION"):
            # capture accession ID
            line = line.rstrip()  # strip the trailing newline/whitespace
            acc = line.split(": ")  # splitting "ACCESSION: U49845"
            result.append(">" + acc[1])  # append accession ID to the list
        else:
            # the line with nucleotide sequence
            line = line.rstrip()
            result.append(line)  # append sequence line to the list

# file to write the results; append mode, so repeated runs add to the file
with open("resultingFasta.fa", "a") as fasta:
    for x in result:
        fasta.write(x + "\n")  # writing it out to the file
Why not just parse the ACCESSION lines, get the accession and download the sequences in FASTA format using Entrez Direct? Do you expect these sequences to be somehow different from the versions that are public? Can you update your post to include the format of the sequences in the context of the header too?
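As a rough sketch of the suggestion to parse the ACCESSION lines: the code below pulls the accession out of each such line and builds an NCBI E-utilities `efetch` URL for it, without actually downloading anything. The function names are made up for illustration, and using the `nuccore` database is an assumption that these are nucleotide records.

```python
import re
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def accessions(lines):
    """Return the accession from every line that starts with ACCESSION.

    Accepts both "ACCESSION: U49845" and the whitespace-separated
    "ACCESSION   U49845" layout.
    """
    found = []
    for line in lines:
        match = re.match(r"ACCESSION[:\s]+(\S+)", line)
        if match:
            found.append(match.group(1))
    return found


def efetch_fasta_url(accession):
    """Build the efetch URL that asks for one record as FASTA text.

    db="nuccore" is an assumption: it only fits nucleotide records.
    """
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


if __name__ == "__main__":
    for acc in accessions(["LOCUS: X\n", "ACCESSION: U49845\n", "GATC\n"]):
        print(efetch_fasta_url(acc))
```

With Entrez Direct installed, the equivalent command for one accession should be along the lines of `efetch -db nuccore -id U49845 -format fasta`.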
That looks like a part of GenBank format file, which is text based. Only some of the lines appear to have been retained.
Realistically, if there were a way to just remove the header without having to convert the file, that would also work. However, since it needs to work on multiple possible documents, I cannot hard-code something like range(1,9).
You could perhaps check to see if lines have [ACTGN] and nothing else. Retain those/remove others.
How about a customisable parameter for how many header lines to skip? Or are you hoping that the script can ‘figure this out for itself’?
Yeah, I am hoping the script can figure that out on its own.
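One way a script could work out the header length for itself is to treat the first line made up only of nucleotide letters as the start of the sequence. This is a minimal sketch under that assumption; the name `split_header` is hypothetical, and a header line that happens to consist only of the letters A, C, G, T and N would be mistaken for sequence.

```python
import re

# A sequence line holds only nucleotide letters (upper or lower case).
SEQ_LINE = re.compile(r"[ACGTN]+", re.IGNORECASE)


def split_header(lines):
    """Split lines into (header_lines, sequence_lines).

    The header is everything before the first nucleotide-only line.
    Sequence lines are returned stripped, with blank lines dropped.
    """
    for i, line in enumerate(lines):
        if SEQ_LINE.fullmatch(line.strip()):
            sequence = [l.strip() for l in lines[i:] if l.strip()]
            return lines[:i], sequence
    # No sequence line found: everything counts as header.
    return lines, []


if __name__ == "__main__":
    with_header = ["LOCUS: X\n", "ACCESSION: U49845\n", "GATC\n", "aacc\n"]
    header, sequence = split_header(with_header)
    print(len(header), "header lines;", "".join(sequence))
```

The length of the returned header list plays the role of the "number of lines to skip" parameter, so nothing has to be hard-coded per file.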
The other thing I came up with uses pandas, but I keep getting an error for the line
dna = pd.read_csv(sep = "\n", header = None, names = ['code'])
and I am not sure what it is talking about. Edit: that was because I didn't add my file path, haha, but now I'm getting an unknown error for invalid syntax.