Entering edit mode
4.8 years ago
nadiabeg.comsats
▴
10
Hi all, I have a genome of size 7190 mb. I am using following python code to extract the flanking sequence nucleotide sequence around a list of variants. My code works for a smaller genome but it takes so long to process and fetch the sequence from a fasta file. Is there any way to speed up the process?
here is my code:
fr = open("cov_40/rep_1/test.vcf", "r")
a = fr.read()
aa = a.split("#CHROM")
aaa = aa[1].split("\n")
seq = ''
header = ''
sequence = {}
new_dict = {}
with open("Agria_reference_sequence_final.fasta") as fs:
for linee in fs:
if linee.startswith(">"):
header = linee.split("\n")[0]
seq = ''
# print(header)
else:
seq = seq + linee.split("\n")[0]
sequence.update({header: seq})
# print(seq)
for line in aaa[1:-1]:
index1 = line.split('\t')[0]
index = line.split()
# print(index[0],index[1],index[2],index[3],index[4] )
head = index[0] + " " + index[1] + " " + index[2] + " " + index[3] + " " + index[4] + " " + " " + index[5]
for keys, values in sequence.items():
if index1 in keys:
site = int(index[1]) - 1
start = abs(site - 10)
end = site + 50
seqq = values[start:end + 1]
new_dict.update({head: seqq})
fw = open("per_variants.txt", "a")
for base, basee in new_dict.items():
fw.write(str(base) + " " + str(basee) + "\n"
Have you considered bedtools flank?
I guess bedtools flank gives only intervals not complete sequence present around a given variant.
bedtools does it all! see: bedtools getfasta