Question

Flanking sequence retreival

0

Entering edit mode

5.3 years ago

nadiabeg.comsats ▴ 10

Hi all, I have a genome of size 7190 mb. I am using following python code to extract the flanking sequence nucleotide sequence around a list of variants. My code works for a smaller genome but it takes so long to process and fetch the sequence from a fasta file. Is there any way to speed up the process?

here is my code:

fr = open("cov_40/rep_1/test.vcf", "r")
a = fr.read()
aa = a.split("#CHROM")
aaa = aa[1].split("\n")

seq = ''
header = ''
sequence = {}
new_dict = {}


with open("Agria_reference_sequence_final.fasta") as fs:
    for linee in fs:
        if linee.startswith(">"):
            header = linee.split("\n")[0]
            seq = ''
            # print(header)

        else:
            seq = seq + linee.split("\n")[0]
            sequence.update({header: seq})
            # print(seq)

for line in aaa[1:-1]:
    index1 = line.split('\t')[0]
    index = line.split()
    # print(index[0],index[1],index[2],index[3],index[4] )
    head = index[0] + " " + index[1] + " " + index[2] + " " + index[3] + " " + index[4] + " " + " " + index[5]
    for keys, values in sequence.items():
        if index1 in keys:
            site = int(index[1]) - 1

            start = abs(site - 10)
            end = site + 50
            seqq = values[start:end + 1]
            new_dict.update({head: seqq})

fw = open("per_variants.txt", "a")
for base, basee in new_dict.items():
    fw.write(str(base) + " " + str(basee) + "\n"

Variants fasta sequence extraction • 1.7k views

ADD COMMENT • link updated 5.3 years ago by finswimmer 16k • written 5.3 years ago by nadiabeg.comsats ▴ 10

0

Entering edit mode

Have you considered bedtools flank?

ADD REPLY • link 5.3 years ago by igor 13k

0

Entering edit mode

I guess bedtools flank gives only intervals not complete sequence present around a given variant.

ADD REPLY • link 5.3 years ago by nadiabeg.comsats ▴ 10

1

Entering edit mode

bedtools does it all! see: bedtools getfasta

ADD REPLY • link 5.3 years ago by igor 13k

score 1 · Answer 1 · 2020-01-21

1

Entering edit mode

5.3 years ago

rbagnall ★ 1.8k

Extract subsequence from indexed reference sequence using samtools faidx.

samtools faidx your.fasta chrX:start-end

ADD COMMENT • link 5.3 years ago by rbagnall ★ 1.8k

0

Entering edit mode

thanks, but I have to pick variant location from vcf file and it contains millions of variants. I am not sure how long will it takes to fetch sequence for each variant position.

ADD REPLY • link 5.3 years ago by nadiabeg.comsats ▴ 10

1

Entering edit mode

Samtools

   time samtools faidx Homo_sapiens_assembly38.fasta chr8:300000-300050
>chr8:300000-300050
GACCACTTGTTCTGCATTTTCTCCATCTTCCTTGTGATTAGAAACCTCAAA

real    0m0.174s
user    0m0.003s
sys 0m0.024s

Bedtools

    time bedtools getfasta -fi Homo_sapiens_assembly38.fasta -bed test.bed
>chr8:300000-300050
ACCACTTGTTCTGCATTTTCTCCATCTTCCTTGTGATTAGAAACCTCAAA

real    0m0.012s
user    0m0.007s
sys 0m0.003s

Bedtools wins!

ADD REPLY • link 5.3 years ago by rbagnall ★ 1.8k