Extract SNPs flanking sequences based on VCF and genome Fasta files
4
3
Entering edit mode
6.3 years ago
Denis ▴ 310

I have VCF file (containing diallelic variants) and reference genome in Fasta of some non-model plant. I'd like to extract SNPs flanking sequences. I've found that bedtools and samtools faidx could be to some extent useful, but apparently don't solve the issue. I need to get output (fasta or tabular file) with following SNPs representation : TCTCTGCCAATCACTAGAGGCCGCTTTCGCTTTTA[A/G]TTTGTGTGTGGTCAGAGTTCTTCCGGACTTT

snp sequence genome • 8.0k views
ADD COMMENT
5
Entering edit mode
6.3 years ago

Here a little python solution:

import pysam


# open vcf file
vcf = pysam.VariantFile("input.vcf")
# open fasta file
genome = pysam.FastaFile("genome.fa")
# define by how many bases the variant should be flanked
flank = 50

# iterate over each variant
for record in vcf:
    # extract sequence
    #
    # The start position is calculated by subtract the number of bases
    # given by 'flank' from the variant position. The position in the vcf file
    # is 1-based. pysam's fetch() expected 0-base coordinate. That's why we
    # need to subtract on more base.
    #
    # The end position is calculated by adding the number of bases
    # given by 'flank' to the variant position. We also need to add the length
    # of the REF value and subtract again 1 due to the 0-based/1-based thing.
    #
    # Now we have the complete sequence like this:
    # [number of bases given by flank]+REF+[number of bases given by flank]
    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

    # print out tab seperated columns:
    # CRHOM, POS, REF, ALT, flanking sequencing with variant given in the format '[REF/ALT]'
    print(
        record.chrom,
        record.pos,
        record.ref,
        record.alts[0],
        '{}[{}/{}]{}'.format(seq[:flank], record.ref, record.alts[0], seq[flank+len(record.ref):]),
        sep="\t"
    )

fin swimmer

EDIT:

I rewrote the code. The version before had an off-by-one error.

ADD COMMENT
0
Entering edit mode

@finswimmer Hi! Many thanks. I need some time for the code understanding.

ADD REPLY
1
Entering edit mode

I added some comments to the code. If something isn't clear just ask.

ADD REPLY
0
Entering edit mode

Hi swimmer,

thanks so much for this solution! I used it in my thesis project and would like to reference it properly and cite the source. Just wondering if you have a GitHub repository or something similar that I could refer to? many thanks Laura

ADD REPLY
2
Entering edit mode
6.3 years ago

A one-liner using biostar251649 ( http://lindenb.github.io/jvarkit/Biostar251649.html ) and bioalcidaejdk ( http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html )

    $ java -jar dist/biostar251649.jar -R src/test/resources/rotavirus_rf.fa src/test/resources/rotavirus_rf.vcf.gz -n 20 |\
    java -jar dist/bioalcidaejdk.jar -F VCF -e 'stream().forEach(V->println(">"+V.getContig()+":"+V.getStart()+"\n"+V.getAttribute("SEQ5_20")+"["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]"+V.getAttribute("SEQ3_20")));'

first command adds two attribute to the VCF containing the flanking sequences.

the second command:

  • stream() convert to a stream of variant

  • forEach(V->println(">" for each variant start writing the fasta header

  • V.getContig()+":"+V.getStart()+"\n" write the fasta header

  • V.getAttribute("SEQ5_20") write 5' seq

  • ["+V.getAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining("/"))+"]" write all alleles separated with '/'

  • V.getAttribute("SEQ3_20"))) write 3' seq

output:

>RF01:970
TGGTCGATTGCTCTATTGAA[A/C]AATTTCCATTGATGGCTAAA
>RF02:251
TGCTTGAAGTTTTGAAAACA[A/T]AAGAAGAGCACCAAAAAGAG
>RF02:578
TACTATTAAAAGATATGGCA[G/A]TTGAAAATAAAAATTCGAGA
>RF02:877
GACGTTAACTATATACTTAA[T/A]ATGGACAGAAATCTGCCATC
>RF02:1726
AATATGCAACATGTTCAGAC[T/G]TTGACAACAGAAAAATTACA
>RF02:1962
TGAAGATTTTTTAAAAAGAT[TACA/TA]TATTTTCGATGTAGCTAGAG
>RF03:1221
TGACTGCCATGGATTTAGAA[C/G]TACCGATATCTGCGAAATTA
>RF03:1242
TACCGATATCTGCGAAATTA[C/A]TACACCACCCAACTACAGAA
>RF03:1688
TCAAAACCAACTGGAAATAA[T/G]CATCTGTTCATTTTGAGCGG
>RF03:1708
TCATCTGTTCATTTTGAGCG[G/T]TACTAATAAATATTTTAAAT
>RF03:2150
GCCGATGATCCGAACTATTT[T/A]ATTGGAATTAAGTTCAAAAA
>RF03:2201
GAATATGATGTTAAAGTACC[G/C]CATCTTACATTTGGTGTATT
ADD COMMENT
0
Entering edit mode
6.3 years ago
Denis ▴ 310

I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me.

from __future__ import print_function
import pysam



f = open("SNP_CDS.txt", "w")

vcf = pysam.VariantFile("test.vcf")

genome = pysam.FastaFile("genome.fasta")

L=[]

flank=10



for record in vcf:

    seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
    t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
    L.append(t)



for i in L:

    seq,chr,pos,id,ref,alt=i

    sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

    line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

    f.write(line+'\n')



f.close()
ADD COMMENT
2
Entering edit mode

Hello Denis,

Probably, some additional optimisation and syntax polishing are required

if you like, here are my notes about this:

  1. Use python3 :) Whenever you write a new script with python take python3 instead of python2. No one knows how long python2 will be supported
  2. In bioinformatics it is common that program writes to stdout by default. Doing so the user can decide whether it should be piped to another program or the output is redirected to a file with $ script.py > outputfile
  3. When working with reading/writing files it is a good style to work with the with statement. Doing so you don't need to take care of closing files.
  4. Theres no need to first append your data to a list and iterate afterwards over that list to write the output to the file. Just write it directly to it.

fin swimmer

ADD REPLY
0
Entering edit mode

Hello finswimmer! Agree! Many thanks for your comments! They are extremely useful. I've changed the code (added with statement). Please see the newer version below. I'm wondering how can i adjust the code regarding your the 4-th point? Could you do it please?

ADD REPLY
0
Entering edit mode
6.3 years ago
Denis ▴ 310

Improved version

from __future__ import print_function
import pysam


    vcf = pysam.VariantFile("Sun_CDS.vcf")

    genome = pysam.FastaFile("genome.fasta")

    flank=62


    with open("SNP_CDS.txt", "w") as f:

        for record in vcf:

            seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)

            t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])

            seq,chr,pos,id,ref,alt=t

            sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])

            line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)

            f.write(line+'\n')
ADD COMMENT
0
Entering edit mode

Hi Denis, I used following code to do exactly what you are doing using pysam. Problem is my code is not very fast for bigger genomes. So I switched to the aforementioned code posted by you. It works when you set a lower flank score. Suppose I need a flanking sequence of 500bp on 5' ......pos......3' and I have first variant at pos [25] it throws an error. Could you suggest me a way to deal with this problem? How can I modify your code that if the variant is a pos 25 and it does not finds a 500 bp flanking sequence at start, it must pick the total bases present at start.

My code is given below. It works for smaller genomes but takes longer for large genomes and It also works for variants having flanking sequence present at start less than the mentioned flank value.

fr=open("test.vcf","r")
a=fr.read()
aa=a.split("#CHROM")
aaa=aa[1].split("\n")
seq=''
header=''
sequence={}
new_dict={}
with open("genome.fasta") as fs:

for linee in fs:

       if linee.startswith(">"):

            header=linee.split("\n")[0]
            seq=''

       else:
             seq=seq+linee.split("\n")[0]
             sequence.update({header:seq})



for line in aaa[1:-1]:
index1 = line.split('\t')[0]
index= line.split()

head=">"+index[0]+" "+index[1]+" "+index[2]+" "+index[3]+" "+index[4]+" "+" "+index[5]+" "+index[6]+" "+index[7]+" "+index[8]+" "+index[9]+" "+index[10]+" "+index[11]+" "+index[12]+" "+" "+index[13]+" "+index[14]+" "+index[15]+" "+index[16]+" "+index[17]

for keys, values in sequence.items():
    if index1 in keys:
                   site=int(index[1])-1  

                   start=abs(site-500)  #5' direction
                   end=site+500   #3' direction
                   seqq=values[start:end+1]
                   new_dict.update({head:seqq})
  print(new_dict)

  fw=open("sequences_per_variants.txt","w")
   for base,basee in new_dict.items():
   fw.write(str(base)+" "+str(basee)+"\n")
   fw.close()
ADD REPLY
0
Entering edit mode

Hi, It seems that you have to add additional block with if statement to check if record.pos-1-flank is greater than 0. Perhaps the same aproach and for record.pos-1+len(record.ref)+flank it should be smaler or at least equal than chromosome length. But i'm not sure as i din't run it.

ADD REPLY

Login before adding your answer.

Traffic: 2310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6