how to extract REF and ALT DNA string around SNV
1
0
Entering edit mode
3.7 years ago

Hi, everyone

I want to reproduce the operation about Motif identification at positions of SNVs, which is the part of "The chromatin accessibility landscape of primary human cancers"

The detailed method is below:

To identify motifs whose binding affinity is predicted to be most perturbed by a given SNV, we constructed a REF and VAR string (+/- 10 bp from the SNV, 21 bp in length) using the hg38 BSGenome reference sequence. We then used motifmatchr with CIS-BP motifs to calculate the motif positions and motif scores (representing the significance of the motif match) using motifmatchr “matchMotifs(positions = “out”, p.cutoff=0.01)” in both the REF and VAR string. Then, we intersected the motif positions with the SNV in both the REF and VAR string to ensure the motifs overlap the SNV. For each motif present, the motif’s score differential was computed if the motif was present at the same position in both the REF and VAR. If there was no motif present at the same position in both the REF and VAR, the differential was compared to a motif score of “0”.

For now, I just wan to know how to constructed a REF and VAR string (+/- 10 bp from the SNV, 21 bp in length). I try to use injectSNPs from BSgenome, but it need the SNPlocs-class, which is hard to get for me. So I am wondering whether some one can give me some other advices.

PS: the bioconductor support link I posted about SNPlocs-class is here

Best wishes

Guandong Shang

SNV motif • 1.5k views
ADD COMMENT
1
Entering edit mode

You may be able to use GATK FastaAlternateReferenceMaker (LINK) for this.

ADD REPLY
0
Entering edit mode

Thanks! this look good :) I will try it.

ADD REPLY
2
Entering edit mode
3.7 years ago
David Parry ▴ 170

With a simple python script:

import sys
import pysam
from pyfaidx import Fasta


def main(vcf, fasta):
    fa = Fasta(fasta, as_raw=True)
    with pysam.VariantFile(vcf) as variants:
        for record in variants:
            if len(record.alleles) != 2: continue
            if len(record.ref) != 1 or len(record.alts[0]) != 1: continue
            ref_string = fa[record.chrom][record.start - 10:record.stop + 10]
            alt_string = ref_string[:10] + record.alts[0] + ref_string[11:]
            print("\t".join((record.chrom, str(record.pos), ref_string, alt_string)))


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: {} input.vcf reference.fa".format(sys.argv[0]))
    main(*sys.argv[1:])

You will need to install pysam and pyfaidx (e.g. python3 -m pip install pysam pyfaidx). The output gives you chromosome, snp position, reference and alternate strings separated by tabs.

ADD COMMENT
0
Entering edit mode

Thanks! I will try it.

ADD REPLY

Login before adding your answer.

Traffic: 1749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6