Hello,
I'm filtering a VCF file with SnpSift and would like to count the mutations that alter transcription factor binding sites (TFBS), as in Table 1 of this paper: https://www.frontiersin.org/articles/10.3389/fgene.2012.00100/full#h7 (table image: https://www.frontiersin.org/files/Articles/18778/fgene-03-00100-HTML/image_m/fgene-03-00100-t001.jpg).
I would like to get something like that table.
I ran several commands to add annotations, and the VCF file now looks like the following. It has "TF_binding_site_variant" annotations and a VARTYPE field showing SNP/DEL/IND/MNP.
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100225517 MU3692753 A G . . CONSEQUENCE=FRRS1|ENSG00000156869|1|FRRS1-001|ENST00000287474||intron_variant||,FRRS1|ENSG00000156869|1|FRRS1-004|ENST00000370176||intron_variant||,FRRS1|ENSG00000156869|1|FRRS1-201|ENST00000414213||intron_variant||;OCCURRENCE=LIRI-JP|1|258|0.00388;affected_donors=1;mutation=A>G;project_count=1;studies=PCAWG;tested_donors=12198;ANN=G|TF_binding_site_variant|LOW|||FOXA2|MA0047.2|||n.100225517T>C||||||,G|TF_binding_site_variant|LOW|||FOXA1|MA0148.1|||n.100225517T>C||||||,G|intron_variant|MODIFIER|FRRS1|ENSG00000156869|transcript|ENST00000287474|protein_coding|1/16|c.-106+5336T>C||||||,G|intron_variant|MODIFIER|FRRS1|ENSG00000156869|transcript|ENST00000414213|protein_coding|1/16|c.-106+5336T>C||||||,G|intron_variant|MODIFIER|FRRS1|ENSG00000156869|transcript|ENST00000370176|retained_intron|1/2|n.25+6646T>C||||||;SNP;HOM;VARTYPE=SNP
1 100274466 MU2855033 T C . . CONSEQUENCE=||||||intergenic_region||;OCCURRENCE=LIRI-JP|1|258|0.00388;affected_donors=1;mutation=T>C;project_count=1;studies=PCAWG;tested_donors=12198;ANN=C|TF_binding_site_variant|LOW|||Srf|MA0083.1|||n.100274466A>G||||||,C|intergenic_region|MODIFIER|Y_RNA-AL451051.1|ENSG00000202254-ENSG00000252226|intergenic_region|ENSG00000202254-ENSG00000252226|||n.100274466T>C||||||;SNP;HOM;VARTYPE=SNP
1 101774964 MU78905029 T G . . CONSEQUENCE=||||||intergenic_region||;OCCURRENCE=LIRI-JP|1|258|0.00388;affected_donors=1;mutation=T>G;project_count=1;studies=PCAWG;tested_donors=12198;ANN=G|TF_binding_site_variant|MODIFIER|||CTCF|MA0139.1|||n.101774964T>G||||||,G|intergenic_region|MODIFIER|PPIAP7-RP11-157N3.1|ENSG00000173810-ENSG00000231671|intergenic_region|ENSG00000173810-ENSG00000231671|||n.101774964T>G||||||;SNP;HOM;VARTYPE=SNP
1 101774966 MU3316414 A C . . CONSEQUENCE=||||||intergenic_region||;OCCURRENCE=LIRI-JP|1|258|0.00388;affected_donors=1;mutation=A>C;project_count=1;studies=PCAWG;tested_donors=12198;ANN=C|TF_binding_site_variant|MODIFIER|||CTCF|MA0139.1|||n.101774966A>C||||||,C|intergenic_region|MODIFIER|PPIAP7-RP11-157N3.1|ENSG00000173810-ENSG00000231671|intergenic_region|ENSG00000173810-ENSG00000231671|||n.101774966A>C||||||;SNP;HOM;VARTYPE=SNP
I checked a few filtering steps in the documentation, but couldn't find anything that reports the number of mutations of each type that affect a TFBS.
I tried something like this, but it didn't work (just to check how many variants of type deletion alter transcription factor binding sites):
cat input.vcf | java -jar SnpSift.jar filter "((exists DEL) & (ANN[*].EFFECT)" > eg.vcf
I need help with this. Thank you!
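For reference, the expression in the command above has an unclosed parenthesis and never compares ANN[*].EFFECT to a value. A minimal sketch of a balanced form follows. It assumes the DEL flag was added by SnpSift varType and that the effect name is exactly TF_binding_site_variant, as in the records shown. The variable names and the output file name are made up for illustration, and SnpSift is only invoked when java, SnpSift.jar and input.vcf are all present.

```shell
#!/bin/sh
# Effect test: true if any ANN entry carries the TFBS effect.
TFBS_EXPR="ANN[*].EFFECT has 'TF_binding_site_variant'"

# Combine with the DEL flag (assumed to come from "SnpSift varType").
DEL_TFBS_EXPR="(exists DEL) & ($TFBS_EXPR)"

echo "filter expression: $DEL_TFBS_EXPR"

# Run SnpSift only if everything needed is available in this directory.
if command -v java >/dev/null 2>&1 && [ -f SnpSift.jar ] && [ -f input.vcf ]; then
    # del_tfbs.vcf is a hypothetical output name.
    java -jar SnpSift.jar filter "$DEL_TFBS_EXPR" input.vcf > del_tfbs.vcf
fi
```

On the four records shown in the question this should return no records, because all of them are SNPs.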
Maybe I'm wrong, but I don't think SnpEff/SnpSift can annotate a VCF at this level of precision (e.g. a "TFB context"). Those tools are "just" able to do some basic annotation, e.g. the terms under: http://www.sequenceontology.org/browser/release_2.5/term/SO:0001564
But you can see this in the few VCF lines above:
ANN=C|TF_binding_site_variant|LOW|||Srf|MA0083.1|||n.100274466A>G||||||,C|intergenic_region|MODIFIER|Y_RNA-AL451051.1|ENSG00000202254-ENSG00000252226|intergenic_region|ENSG00000202254-ENSG00000252226|||n.100274466T>C||||||;SNP;HOM;VARTYPE=SNP
Here [TF_binding_site_variant|LOW|||Srf|MA0083.1] corresponds to motif MA0083.1, which you can look up in the JASPAR database.
So I would like to count the number of mutations of each type that alter a TFBS or motif.
You can check this in the SnpEff documentation: Additional Annotations, under the Motif subheading (http://snpeff.sourceforge.net/SnpEff_manual.html#run)
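To make the ANN layout above concrete, here is a rough sketch that lists the TF name and motif ID for every TF_binding_site_variant entry. It assumes a tab-separated VCF with INFO in column 8, and that the TF name and motif ID sit in the 6th and 7th pipe-separated ANN subfields, which is what the example records show but may differ for other SnpEff versions or databases. The function name list_tfbs_motifs is made up for illustration.

```shell
#!/bin/sh
# list_tfbs_motifs (hypothetical helper): print "CHROM POS TF MOTIF"
# for each TF_binding_site_variant entry found in the ANN field.
list_tfbs_motifs() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            n = split($8, info, ";")           # INFO key=value pairs
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")   # one entry per annotation
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "|")      # ANN subfields
                    # Assumption: subfield 6 = TF name, 7 = motif ID.
                    if (f[2] == "TF_binding_site_variant")
                        print $1, $2, f[6], f[7]
                }
            }
        }
    ' "$@"
}
```

Called as list_tfbs_motifs input.vcf, it prints one line per TFBS annotation, so a record that hits two motifs appears twice.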
ok so I'm wrong :-)
This is never a good description. What did you expect? What result did you get instead?
Please post a full VCF example including the header.
fin swimmer
@OP: All the example VCF records you furnished above are SNVs, and I am not sure whether any of those SNVs leads to a deletion of anything. You should be looking at INDELs in your VCF. Example filtering that worked on an example annotation using SnpSift:
output:
input:
Yes, I do see that in the SnpEff documentation. But I want to find which mutations alter a TFBS/motif.
Since you are looking for numbers (not records, if I understand correctly), just do a grep and count (on the OP records, it should give 2):
If you are looking for records, use the following filter on the OP VCF (two records will be listed):
output using OP records:
If you would like to filter for any variant with a TF_binding effect, use:
No, this is not what I mean. You can see that the input also has VARTYPE = SNP/IND/DEL/MNP. What I want is to count the number of variants of each type that alter a TFBS/motif. It should give something like this [see the first two columns - https://www.frontiersin.org/files/Articles/18778/fgene-03-00100-HTML/image_m/fgene-03-00100-t001.jpg]
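One possible way to get such a per-type tally without SnpSift is a small awk pass over the annotated file. This is only a sketch: it assumes a tab-separated VCF, that VARTYPE was added to INFO (for example by SnpSift varType), and that a TFBS hit is marked by the string TF_binding_site_variant inside ANN. The function name count_tfbs_by_vartype is made up for illustration.

```shell
#!/bin/sh
# count_tfbs_by_vartype (hypothetical helper): for records whose ANN field
# mentions TF_binding_site_variant, tally them by their VARTYPE value.
count_tfbs_by_vartype() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header lines
        $8 ~ /(^|;)ANN=[^;]*TF_binding_site_variant/ {
            vt = "NA"                                  # used if VARTYPE is absent
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^VARTYPE=/) vt = substr(info[i], 9)
            count[vt]++                                # one count per record
        }
        END { for (vt in count) print vt "\t" count[vt] }
    ' "$@" | sort
}
```

Each record is counted once, even if it overlaps several motifs. Run on the four records from the question (all SNPs carrying a TFBS annotation), count_tfbs_by_vartype input.vcf would print a single line, SNP and 4.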
If you are looking for a summary, then look into the summary.html from SnpEff.