I'm currently analyzing whole-exome and RNA sequencing data from a cancer cell line and trying to determine how many genes carry deleterious mutations.
I have performed quality control, alignment/mapping (BWA for WES and STAR for RNA-Seq), and variant calling (VarScan).
The resulting VCF file was given as input to Ensembl's Variant Effect Predictor (VEP), and I plan to filter the output so that it contains only SNPs annotated as deleterious.
I quickly examined the HTML file containing summary statistics (the default output provided by VEP) and noticed that the tool reported a large number of overlapped genes/transcripts.
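For reference, this is roughly the kind of filter I have in mind. It is only a sketch: it assumes VEP's default tab-delimited output (gene ID in the 4th column, `key=value` pairs in the final `Extra` column) and that SIFT/PolyPhen predictions were requested when VEP was run. The function names and the example rows are made up for illustration.

```python
def parse_extra(extra):
    """Turn the 'key=value;key=value' Extra column into a dict."""
    if extra in ("", "-"):
        return {}
    return dict(item.split("=", 1) for item in extra.split(";") if "=" in item)


def is_deleterious(fields):
    """Assumed definition: SIFT 'deleterious' or PolyPhen 'probably_damaging'.

    Predictions look like 'deleterious(0.01)', so the term before '(' is compared.
    """
    sift = fields.get("SIFT", "").split("(")[0]
    polyphen = fields.get("PolyPhen", "").split("(")[0]
    return sift == "deleterious" or polyphen == "probably_damaging"


def deleterious_genes(lines):
    """Collect gene IDs with at least one annotation line flagged as deleterious."""
    genes = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        gene, extra = cols[3], cols[13]
        if is_deleterious(parse_extra(extra)):
            genes.add(gene)
    return genes


def make_row(variant, gene, transcript, extra):
    """Build a 14-column row with placeholder values (hypothetical data)."""
    return "\t".join([variant, "1:1000", "A", gene, transcript, "Transcript",
                      "missense_variant", "-", "-", "-", "-", "-", "-", extra])


EXAMPLE = [
    "## hypothetical header line",
    make_row("var1", "GENE_A", "TX_A1", "IMPACT=MODERATE;SIFT=deleterious(0.01)"),
    make_row("var2", "GENE_B", "TX_B1", "IMPACT=MODERATE;SIFT=tolerated(0.4)"),
]

print(deleterious_genes(EXAMPLE))
```

What counts as "deleterious" here (SIFT or PolyPhen, and at which confidence) is my own assumption and could be tightened.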
Should I be concerned with such large numbers? Is there something I am missing or should be looking out for? Any input would be greatly appreciated.
Thank you.
Hello newbio17,
What do you mean by "large numbers", and why does this worry you? If I'm doing WES and RNA sequencing, I would expect (nearly) all of my variants to overlap a transcript of a gene.
Furthermore, AFAIK VEP reports one annotation line for every transcript that overlaps the variant. One gene can have multiple transcripts, so the number of overlapped transcripts can easily exceed the number of genes or variants.
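To make this concrete, here is a rough sketch (not VEP's own code) that counts annotation lines, distinct variants, genes, and transcripts. It assumes VEP's default tab-delimited output, with the variant ID in column 1, the gene in column 4, the feature ID in column 5, and the feature type in column 6. The identifiers in the example are made up.

```python
def summarize(lines):
    """Count annotation lines and the distinct variants/genes/transcripts in them."""
    variants, genes, transcripts = set(), set(), set()
    rows = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        rows += 1
        variants.add(cols[0])
        if cols[3] != "-":
            genes.add(cols[3])
        if cols[5] == "Transcript":
            transcripts.add(cols[4])
    return {
        "lines": rows,
        "variants": len(variants),
        "genes": len(genes),
        "transcripts": len(transcripts),
    }


def make_row(variant, gene, transcript):
    """Build a 14-column row with placeholder values (hypothetical data)."""
    return "\t".join([variant, "1:1000", "A", gene, transcript, "Transcript",
                      "missense_variant", "-", "-", "-", "-", "-", "-", "-"])


# One variant hitting a gene with three transcripts, one hitting a single-transcript gene.
EXAMPLE = [
    make_row("var1", "GENE_A", "TX_A1"),
    make_row("var1", "GENE_A", "TX_A2"),
    make_row("var1", "GENE_A", "TX_A3"),
    make_row("var2", "GENE_B", "TX_B1"),
]

print(summarize(EXAMPLE))
```

In this toy case, two variants produce four annotation lines and four overlapped transcripts but only two genes, which is the pattern I would expect in your statistics as well.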
fin swimmer
Hi finswimmer,
Thank you for your input.
It's my first time working with WES and RNA-Seq data, so everything is new to me. For reference, below are the statistics VEP reported for the run. To clarify, the number of overlapped genes seemed a little high to me relative to the number of variants processed.
General statistics
Honestly, I'm surprised that you sequenced a whole exome and only identified variants in 9564 genes. Given how frequent variants are in any individual, I would have expected you to find variants in nearly every gene.