I have a callset from whole-genome data, and I want to turn it into an exome callset by extracting variants with an exome target interval list. I obtained two exome target lists, one from the 1KG project (Phase 3) and the Twist Exome target (https://www.twistbioscience.com/resources/bed-file/ngs-human-core-exome-panel-bed-files). I have some questions:
Are these exome intervals appropriate for extracting exonic variants? If so, which one should I use? I subset my VCF file with both lists: with the 1KG list I got 68463 variants, while with the Twist Exome target I got 21082.
Looking at the annotations for the subset (regardless of whether it was made with 1KG or Twist), I get variants annotated as intronic even though they are tagged with protein-coding transcripts. Does this make sense for an exome target list?
There is no universal, consensus whole-exome annotation. There are, however, various platforms and versions of whole-exome library kits, each with its accompanying annotation. These kit-specific annotations are built against a particular human genome build and gene annotation release. You need to subset your whole-genome calls against the annotation of the kit in question, which means either your calls must be on the same genome build as the kit annotation, or you have to convert the coordinates between builds (e.g., with liftOver).
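To make the subsetting step concrete, here is a minimal pure-Python sketch, assuming an uncompressed VCF and a BED target file on the same genome build with matching chromosome names ("chr1" vs "1"). The function names (`load_bed`, `in_targets`, `subset_vcf`) are made up for illustration. It only tests the POS of each record, so a deletion that starts just outside a target and extends into it would be missed. The main thing it shows is the coordinate convention: BED is 0-based and half-open, VCF POS is 1-based.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}.

    BED coordinates are 0-based and half-open.
    """
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_targets(targets, chrom, pos):
    """Return True if the 1-based VCF position falls inside a target."""
    intervals = targets.get(chrom)
    if not intervals:
        return False
    pos0 = pos - 1  # convert 1-based VCF POS to a 0-based coordinate
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def subset_vcf(vcf_lines, targets):
    """Yield header lines plus records whose POS lies in a target."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_targets(targets, chrom, int(pos)):
            yield line
```

You would feed it open file handles, e.g. `load_bed(open("targets.bed"))` (file name hypothetical). In practice, established tools such as bcftools or bedtools do the same job and handle compressed, indexed input, but the logic above is what "subsetting against a target list" amounts to.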
Sometimes the upstream genome annotation is updated while kit manufacturers lag behind and keep an outdated annotation. This could explain the intron / coding discrepancies you observed.
For the 1KG target list, I converted the coordinates to Hg38, while the Twist interval list is already in Hg38 coordinates. My VCF file was annotated with dbSNP build 138. Is it possible that the presence of intronic variants is due to the dbSNP version?
Thank you very much for your answer!
It is possible, but you would have to check whether this is the case. It could also be an error in the coordinate conversion.
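One rough way to look for conversion problems is to compare the target list before and after the lift. The sketch below assumes that column 4 of both BED files holds a unique target name that survived the conversion, which may not be true for your 1KG list; the function names are hypothetical. It reports targets that disappeared, targets whose width changed, and targets that landed on a different chromosome.

```python
def parse_named_bed(lines):
    """Return {name: (chrom, start, end)} from BED lines with a name in column 4."""
    out = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split()[:4]
        # Strip a leading "chr" so "1" and "chr1" compare as equal.
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        out[name] = (chrom, int(start), int(end))
    return out


def compare_lifted(original, lifted):
    """Compare targets before and after coordinate conversion.

    Returns three sorted lists of target names:
    missing (not present after the lift), resized (interval width changed),
    and moved (ended up on a different chromosome).
    """
    shared = original.keys() & lifted.keys()
    missing = sorted(set(original) - set(lifted))
    resized = sorted(
        name for name in shared
        if original[name][2] - original[name][1]
        != lifted[name][2] - lifted[name][1]
    )
    moved = sorted(
        name for name in shared if original[name][0] != lifted[name][0]
    )
    return missing, resized, moved
```

A handful of missing or resized targets is not unusual between builds, but a large number of them, or many targets changing chromosome, would suggest the conversion went wrong.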