Dear peers,
I have WGS results in FASTQ, BAM and VCF formats that are to be interpreted using a commercial analysis platform. For the sake of cost-effectiveness, I have to restrict the data to coding regions, sort of mimicking WES. What would be the best way to do it?
So far I've come up with a preliminary solution: extract from the VCF only those variants in coding exons of canonical transcripts ±12 intronic bp. A few questions:
How do I create such a BED file? Does one already exist? Apart from the technical side of creating such a file, I'm confused by the lack of consensus on canonical transcripts, not to mention the difference in coordinates between UCSC and Ensembl. Should I use the MANE, LRG, APPRIS P1, Ensembl Golden or TSL:1 transcripts, or the ones at the intersection of these datasets?
Can I use the same approach to extract the coding portion of a BAM file? How should I do it?
Thank you for any suggestions. Cheers, Vera
In general, this approach does not seem right to me. However, you can take all the transcripts from UCSC, intersect them with your variants using bedtools intersect, and "limit" your analysis to the regions of interest.
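To make the idea concrete, here is a rough Python sketch of the padding and filtering logic, not a replacement for a dedicated tool. It assumes you already have coding exon coordinates as BED-style (chrom, start, end) tuples (0-based, half-open) from whichever transcript set you settle on. The function names and the toy coordinates are made up for illustration. It only tests the VCF POS field, so a deletion that starts just outside a region and runs into it would be missed, and the chromosome names in the BED and the VCF must use the same style ("chr1" vs "1").

```python
from bisect import bisect_right
from collections import defaultdict

PAD = 12  # intronic flank in bp, as proposed in the question


def pad_and_merge(exons, pad=PAD):
    """Pad each exon by `pad` bp on both sides and merge overlaps.

    exons: iterable of (chrom, start, end), BED-style (0-based, half-open).
    Returns {chrom: [(start, end), ...]}, sorted and non-overlapping.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlaps or touches the previous one
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def in_regions(regions, chrom, pos):
    """True if the 1-based VCF position `pos` falls inside a region."""
    intervals = regions.get(chrom, [])
    p0 = pos - 1                              # convert to 0-based
    # Last interval whose start is <= p0 (intervals are sorted, non-overlapping).
    i = bisect_right(intervals, (p0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p0 < intervals[i][1]


def filter_vcf_lines(lines, regions):
    """Yield header lines plus records whose POS lies in `regions`."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(regions, chrom, int(pos)):
            yield line


if __name__ == "__main__":
    # Toy exons, purely illustrative coordinates.
    toy_exons = [("chr1", 100, 200), ("chr1", 205, 300), ("chr2", 50, 60)]
    regions = pad_and_merge(toy_exons)
    toy_vcf = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "chr1\t89\t.\tA\tG\n",
        "chr1\t500\t.\tC\tT\n",
    ]
    for kept in filter_vcf_lines(toy_vcf, regions):
        print(kept, end="")
```

For real data you would read the exons from your BED file and stream the VCF through `filter_vcf_lines`. For large or compressed files a dedicated tool is likely the better choice.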
2 - Why would you do this? What will it change? Your VCF, in theory, should remain the same.
Several members have invested effort in this question, so it is bad practice to delete it. Others might benefit from it. Just leave it as it is.