What should be taken care of in the bioinformatic analysis pipeline for targeted resequencing, compared with whole-genome resequencing? In particular, given the differences between targeted and whole-genome resequencing data, what can be done to optimise the pipeline for targeted resequencing? For example, will using the --intervals parameter in GATK improve performance? Thank you.
Hi, Zhilong,
Are you targeting the whole exome or a gene panel? Are you trying to identify common SNPs or rare variants?
Info on genome vs exome bioinformatic analysis can be found here.
I worked a bit on the detection of rare variants and did some research that ended up in a review; you can find it here.
Hope this is a good starting point!
Hello,
I would be interested in the reverse question, as I have only done targeted resequencing so far. Why should there be any differences in the bioinformatics pipeline?
fin swimmer
You are right, I forgot to mention that when I did the targeted resequencing for rare variants I was using POOLED data. If you have individual sequence data, then the pipeline will not differ. Actually, if possible you should align to the whole genome anyway, so that any off-target reads that were generated do not align spuriously to your target regions. Also, several people filter polymorphic positions based on coverage (i.e. positions with very high or very low coverage are discarded). This is fine, but with targeted resequencing you have to take into account that coverage may vary more, and so allow larger deviations (the numbers really depend on your data). I am sorry for any confusion my previous answer may have caused!
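To make the coverage point a bit more concrete, here is a minimal Python sketch of that kind of filter. It is only an illustration under my own assumptions, not part of GATK or any other tool: the function names, the `fold` tolerance, the BED-style target tuples and the toy depths are all made up for the example. It keeps only positions inside the targets, then discards positions whose depth is more than `fold` times above or below the on-target median.

```python
from statistics import median

def in_targets(chrom, pos, targets):
    """Return True if a 1-based position lies in any target.

    Targets are (chrom, start, end) tuples in BED style
    (0-based start, end-exclusive). This layout is an assumption.
    """
    return any(c == chrom and start < pos <= end for c, start, end in targets)

def depth_bounds(depths, fold=4.0):
    """Return (low, high) depth cut-offs as median/fold and median*fold.

    'fold' is a hypothetical tolerance; a larger value allows larger
    deviations, which is what targeted data may need.
    """
    m = median(depths)
    return m / fold, m * fold

def filter_sites(sites, targets, fold=4.0):
    """Keep on-target sites whose depth is within the fold-based bounds.

    Each site is a (chrom, pos, depth) tuple.
    """
    on_target = [s for s in sites if in_targets(s[0], s[1], targets)]
    if not on_target:
        return []
    low, high = depth_bounds([s[2] for s in on_target], fold)
    return [s for s in on_target if low <= s[2] <= high]

# Toy data, invented for illustration only.
targets = [("chr1", 100, 200)]
sites = [
    ("chr1", 150, 80),
    ("chr1", 160, 100),
    ("chr1", 170, 120),
    ("chr1", 180, 2),     # very low coverage
    ("chr1", 190, 900),   # very high coverage
    ("chr1", 5000, 100),  # off-target position
]

strict = filter_sites(sites, targets, fold=4.0)
relaxed = filter_sites(sites, targets, fold=10.0)
print(len(strict), len(relaxed))
```

With the stricter setting both extreme positions are dropped; with the relaxed one the high-coverage position survives. Which tolerance is sensible really depends on your own depth distribution, so look at that before picking a number.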