Hi all,
I have carried out transcriptome sequencing on a plant species and, from the resulting assembled contigs, designed capture probes for 970 gene regions. I then carried out hybrid-capture target enrichment on whole genomic DNA, followed by Illumina 100 bp paired-end sequencing, for 95 samples. My aim is to call SNPs within these 970 gene regions across all my samples in order to look at neutral as well as adaptive processes. Before I get to that stage, though, there are a few things I am unsure about:
- When I map the reads for an individual back to the 970 contig sequences the probes were designed from, only 15-30% of the reads map, even with low mapping stringencies such as 50% overlap and 80% similarity. Could this be because the probes are pulling down a lot of sequence outside what I was targeting, such as introns, promoter regions, etc.?
- I do not have a reference genome for this species, so my plan was to map the reads back to the transcriptome I assembled and call variants against that. However, as the transcriptome sequences don't contain any introns, am I going to have problems reliably mapping the captured sequences (which may include parts of introns, promoter regions, etc.) to this transcriptome reference? And could this also be why I seem to be getting a large number of broken pairs in the mappings?
Any help or advice with this would be greatly appreciated!
Thanks,
Matt
Thanks Sean, that's as I suspected. With ~4 million reads per individual, even 20% mapping leaves roughly 800,000 mapped reads, which is still a lot of data, so, as you say, it should still be reasonably usable.
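For anyone wanting to check the same numbers on their own data, here is a rough sketch of how the mapped fraction and the pair status could be tallied from a SAM file. It assumes the mapper can export SAM and relies only on the standard FLAG bits from the SAM specification. The function names and the demo records are made up for illustration, and treating "mate unmapped" and "not flagged as a proper pair" as stand-ins for broken pairs is an assumption, since mappers define that term differently.

```python
from collections import Counter

# FLAG bits as defined in the SAM specification
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarise_sam(lines):
    """Tally mapping and pairing status from an iterable of SAM lines.

    Hypothetical helper: counts each primary record once and skips
    header lines and secondary/supplementary alignments.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or header
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # avoid counting the same read twice
        counts["total"] += 1
        if flag & UNMAPPED:
            counts["unmapped"] += 1
            continue
        counts["mapped"] += 1
        if flag & PAIRED:
            if flag & MATE_UNMAPPED:
                counts["mate_unmapped"] += 1
            elif flag & PROPER_PAIR:
                counts["proper_pair"] += 1
            else:
                counts["improper_pair"] += 1
    return counts


def mapped_fraction(counts):
    """Fraction of primary records that mapped (0.0 if there are none)."""
    return counts["mapped"] / counts["total"] if counts["total"] else 0.0


# Made-up records for illustration only (not real data)
DEMO_SAM = [
    "@SQ\tSN:contig_1\tLN:1200",
    "r1\t99\tcontig_1\t10\t60\t100M\t=\t150\t240\t*\t*",
    "r1\t147\tcontig_1\t150\t60\t100M\t=\t10\t-240\t*\t*",
    "r2\t73\tcontig_1\t900\t60\t100M\t=\t900\t0\t*\t*",
    "r2\t133\tcontig_1\t900\t0\t*\t=\t900\t0\t*\t*",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t65\tcontig_1\t20\t60\t100M\tcontig_2\t40\t0\t*\t*",
    "r4\t129\tcontig_2\t40\t60\t100M\tcontig_1\t20\t0\t*\t*",
]

demo_counts = summarise_sam(DEMO_SAM)
print(dict(demo_counts))
print("mapped fraction: {:.3f}".format(mapped_fraction(demo_counts)))
```

On real data you would pass an open file handle instead of the demo list. A high "mate_unmapped" count at the ends of contigs would fit the idea that one read of a pair sits in exon sequence while its mate falls in an intron or flanking region that is absent from the transcriptome reference.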