Question

Mapping target-enriched reads to a transcriptome reference

10.2 years ago

matt.christmas85 ▴ 10

Hi all,

I have carried out transcriptome sequencing on a plant species and then, from the resulting assembled contigs, have designed capture probes for 970 gene regions. I then carried out hybrid-capture target-enrichment on whole genomic DNA followed by Illumina 100bp paired-end sequencing for 95 samples. My aim is to call SNP variants within these 970 gene regions among all my samples in order to look at neutral as well as adaptive processes. Before I get to this stage though there are a few things I am unsure about:

1. When I map the reads for an individual back to the 970 contig sequences the probes were designed on, I only get 15-30% of the reads mapping back, even with low mapping stringencies such as 50% overlap and 80% similarity. Could this be a result of the probes pulling out a lot of stuff outside of what I was targeting, such as introns, promoter regions, etc.?
2. I do not have a reference genome for this species, so my plan was to map the reads back to the transcriptome I assembled and call variants based on that. However, as the transcriptome sequences don't contain any introns, am I going to have issues with reliably mapping the captured sequences (which may contain parts of introns, promoter regions, etc.) to this transcriptome reference? And could this also be why I seem to be getting a large number of broken pairs in the mappings?
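
Both symptoms described above (the low mapping rate and the broken pairs) can be quantified per sample by tallying SAM flags. The following is a minimal sketch, not a replacement for a dedicated tool. It assumes the alignments are available as SAM text, and it uses an assumed working definition of a "broken pair": a mapped, paired read whose mate is unmapped or whose pair the aligner did not flag as proper. The function name `summarize_sam` is hypothetical.

```python
from collections import Counter

# Flag bits as defined in the SAM specification
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Tally mapped reads and broken pairs from an iterable of SAM lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        counts["reads"] += 1
        if flag & UNMAPPED:
            continue
        counts["mapped"] += 1
        # Assumed definition of "broken": mate unmapped, or the pair was
        # not marked as properly paired by the aligner.
        if flag & PAIRED and (flag & MATE_UNMAPPED or not flag & PROPER_PAIR):
            counts["broken"] += 1
    reads, mapped = counts["reads"], counts["mapped"]
    return {
        "reads": reads,
        "mapped_fraction": mapped / reads if reads else 0.0,
        "broken_fraction": counts["broken"] / mapped if mapped else 0.0,
    }
```

If the broken fraction is high while the mapped mates sit near contig ends or near positions where many reads are clipped, that would be consistent with mates landing in intronic or flanking sequence that the transcriptome reference does not contain.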

Any help/advice with this would be greatly appreciated!

Thanks,
Matt

paired-end target-enrichment mapping transcriptome • 2.2k views

updated 2.9 years ago by Ram 44k • written 10.2 years ago by matt.christmas85 ▴ 10

Ram · Answer 1 · 2014-10-01

10.2 years ago

Sean Davis 27k

Capture efficiencies of hybrid capture vary, but a capture efficiency of 50% or so would not be unusual. That, combined with the fact that you are mapping genomic DNA reads back to an RNA-derived reference, could very easily lead to the mapping issues you are seeing. While it is disheartening to see so much of your data falling through the cracks, I suspect that the data that do map are reasonably usable (with the caveat that there may be a significant false-positive rate for SNPs at exon boundaries, where reads run into intronic sequence that is absent from the reference).
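
One possible way to act on the exon-boundary caveat, sketched here under stated assumptions rather than as an established method: if the aligner soft-clips reads where they run off the end of an exon, positions where many reads are clipped at the same reference coordinate can be treated as putative exon boundaries, and SNP calls close to them can be flagged for review. The function names, the `min_support` threshold, and the `window` size are all hypothetical choices for illustration.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(pos, cigar):
    """Return 1-based reference positions of aligned bases next to a soft clip."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    points = []
    if ops and ops[0][1] == "S":
        points.append(pos)  # first aligned base; the clip lies to its left
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    if len(ops) > 1 and ops[-1][1] == "S":
        points.append(pos + ref_len - 1)  # last aligned base; clip to its right
    return points


def putative_boundaries(records, min_support=3):
    """records: iterable of (contig, pos, cigar). Keep well-supported clip sites."""
    support = Counter()
    for contig, pos, cigar in records:
        for point in clip_breakpoints(pos, cigar):
            support[(contig, point)] += 1
    return {site for site, n in support.items() if n >= min_support}


def near_boundary(contig, snp_pos, boundaries, window=5):
    """True if a SNP lies within `window` bp of a putative boundary."""
    return any(c == contig and abs(p - snp_pos) <= window for c, p in boundaries)
```

Clipping also happens for other reasons (adapter remnants, low-quality read ends), so requiring several reads to agree on the same coordinate is only a rough guard, and the thresholds would need tuning against real data.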

updated 2.9 years ago by Ram 44k • written 10.2 years ago by Sean Davis 27k

Thanks Sean, as I suspected. When you've got ~4 million reads per individual, 20% mapping is still a lot of data so, as you say, it will still be reasonably usable.
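
As a rough back-of-the-envelope check on "still a lot of data", the numbers in this thread (~4 million reads, 20% mapping, 100 bp reads, 970 targets) can be turned into an approximate mean depth. The mean target length of 1,000 bp used below is purely an assumption for illustration, since the thread does not give contig lengths, and `expected_depth` is a hypothetical helper.

```python
def expected_depth(total_reads, mapped_fraction, read_length, target_bp):
    """Return (mapped_reads, mean_depth) assuming perfectly even coverage."""
    mapped_reads = total_reads * mapped_fraction
    mean_depth = mapped_reads * read_length / target_bp
    return mapped_reads, mean_depth


# Figures from the thread, plus an ASSUMED mean target length of 1,000 bp
ASSUMED_MEAN_TARGET_BP = 1_000
mapped, depth = expected_depth(
    total_reads=4_000_000,
    mapped_fraction=0.20,
    read_length=100,
    target_bp=970 * ASSUMED_MEAN_TARGET_BP,
)
print(f"{mapped:,.0f} mapped reads, ~{depth:.0f}x mean depth")
```

Under that assumption this works out to about 800,000 mapped reads per individual. Real coverage will be uneven across targets, and duplicates and clipped bases reduce the usable depth, so this is an upper-end estimate.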

updated 2.9 years ago by Ram 44k • written 10.2 years ago by matt.christmas85 ▴ 10