I am currently working on the analysis of whole exome data. I need to align my reads with reference genome. I am wondering if there is any reference exome out there? As I am using a SureSelect Human All Exon 50Mb Kits, I know what I am currently capturing and sequencing. If there is a reference exome, I could save significant computing time on alignment and further downstream analysis. If no such reference exome available, why it is not a good idea? or What could be the pros and cons on generating one using GENCODE / CCDS / RefSeq ? Have you tried this before?
Of course a reference exome exists. The probes that you are using for exome capture were designed to be complementary to something. Generally, the target is some combination of RefSeq/CCDS/others.
However, as pointed out in other answers, you should map to the whole genome. In addition to the reasons given in other answers, consider that the capture process is not perfect, so you will effectively have a small sample of whole-genome sequencing data mixed in with your exome reads. You wouldn't want those "off-target" reads to be mapped on your exons, since they would likely have lots of mismatches and cause false-positive SNP calls and other havoc.
Anyway, the point is that you should really think about exome sequencing as whole-genome sequencing data that just happens to have deeper coverage near exons, because despite the capture process, there is a small but non-zero chance of getting a read from anywhere in the genome.
agreed. using an adapted reference with the results of a real world experiment will only help to build spurious alignments that may not be real ones. even if the probe set is as perfect as it could, you will always end sequencing parts of the genome that you may not be interested in, and for that reason the reads' origin should not be inferred. I hadn't seen that previous question before, but it reaffirmed my beliefs.
I think the idea is almost the same. I should have read through all exome questions here before posting this. Do you think we should close this question ?
Thanks Pierre and Jorge. I am still not convinced about the aspect of spurious alignments. In protein sequence analysis, I do targeted alignments for the analysis of proteins domains. As I am not interested in unassigned regions, such targeted alignments were more accurate and helpful for my analysis. I think the same could apply in the case of exomes as well.
For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. If you do analyses, you will know.
For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. I am sure you get worse alignment when not mapping to the whole genome. If you do analyses, you will know.
@Heng: I agreed, the protein alignments were driven by homology and exome analysis is focusing on SNP/INDELS. My point is, I have experienced similar scenarios in protein domain alignments. Protein domain alignments are often affected by insertions/deletions. From my experience protein domain sequence vary to great extent within the seed dataset.
@Heng: Thanks for sharing your thoughts. I am currently using the whole genome assembly for the alignment. But, I disagree with your point on extrapolating protein alignment experience to exome alignments. Indels in exome regions will have a direct impact on the underlying proteome, so I still believe my idea is valid.
For mapping, always try to map to the complete genome. Mapping to the whole genome is actually faster than image analyses and base calling. If you can afford base calling, you must have the capacity to do whole genome mapping.
For the downstream analyses, you may consider to extract alignments overlapping the target regions (plus short flanking regions) only. This may make things faster and the output more convenient. You can use bedtools or the latest samtools. If you are calling SNPs with samtools, you may provide a BED file to call variants in target regions only.
There may be the argument of losing the data in off-target regions, but I have never seen people analyze those regions.
As to the exact target regions, Agilent must have given you a file describing all the target regions, or at least the positions of the probes?
If you have bought the kit, Agilent should have told you the target regions in design. I know the regions for one of their kits, but I am not allowed to give it to others.
Well I think it is not necessarily a good idea because an exome is much less stable than a genome. It will be different between different cell types and at different moment.
That being said you could probably just take all exome data currently submitted to repositories and throw on an Amazon grid or so that is big enough and just calculate the average.
Ahh right.... I though you would not only want to have the exon sequence but also the average amount of expression as a reference. Must be my transcriptomics bias. But this is even nicer. Because you could do what I suggested and then for the sequence just couple it to the reference genome as well. Actually it would be interesting to know whether exome and genome yield the same average sequence.
Thanks Chris. Irrespective of expression dynamics in various tissues - the whole exome sequence will be a constant entity, so stability won't be an issue here.
Thanks Chris. Yes, your second suggestion is interesting. But I would rather get an reference exome using the data used to create the preps. But here the general consensus is that it generating an exome may not be a good idea because it may result in spurious alignments.
I am always wary of subsets of the data, and what they really represent. But I'd be particularly wary with what is or is not an exon. When we had a discussion recently about constitutive exons it made me question this again.
I know that not all exons are represented in the human gene collections yet. And the recent modENCODE paper on the fly transcriptome only made it more apparent to me that we don't know what an exon is yet. They found 100,000+ exons, half of them new or revised. 23,000 new splice junctions. "Of the new alternative exons, 8,226 were previously annotated as constitutive."
If I was going to make a call on really expressed exons, I'd go all the way to the EST data to do that. And I also think that's missing a lot of spatial and temporal exons.
Perhaps you are thinking of RNA-seq? For exome sequencing, we use lots of probes to grab genomic regions. The probes are provided by the company and we have little control on this part.
Thanks Ryan for your thoughtful comments !
+1 for the last paragraph. Well said.
+1 for the last two paragraphs. Btw, the link to 'neartarget.pdf' is not valid anymore!