Is There Any Reference Exome ?
5
6
Entering edit mode
13.6 years ago

I am currently working on the analysis of whole exome data. I need to align my reads with reference genome. I am wondering if there is any reference exome out there? As I am using a SureSelect Human All Exon 50Mb Kits, I know what I am currently capturing and sequencing. If there is a reference exome, I could save significant computing time on alignment and further downstream analysis. If no such reference exome available, why it is not a good idea? or What could be the pros and cons on generating one using GENCODE / CCDS / RefSeq ? Have you tried this before?

Looking forward for your thoughts.

exome next-gen-sequencing • 12k views
ADD COMMENT
9
Entering edit mode
13.6 years ago
Ryan Thompson ★ 3.6k

Of course a reference exome exists. The probes that you are using for exome capture were designed to be complementary to something. Generally, the target is some combination of RefSeq/CCDS/others.

However, as pointed out in other answers, you should map to the whole genome. In addition to the reasons given in other answers, consider that the capture process is not perfect, so you will effectively have a small sample of whole-genome sequencing data mixed in with your exome reads. You wouldn't want those "off-target" reads to be mapped on your exons, since they would likely have lots of mismatches and cause false-positive SNP calls and other havoc.

Also, there is not a well-defined boundary between "on-target" and "off-target" -- coverage tails off gradually as you get away from the probed regions: http://mneme.dyndns.org/~ryan/examples/neartarget.pdf

Anyway, the point is that you should really think about exome sequencing as whole-genome sequencing data that just happens to have deeper coverage near exons, because despite the capture process, there is a small but non-zero chance of getting a read from anywhere in the genome.

ADD COMMENT
0
Entering edit mode

Thanks Ryan for your thoughtful comments !

ADD REPLY
0
Entering edit mode

+1 for the last paragraph. Well said.

ADD REPLY
0
Entering edit mode

+1 for the last two paragraphs. Btw, the link to 'neartarget.pdf' is not valid anymore!

ADD REPLY
5
Entering edit mode
13.6 years ago

Khader, I wonder if your idea is not the same (bad) idea that I had a few weeks ago: Exome Sequencing: Masking The Non-Genic Sequences ? "Exome sequencing: masking the non-genic sequences ?"

ADD COMMENT
2
Entering edit mode

agreed. using an adapted reference with the results of a real world experiment will only help to build spurious alignments that may not be real ones. even if the probe set is as perfect as it could, you will always end sequencing parts of the genome that you may not be interested in, and for that reason the reads' origin should not be inferred. I hadn't seen that previous question before, but it reaffirmed my beliefs.

ADD REPLY
1
Entering edit mode

no, it's not really the same question but it's the same conclusion.

ADD REPLY
0
Entering edit mode

I think the idea is almost the same. I should have read through all exome questions here before posting this. Do you think we should close this question ?

ADD REPLY
0
Entering edit mode

Thanks Pierre and Jorge. I am still not convinced about the aspect of spurious alignments. In protein sequence analysis, I do targeted alignments for the analysis of proteins domains. As I am not interested in unassigned regions, such targeted alignments were more accurate and helpful for my analysis. I think the same could apply in the case of exomes as well.

ADD REPLY
0
Entering edit mode

Glad you didn't close this; this is generating a lot of interesting discussion even if the idea is ultimately considered not a good one.

ADD REPLY
0
Entering edit mode

For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. If you do analyses, you will know.

ADD REPLY
0
Entering edit mode

For proteins you care about homology. For exome, you care about SNPs/INDELs. Sequences aligned together must be related in the past, but you would not want to call SNPs. The two tasks are different. For exome sequencing, there are lots of off-target reads. You can never trust captures. I am sure you get worse alignment when not mapping to the whole genome. If you do analyses, you will know.

ADD REPLY
0
Entering edit mode

@Heng: I agreed, the protein alignments were driven by homology and exome analysis is focusing on SNP/INDELS. My point is, I have experienced similar scenarios in protein domain alignments. Protein domain alignments are often affected by insertions/deletions. From my experience protein domain sequence vary to great extent within the seed dataset.

ADD REPLY
0
Entering edit mode

My point is, they are very different scenarios. Experiences can hardly be transferred.

ADD REPLY
0
Entering edit mode

@Heng: Thanks for sharing your thoughts. I am currently using the whole genome assembly for the alignment.

ADD REPLY
0
Entering edit mode

@Heng: Thanks for sharing your thoughts. I am currently using the whole genome assembly for the alignment. But, I disagree with your point on extrapolating protein alignment experience to exome alignments. Indels in exome regions will have a direct impact on the underlying proteome, so I still believe my idea is valid.

ADD REPLY
3
Entering edit mode
13.6 years ago
lh3 33k

For mapping, always try to map to the complete genome. Mapping to the whole genome is actually faster than image analyses and base calling. If you can afford base calling, you must have the capacity to do whole genome mapping.

For the downstream analyses, you may consider to extract alignments overlapping the target regions (plus short flanking regions) only. This may make things faster and the output more convenient. You can use bedtools or the latest samtools. If you are calling SNPs with samtools, you may provide a BED file to call variants in target regions only.

There may be the argument of losing the data in off-target regions, but I have never seen people analyze those regions.

As to the exact target regions, Agilent must have given you a file describing all the target regions, or at least the positions of the probes?

ADD COMMENT
1
Entering edit mode

If you have bought the kit, Agilent should have told you the target regions in design. I know the regions for one of their kits, but I am not allowed to give it to others.

ADD REPLY
0
Entering edit mode

Thanks for your detailed reply. I heard from a collegue that probe/target data is available from Agilent - I am searching for a direct link. See: http://seqanswers.com/forums/showthread.php?t=5803 and Human Exome Capture Library Coordinates Download

ADD REPLY
0
Entering edit mode

Yeah, I just got the target region from Agilent.

ADD REPLY
2
Entering edit mode
13.6 years ago

Well I think it is not necessarily a good idea because an exome is much less stable than a genome. It will be different between different cell types and at different moment.

That being said you could probably just take all exome data currently submitted to repositories and throw on an Amazon grid or so that is big enough and just calculate the average.

ADD COMMENT
2
Entering edit mode

Ahh right.... I though you would not only want to have the exon sequence but also the average amount of expression as a reference. Must be my transcriptomics bias. But this is even nicer. Because you could do what I suggested and then for the sequence just couple it to the reference genome as well. Actually it would be interesting to know whether exome and genome yield the same average sequence.

ADD REPLY
1
Entering edit mode

+1. The idea itself is interesting. But it is capture determined. Also, you will end up with >100Mb (if not much longer) target sequences this way.

ADD REPLY
0
Entering edit mode

Thanks Chris. Irrespective of expression dynamics in various tissues - the whole exome sequence will be a constant entity, so stability won't be an issue here.

ADD REPLY
0
Entering edit mode

Thanks Chris. Yes, your second suggestion is interesting. But I would rather get an reference exome using the data used to create the preps. But here the general consensus is that it generating an exome may not be a good idea because it may result in spurious alignments.

ADD REPLY
0
Entering edit mode
13.6 years ago
Mary 11k

I am always wary of subsets of the data, and what they really represent. But I'd be particularly wary with what is or is not an exon. When we had a discussion recently about constitutive exons it made me question this again.

I know that not all exons are represented in the human gene collections yet. And the recent modENCODE paper on the fly transcriptome only made it more apparent to me that we don't know what an exon is yet. They found 100,000+ exons, half of them new or revised. 23,000 new splice junctions. "Of the new alternative exons, 8,226 were previously annotated as constitutive."

If I was going to make a call on really expressed exons, I'd go all the way to the EST data to do that. And I also think that's missing a lot of spatial and temporal exons.

IMHO.

Fly paper: http://www.ncbi.nlm.nih.gov/pubmed/21179090

ADD COMMENT
1
Entering edit mode

Perhaps you are thinking of RNA-seq? For exome sequencing, we use lots of probes to grab genomic regions. The probes are provided by the company and we have little control on this part.

ADD REPLY
0
Entering edit mode

Perhaps you are thinking of RNA-seq?

ADD REPLY

Login before adding your answer.

Traffic: 2623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6