How to handle off-target regions in amplicon panel analysis?
7.3 years ago
lamteva.vera ▴ 220

Dear biostars inhabitants!

I'm trying to figure out how to analyse data from Illumina's TruSeq Custom Amplicon panel.

The manifest file that Illumina provides for the panel includes Probes and Targets sections, with 788 and 822 entries, respectively. According to the documentation, the Targets section includes "an expected off-target region" "in addition to the submitted genomic region".

As far as I understand, these 34 expected off-target regions are regions that are highly likely to bind primer pairs originally designed for the regions of interest. Thus, some of the targeted regions are not actually well covered by the panel, since it's hard to map the amplicons unambiguously. Correct me if I'm wrong.

I'm looking for your expert advice: how can I use the information about predicted off-targets in sequencing data analysis? Should I exclude such regions from the interval list used to restrict variant calling?

Thank you for your time. Have a nice day!

off-target targeted resequencing • 3.4k views
7.3 years ago

Your assumptions are correct, and it's important to realise that a large chunk of the human genome exhibits some level of homology (generally, for a sequence of DNA to be regarded as homologous, it must exhibit 30% similarity to another or other regions). Don't quote me, but I read somewhere that >50% of all genes have a processed or unprocessed pseudogene elsewhere in the genome. Thus, the 'expected off-target' regions provided by Illumina for your panel are most likely these regions that exhibit high homology to your primary target regions of interest.

All of this poses great problems for alignment tools, which have to faithfully map each read to a position in the genome. If a read maps to >1 location, its mapping quality will suffer; if it maps to just a single region, it will generally have a high mapping quality. Base errors in each read do not help either, as they further reduce mapping quality and make the aligner's task more difficult.

This issue also partly explains the very uneven depth-of-coverage profile that you get with this type of sequencing, whereby one region may have >1000 reads mapped to it, whereas others may have just 20 (other reads that could have mapped there were 'robbed' by homologous regions during PCR amplification and/or during in silico alignment).

From my experience of targeted sequencing using Illumina's kits, the proportion of off-target reads is generally 30-40% (i.e. 30-40% of reads in each sample will map to regions outside of the primary regions of interest). There is not much that you can do about this other than work with Illumina to try to improve the panel design.

Many regions of the genome are just not suited for massively parallel sequencing using short reads - the data from these regions just cannot be trusted due to the fact that such regions exhibit high homology to others in the genome. The way to tackle these is with long-range PCR or Sanger sequencing, where you can design primers far outside your region of interest in a region of unique sequence.

From an analysis perspective, the way that I manage this issue specifically is by:

  • Trimming bases off the ends of reads that fall below a Phred-scaled quality score of 30
  • Eliminating short reads (<50 or 70bp)
  • Only including uniquely-mapped reads (Bowtie allows this) or filtering out reads with MAPQ<40 or 50 (BWA)
  • Using a BED file to filter out all reads or variants called in the off-target regions
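
To make the read-level filters in the list above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Read` record, the `OFF_TARGET` intervals and the default thresholds are all hypothetical, and in practice you would read the alignments from a BAM file and the intervals from the off-target BED file.

```python
from typing import NamedTuple

class Read(NamedTuple):
    """Hypothetical, simplified alignment record (not a real BAM parser)."""
    name: str
    chrom: str
    start: int   # 0-based start of the alignment, as in BED
    end: int     # end-exclusive, as in BED
    length: int  # read length in bp after trimming
    mapq: int    # mapping quality reported by the aligner

# Made-up off-target intervals in BED style: (chrom, start, end)
OFF_TARGET = [("chr1", 1000, 1200), ("chr7", 500, 650)]

def overlaps(read, intervals):
    """True if the alignment overlaps any interval on the same chromosome."""
    return any(
        read.chrom == chrom and read.start < end and start < read.end
        for chrom, start, end in intervals
    )

def keep_read(read, off_target, min_mapq=40, min_len=50):
    """Apply the MAPQ, read-length and off-target filters in turn."""
    if read.mapq < min_mapq:
        return False
    if read.length < min_len:
        return False
    return not overlaps(read, off_target)

reads = [
    Read("r1", "chr1", 100, 250, 150, 60),    # passes all filters
    Read("r2", "chr1", 1050, 1200, 150, 60),  # falls in an off-target interval
    Read("r3", "chr2", 100, 250, 150, 3),     # low mapping quality
    Read("r4", "chr2", 100, 130, 30, 60),     # too short
]
kept = [r.name for r in reads if keep_read(r, OFF_TARGET)]
print(kept)
```

The thresholds are parameters on purpose, since the sensible MAPQ cut-off depends on the aligner's MAPQ scale.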

Other people will of course have their own ideas, which are welcome.

I really appreciate your question as it touches on what is a major issue in next generation sequencing.


Dear Kevin, thank you for your thoughtful answer and practical suggestions. I really appreciate your time.

Important note: I should point out that sequence similarity ≠ homology (I'm sure you know it as well as I do). As a quick reminder to everybody reading this post: "We infer homology when two sequences or structures share more similarity than would be expected by chance. Common ancestry explains excess similarity (other explanations require similar structures to arise independently); thus excess similarity implies common ancestry". (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/) "Phrases like “sequence (structural) homology”, “high homology”, “significant homology”, or even “35% homology” are as common, even in top scientific journals, as they are absurd, considering the above definition. In all of the above cases, the term “homology” is used basically as a glorified substitute for “sequence (or structural) similarity”" (see https://www.ncbi.nlm.nih.gov/books/NBK20255/ for more thoughts on this essential topic).

Returning to the question:

  1. I guess I should modify my interval list for variant calling by excluding poorly covered regions (those originally targeted regions that are prone to ambiguous mapping) and just admit that these are not analyzable using the current approach. What do you think?

  2. Which factors must one consider when setting a read-length threshold?

  3. Trimming bases is carried out primarily at the 3'-end, right? As far as I've heard, if you are in need of trimming the 5'-end, then your data is rather lousy and trimming is flogging a dead horse and is generally not recommended. I may be wrong.

Suggestions are welcome! Have a great day!


No problem :)

Yes, of course, I realised that 'homology' was a misused term many years ago. When even the top scientists in the field misuse it, what can one do? My statement regarding the high level of sequence similarity in the human genome through the presence of processed or unprocessed pseudogenes is still critical, though. These arose through DNA duplication events or RNA species becoming incorporated back into the genome after having been transcribed. There are undoubtedly many other mechanisms through which these similar sequences have arisen. As pseudogenes are mostly non-functional and less conserved, they pick up more mutations over time, and thus lose similarity to their gene of origin.

Regarding point #1, our test data from the National Health Service in the UK shows that the minimum read depth at which you should be calling a variant is 18 reads, with 30 being optimal. This is for a clinical setting, though. I have seen true (Sanger-confirmed) variant calls at a read depth as low as 2. If you are not interested in variants in the off-target regions, then I would just remove them (unless you would like to consider analysing them as a separate research project of some sort). If you use a high MAPQ threshold for filtering and then see >18 reads over your regions of interest, then that is sufficient. I also knew a guy once who was happy with a read depth of 10 as a cut-off (taking advice from Baylor College of Medicine in Texas).
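
As a rough illustration of applying those depth cut-offs per region, here is a small Python sketch. The region names and depth values are made up, and treating exactly 18 reads as callable is my assumption; the per-region depths would normally come from a coverage tool run on the MAPQ-filtered alignments.

```python
def classify_depth(depth, minimum=18, optimal=30):
    """Label a region by read depth using the 18/30 cut-offs quoted above."""
    if depth >= optimal:
        return "optimal"
    if depth >= minimum:
        return "callable"
    return "below_minimum"

# Hypothetical per-region depths after MAPQ filtering
region_depth = {"target_A": 1200, "target_B": 22, "target_C": 9}

labels = {name: classify_depth(d) for name, d in region_depth.items()}
for name, label in labels.items():
    print(name, region_depth[name], label)
```

Regions labelled `below_minimum` are the candidates for dropping from the interval list or for follow-up by Sanger sequencing.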

You could still work with Illumina in order to improve the unique mapping of the primers for these regions. Otherwise you could design a new panel with Agilent, as they appear to take greater care when designing their primers for these specific problems.

Regarding point #2, the aligner is the critical choice. BWA-MEM works best with reads > 70bp in length, whereas Bowtie can work fine with reads as short as 30bp. Whatever its length, if a read maps uniquely to the genome, then all is well.

Regarding point #3, yes, trimming is done from the 3' end because, on Illumina sequencers, quality suffers more at that end. I never specifically mentioned trimming at the 5' end. After all trimming has been performed, if the read then falls below the read-length threshold (70 bp for BWA-MEM), it's eliminated.
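
The trim-then-filter order described above can be sketched in Python as follows. This is a deliberately naive version that just strips trailing bases below the quality threshold. Real trimmers typically use sliding-window or running-sum schemes, and the function names and toy reads here are hypothetical.

```python
def trim_3prime(seq, quals, min_q=30):
    """Strip bases from the 3' end while their Phred quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def trim_and_filter(seq, quals, min_q=30, min_len=70):
    """Trim first, then discard the read if it is now shorter than min_len."""
    trimmed_seq, trimmed_quals = trim_3prime(seq, quals, min_q)
    if len(trimmed_seq) < min_len:
        return None  # read is eliminated
    return trimmed_seq, trimmed_quals

# Toy 80 bp reads: good quality up front, poor quality at the 3' end
read_ok = trim_and_filter("A" * 80, [35] * 75 + [10] * 5)
read_short = trim_and_filter("A" * 80, [35] * 60 + [10] * 20)

print(len(read_ok[0]) if read_ok else None)
print(read_short)
```

With these toy inputs, the first read loses its last 5 bases and is kept, while the second drops to 60 bp and is discarded under the 70 bp threshold.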

I trust that this further information helps you to decide which is the best approach for your data.

Good luck! :)
