Question

Haloplex & Allele Calling

4

12.3 years ago

Pierre Lindenbaum 166k

My lab has started using the Haloplex technology to capture the regions of interest.

With this technology, most of the reads share the same start and end positions (I found that removing the duplicates would reduce the coverage to ~1x). The reads are grouped in vertical 'clusters' that can contain more than 1000 reads.
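To illustrate why duplicate removal collapses the coverage, here is a minimal sketch. It assumes duplicates are identified by identical coordinates and strand, which is a simplification of what real duplicate-marking tools do; the read tuples and the `dedup_depth` helper are hypothetical, not part of any library.

```python
def dedup_depth(reads):
    """Return (depth before, depth after) collapsing reads that share
    the same (chrom, start, end, strand) key.

    This is a simplified stand-in for position-based duplicate removal.
    """
    unique = set(reads)
    return len(reads), len(unique)


# Hypothetical amplicon: 1000 reads that all share the same fragment ends,
# as happens when the fragment ends are fixed by restriction sites.
amplicon_reads = [("chr1", 1000, 1150, "+")] * 1000

before, after = dedup_depth(amplicon_reads)
print(before, after)  # 1000 1
```

Every read in the cluster has the same key, so a position-based filter keeps a single read no matter how many independent molecules were sequenced.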

Have you ever used this technology? I'm currently using samtools mpileup with the options "-A" (count anomalous read pairs) and "-d 8000" (max per-BAM depth).

I'm afraid there is some kind of bias with HaloPlex: is it safe to use samtools or GATK to call the variants?

Pierre

allele calling next-gen samtools gatk duplicates • 5.8k views

updated 12.3 years ago by User 59 13k • written 12.3 years ago by Pierre Lindenbaum 166k

score 4 · Accepted Answer · 2013-01-08

In general, any capture-based technology will have bias. In your case, you're worried that the bias will be amplified in the detection step.

I would use as many methods as possible to call these alleles, and see what the differences are. With coverage this high, there should be little variation in results.
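As a rough sketch of that comparison, assuming each caller's output has already been reduced to a set of (chrom, pos, ref, alt) tuples: the `compare_callsets` helper and the example calls below are hypothetical, and parsing the VCFs is left out.

```python
def compare_callsets(callsets):
    """Split calls into those shared by every caller and those private to one.

    callsets: dict mapping caller name -> set of (chrom, pos, ref, alt).
    Returns (shared, private), where private maps caller name -> set.
    """
    shared = set.intersection(*callsets.values())
    private = {}
    for name, calls in callsets.items():
        # Union of everything reported by the other callers.
        others = set().union(*(c for n, c in callsets.items() if n != name))
        private[name] = calls - others
    return shared, private


# Made-up calls, for illustration only.
callsets = {
    "samtools": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
    "gatk": {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A")},
    "freebayes": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
}

shared, private = compare_callsets(callsets)
print(sorted(shared))
print({name: sorted(calls) for name, calls in private.items()})
```

Calls private to a single caller are the ones worth inspecting by eye in the alignments.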

Furthermore, simulation could provide a way to assess potential bias in the case of HaloPlex resequencing. In simulated tests, I have observed that samtools and GATK have lower sensitivity than our caller (freebayes) at high depths, but all perform roughly the same at lower coverage. It's not entirely clear to me why this is, but I suspect default parameter selection may play a role. I haven't evaluated other callers, and it's possible that others may perform even better in high-depth contexts.

score 3 · Accepted Answer · 2013-01-08

Agilent will say that you should not de-duplicate Halo data. The vertical clusters are due to the placement of the restriction enzyme sites used to create the capture kits. There's very little off-target capture with Halo, and the coverage can vary widely across a capture region (I think of it as a 'Manhattan skyline').

You can use samtools, GATK or VarScan to call variants from Halo data, and I have used all three. However, if you use GATK and post-filter UnifiedGenotyper calls based on assumptions made for exome or WGS data, you will probably find variants that fail your standard filters but are almost certainly genuine. Strand bias filters in particular seem to be the most prone to being tripped.
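To show how a strand bias filter can be tripped, here is a small sketch that applies a two-sided Fisher's exact test to a 2x2 table of reference/alternate reads by strand and reports a Phred-scaled p-value. This is only in the spirit of strand bias annotations; it is not GATK's implementation, and the read counts and any threshold you might apply are assumptions for illustration.

```python
from math import comb, log10


def fisher_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact p-value for [[ref_fwd, ref_rev], [alt_fwd, alt_rev]]."""
    row1 = ref_fwd + ref_rev
    row2 = alt_fwd + alt_rev
    col1 = ref_fwd + alt_fwd
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x forward reads in the ref row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(ref_fwd)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the tables that are at least as unlikely as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(total, 1.0)


def phred(p):
    """Phred-scale a p-value, guarding against log10(0)."""
    return -10 * log10(max(p, 1e-300))


# Hypothetical counts: balanced strands vs. alt reads seen on one strand only.
balanced = fisher_two_sided(50, 50, 50, 50)
skewed = fisher_two_sided(500, 500, 400, 0)
print(phred(balanced), phred(skewed))
```

With a one-sided alternate allele at high depth, the Phred-scaled score becomes very large, so a cutoff tuned on exome or WGS data would reject the site even if the variant is real.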

I spoke to Agilent technical support before Christmas about what they propose as best practice for genotyping, and the impression I got is that it's something they're working on but don't have a definitive answer for yet. I hope you're also trimming the reads in accordance with their guidelines.

EDIT: Since I wrote this post SureCall was released:

http://www.genomics.agilent.com/en/NGS-Data-Analysis-Software/SureCall/?cid=AG-PT-154

It is capable of analysing HaloPlex data and comes directly from Agilent to support the analysis of HaloPlex designs from the SureDesign wizard.