Recently we received re-sequencing data from a collaborator, generated by a Chinese company that performed ligation-mediated PCR targeted re-sequencing on an Illumina HiSeq machine.
We are currently having difficulty analyzing this dataset.
TL;DR: How would you recommend we analyze these data?
Some properties of the data:
- The capture used ligation-mediated PCR, so most reads share the same start and end positions
- Average depth of coverage is 2500x, with a maximum of up to 8000x
- We have 2000+ samples
Problems encountered:
- Should we mark duplicates? Without duplicate marking the depth is extremely high, which makes downstream analysis slow. However, because the company used ligation-mediated PCR, most reads share the same start and end, so duplicate marking (Picard tools) removes around 90% of all reads. We are uncertain whether this is too drastic (see the sketch after this list)...
- The -dcov parameter of GATK: ideally we should set it to roughly mean coverage x 10 (based on here), i.e. 25000 in our case, which makes the whole process very slow (also shown in the sketch below).
- Can UnifiedGenotyper or HaplotypeCaller handle such data? From our understanding, GATK uses a Bayesian model (here) for genotype calling. Although it uses a flat prior, will GATK's performance be negatively affected in our case, e.g. because of the unusual coverage distribution of the data?
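
For concreteness, here is a minimal sketch of the two steps above, assuming GATK 3.x and Picard command-line syntax; the file names (sample.bam, ref.fasta, jar paths) are placeholders, not our actual setup:

```bash
# Mark (not remove) duplicates with Picard; REMOVE_DUPLICATES defaults to false,
# so duplicate reads are only flagged and downstream tools can decide how to treat them.
java -jar picard.jar MarkDuplicates \
    INPUT=sample.bam \
    OUTPUT=sample.markdup.bam \
    METRICS_FILE=sample.markdup.metrics.txt

# UnifiedGenotyper with the per-sample downsampling limit raised to ~10x the mean coverage.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R ref.fasta \
    -I sample.markdup.bam \
    -dcov 25000 \
    -o sample.vcf
```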
Current ideas:
- Perhaps the easiest way to analyse the data is simply to follow the standard GATK pipeline (e.g. MarkDuplicates -> Realignment -> BQSR -> HaplotypeCaller / UnifiedGenotyper; a sketch follows this list). Is this approach appropriate, or should we skip the duplicate marking?
- Considering the extremely high coverage and the redundancy of the reads, another idea is to perform a de novo assembly of the reads and then align the resulting contigs to the reference genome for calling. However, this seems very unconventional and we are uncertain whether it would work.
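
For reference, a sketch of the standard pipeline in the order described above, assuming GATK 3.x tools and placeholder resource files (known_indels.vcf, dbsnp.vcf):

```bash
# 1. Indel realignment around known and detected indels (GATK 3.x)
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R ref.fasta -I sample.markdup.bam \
    -known known_indels.vcf -o sample.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R ref.fasta -I sample.markdup.bam \
    -targetIntervals sample.intervals -o sample.realigned.bam

# 2. Base quality score recalibration
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ref.fasta -I sample.realigned.bam \
    -knownSites dbsnp.vcf -o sample.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R ref.fasta -I sample.realigned.bam \
    -BQSR sample.recal.table -o sample.recal.bam

# 3. Variant calling with HaplotypeCaller
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fasta -I sample.recal.bam -o sample.vcf
```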
We are rather stuck, as we have never dealt with such data before. Does anyone have experience with this kind of data? Thank you for your input!
If they did a capture, you should mark duplicates.
If it is human data, the GATK pipeline is fine. Use the gVCF route for that many samples (sketched below).
Is it paired-end data?
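
As an illustration of the gVCF route mentioned above (a sketch assuming GATK 3.x HaplotypeCaller; sample names are placeholders):

```bash
# Per-sample gVCFs; this scales to thousands of samples because joint
# genotyping is deferred to a separate, much cheaper step.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fasta -I sample1.recal.bam \
    --emitRefConfidence GVCF -o sample1.g.vcf

# Joint genotyping across all samples (optionally batch gVCFs with CombineGVCFs first)
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fasta \
    --variant sample1.g.vcf \
    --variant sample2.g.vcf \
    -o cohort.vcf
```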