Question

Why does ExpansionHunter ignore unmapped reads?

1

Entering edit mode

5.5 years ago

Duarte Molha ▴ 240

Hi Guys

I have decided to test ExpansionHunter on a well characterised FragileX sample... but one paragraph in their help page got me confused

https://github.com/Illumina/ExpansionHunter/blob/master/docs/01_Introduction.md

The current release (v3.x.x) forgoes analysis of unaligned reads. Although some informative read pairs are occasionally unaligned, according to our benchmarks these reads are very unlikely to change pathogenic/normal classification of the currently-covered repeats. We encourage you to contact the authors if you are working with the regions for which this is a concern.

This makes very little sense to me.

For example, a fragileX sample will have a triplet expansion of CGG in the hundreds of copies. The current HG38 assembly for this region contains only around 30 of these triplets, so the many hundreds, potentialy thousands of reads that consist of only this triplet expansion will not map againts the reference genome.

So how can you detect the expansion if you are deliberatly ignoring the unmapped reads that correspond to this expansion?

Can anyone clarify this for me?

STR • 1.9k views

ADD COMMENT • link 5.5 years ago by Duarte Molha ▴ 240

0

Entering edit mode

If the reference has a region of sufficient motif, then your reads will align to it. Maybe you end up with very high coverage of the tandem repeat segment. In my experience the unmapped reads tend to be technical artifacts or low quality data.

ADD REPLY • link 5.5 years ago by karl.stamm 4.1k

0

Entering edit mode

If the region on the reference is a short version of the set of motifs and the biological sample has that motif deleted hundreds of times you will have reads that consist of only the motif ... Would a aligner not find it hard to align those reads to the genome has their would not be a good seed for it ?

I ask this because i have aligned a fragile X sample and I'm the region of where I know there is a over 500x triplet expansion BWA actually is unable to map almost any reads and I have 5 mapped reads on the area of the motif.

ADD REPLY • link 5.5 years ago by Duarte Molha ▴ 240

0

Entering edit mode

Take a look and the coverage profiles I am getting

In the region in the centre (the hotspot for fragile X STRs) the coverage drops. I thought originally that this made sense since bwa mem simply could not find a seed point at which to anchor the many reads that consist simply of the STR sequence.

This is a coverage profile comparison of a sample with fragileX (over 500 copies of the trimer) and a normal sample (nor characterised ... but somewhere between 5 and 45 copies is considered normal. The reference at that position contains 20 copies of the expected STR "GCC"

FMR1 coverage profile comparison of a fragileX sample with a control sample

ADD REPLY • link 5.5 years ago by Duarte Molha ▴ 240

0

Entering edit mode

I have now looked into the non-mapped reads and you are probably correct... I see no evidence of STR reads unmapped... I am guessing therefore that this is the result of my capture baits unable to capture DNA fragments that are very very GC rich (100% in fact) or realy the bias of the PCR unable to amplify these fragments

ADD REPLY • link 5.5 years ago by Duarte Molha ▴ 240

0

Entering edit mode

What was the sequencing technique? I've seen coverage humps like that when using Whole Exome kits. It's not intended to capture broad sections of genome. There may be other areas with similar sequence motif that are 'stealing' your reads, since this one only has 60bp (20x3) and you may have 100bp reads of 'GCC' gone elsewhere. I don't know that GC bias would cause a complete loss of those reads, just a reduction (hopefully simlarly reduced in both samples, allowing for normal/control comparisons). You may need long-read sequencing. When I need to see a particular sequence, I like to add it to the genome reference explicitly, or use a specialized aligner, like STAR which is designed for a transcript database (what BWA would consider tens of thousands of miniature chromosomes).

ADD REPLY • link 5.5 years ago by karl.stamm 4.1k

0

Entering edit mode

It is a panel capture... but a large panel. I thought of that as well and I have looked at regions where fragileX has some homology... you can get these on the expansion hunter documentation

https://github.com/Illumina/ExpansionHunter/blob/master/docs/04_VariantCatalogFiles.md

and the coverage there is also non-existent.. if what you are describing was happening some of these regions would be getting some reads.

I think this points to a failure of PCR amplification of these fragments for the library prep... but that is only my guess

Thank you for your help

ADD REPLY • link 5.5 years ago by Duarte Molha ▴ 240