Question

Gaps in alignment

2

Entering edit mode

4.1 years ago

mikegrootde ▴ 20

I have aligned 100bp paired-end reads from 25x coverage sequencing experiment to a reference genome. When I view the alignment I sometimes see these large gaps where almost no reads align whatsoever:

enter image description here

Why does this happen and can I do something about it?

NGS mem bwa alignment • 3.4k views

ADD COMMENT • link 4.1 years ago by mikegrootde ▴ 20

0

Entering edit mode

Hi, on what reference genome did you align the reads ? What kind of sequencing data is it ? How were the sequencing libraries prepared ?

ADD REPLY • link 4.1 years ago by Carlo Yague 9.0k

0

Entering edit mode

Hi Carlo,

Reference genome: Clint_PTRv2/panTro6 (Chimpanzee) https://hgdownload.soe.ucsc.edu/goldenPath/panTro6/bigZips/panTro6.fa.gz)

Sequencing: Illumina Hiseq 2000 instrument, PCR-based library https://www.nature.com/articles/nature12228#Sec2

Sample: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=1920536

Alignment: bwa mem -T 0 (output no aligments score < 0)

ADD REPLY • link 4.1 years ago by mikegrootde ▴ 20

2

Entering edit mode

One possibility is that the gaps correspond to DNA sequences of repetitive nature. Since you have excluded multi-mapping reads (whose alignment score = 0), repeated regions are expected to have 0 coverage. To validate this, you could either rerun bwa without the option -T, or investigate if the gaps overlap annotated repeated regions.

ADD REPLY • link 4.1 years ago by Carlo Yague 9.0k

0

Entering edit mode

Hi Carlo,

Thanks a lot for your suggestion!

From the bwa manual:

-T INT Don’t output alignment with score lower than INT. This option only affects output. [30]

With -T set to 0: If reads with score lower than 0 are omitted does't this mean reads with score exactly 0 are still in the output?

Also if I leave -T altogether won't bwa mem resort to default value [30] which would output no reads with a score < 30, leaving the same problem?

How do I allow multi-mapped reads in bwa mem?

It is very important for my downstream analysis to keep reads aligning to repetitive regions as I am doing variant calling on repeats.

ADD REPLY • link 4.1 years ago by mikegrootde ▴ 20

0

Entering edit mode

You have no choice but allow multi-mapped reads if you are interested in repeated regions. Also make sure that you are using an unmasked reference genome. I can not tell you much more than that because I'm not an expert on analysis on CGG short tandem repeats. From what I read, the number of repeats vary from individual to individual, so it might requires a more specific mapping strategy to capture the reads supporting CGG repeats. See this paper for instance.

ADD REPLY • link 4.1 years ago by Carlo Yague 9.0k

0

Entering edit mode

Im still unsure how to allow multi-mapped reads with bwa mem, see my previous comment. (Sorry I changed it, still getting used to biostars site).

Another option is -a:

-a Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.

Isn't that the way to do it?

ADD REPLY • link 4.1 years ago by mikegrootde ▴ 20

score 2 · Answer 1 · 2021-06-02

2

Entering edit mode

4.1 years ago

Istvan Albert 102k

First, you probably don't need to set the -T flag, I believe that to be the alignment score (the sum of reward + penalties).

If the score were zero, the alignment did not really work. I don't think -T =0 does anything. I think -T it is more of a way to select high scoring alignments rather than low-scoring ones. bwa won't even look for low-scoring alignments to begin with.

In general, you should aim to produce more/all alignments, then filter your BAM file later, rather than forcing the aligner to make various decisions. There is little to be gained in most cases.

The mapping quality filter is -q, (in samtools) would be a way to filter out multi mapped reads as bwa assigns a mapQ=0 for multimapped reads.

Now as for the gaps, it is a bit curious, but what region is this, show more information, what regions is this, also download some of the SNP/annotation data they have on the website, see if anything overlaps here

https://www.biologiaevolutiva.org/greatape/data.html

the human genome took many tries to complete, the chimp has nowhere that much effort put into it.

ADD COMMENT • link 4.1 years ago by Istvan Albert 102k

0

Entering edit mode

Thank you for your reply. I believe the gaps are regions with CGG repeats and overall high GC content. I pasted the sequences corresponding to some gaps in a GC content calculator and get GC% of around 70.

Could it be possible that there are just no high GC% reads in this chimp data because of the bias in the PCR library amplification step used in sequencing?

ADD REPLY • link 4.1 years ago by mikegrootde ▴ 20

0

Entering edit mode

Yes that is possible,

In addition assembling these regions is also much more difficult and error-prone.

Pick an additional newer data, or perhaps one with long-read technologies, and see what happens.

ADD REPLY • link 4.1 years ago by Istvan Albert 102k

score 0 · Answer 2 · 2021-06-09

0

Entering edit mode

4.1 years ago

mikegrootde ▴ 20

I tried a PCR-free dataset and it works like a charm. We need more PCR-free WGS!! Thank you for your comments Istvan and Carlo

ADD COMMENT • link 4.1 years ago by mikegrootde ▴ 20