First, you probably don't need to set the -T
flag, I believe that to be the alignment score (the sum of reward + penalties).
If the score were zero, the alignment did not really work. I don't think -T =0
does anything. I think -T
it is more of a way to select high scoring alignments rather than low-scoring ones. bwa
won't even look for low-scoring alignments to begin with.
In general, you should aim to produce more/all alignments, then filter your BAM file later, rather than forcing the aligner to make various decisions. There is little to be gained in most cases.
The mapping quality filter is -q
, (in samtools
) would be a way to filter out multi mapped reads as bwa assigns a mapQ=0
for multimapped reads.
Now as for the gaps, it is a bit curious, but what region is this, show more information, what regions is this, also download some of the SNP/annotation data they have on the website, see if anything overlaps here
https://www.biologiaevolutiva.org/greatape/data.html
the human genome took many tries to complete, the chimp has nowhere that much effort put into it.
Hi, on what reference genome did you align the reads ? What kind of sequencing data is it ? How were the sequencing libraries prepared ?
Hi Carlo,
Reference genome: Clint_PTRv2/panTro6 (Chimpanzee) https://hgdownload.soe.ucsc.edu/goldenPath/panTro6/bigZips/panTro6.fa.gz)
Sequencing: Illumina Hiseq 2000 instrument, PCR-based library https://www.nature.com/articles/nature12228#Sec2
Sample: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=1920536
Alignment: bwa mem -T 0 (output no aligments score < 0)
One possibility is that the gaps correspond to DNA sequences of repetitive nature. Since you have excluded multi-mapping reads (whose alignment score = 0), repeated regions are expected to have 0 coverage. To validate this, you could either rerun bwa without the option
-T
, or investigate if the gaps overlap annotated repeated regions.Hi Carlo,
Thanks a lot for your suggestion!
From the bwa manual:
-T INT Don’t output alignment with score lower than INT. This option only affects output. [30]
With -T set to 0: If reads with score lower than 0 are omitted does't this mean reads with score exactly 0 are still in the output?
Also if I leave -T altogether won't bwa mem resort to default value [30] which would output no reads with a score < 30, leaving the same problem?
How do I allow multi-mapped reads in bwa mem?
It is very important for my downstream analysis to keep reads aligning to repetitive regions as I am doing variant calling on repeats.
You have no choice but allow multi-mapped reads if you are interested in repeated regions. Also make sure that you are using an unmasked reference genome. I can not tell you much more than that because I'm not an expert on analysis on CGG short tandem repeats. From what I read, the number of repeats vary from individual to individual, so it might requires a more specific mapping strategy to capture the reads supporting CGG repeats. See this paper for instance.
Im still unsure how to allow multi-mapped reads with bwa mem, see my previous comment. (Sorry I changed it, still getting used to biostars site).
Another option is -a:
-a Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
Isn't that the way to do it?