BWA Soft-Clipping Pattern In Read Length Distribution Using "-q 20" Parameter
12.1 years ago
toni ★ 2.2k

Hi all,

I just had to analyze a really bad lane. It is an old Illumina GAIIx lane with reads of 152 cycles. The last 30-40 cycles are of very poor quality on average, so I decided to use the -q 20 switch in bwa aln to trim the 3' ends of the reads based on quality prior to mapping (something I usually would not do).

To see what this trimming parameter left in the BAM file, I plotted the distribution of the length of the soft-clipped part of the reads. To do so, I took the CIGAR strings of 20 million alignments that BWA declared unique (XT:A:U tag). Here is what I got:
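For readers who want to reproduce this kind of histogram, here is a minimal sketch of the counting step. It is not the script used for the plot. It assumes SAM text lines as input (for example from `samtools view`), and the function names `soft_clipped_length` and `clip_histogram` are made up for illustration.

```python
import re
from collections import Counter

# Matches every soft-clip operation in a CIGAR string, e.g. the "32" in "120M32S".
SOFT_CLIP = re.compile(r"(\d+)S")


def soft_clipped_length(cigar):
    """Total number of soft-clipped bases in one CIGAR string."""
    return sum(int(n) for n in SOFT_CLIP.findall(cigar))


def clip_histogram(sam_lines):
    """Count soft-clipped lengths over alignments tagged XT:A:U.

    `sam_lines` is any iterable of SAM text lines (hypothetical input).
    Header lines and alignments without the XT:A:U tag are skipped.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if "XT:A:U" not in fields[11:]:
            continue  # keep only alignments BWA declared unique
        counts[soft_clipped_length(fields[5])] += 1  # column 6 is the CIGAR
    return counts
```

Passing `sys.stdin` to `clip_histogram` while piping in the output of `samtools view` would give the counts per soft-clipped length, which can then be plotted.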

[Figure: Soft clipping pattern]

We can see a periodic pattern after the main peak (which represents reads with no soft-clipped bases). We can also notice the slightly higher bar at position 117. This is expected: 152 - 117 = 35, and by default BWA won't trim reads to less than 35 bp.

Have you already noticed such a pattern using bwa aln -q, and what in the algorithm produces it?

Looking at the -q definition in the doc, I cannot see any reason why the trimming would have such a periodic pattern.
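For reference, the bwa manual defines -q as trimming a read down to argmax_x{sum_{i=x+1}^{l}(INT - q_i)} if q_l < INT, where l is the original read length. Below is a rough sketch of that formula, not BWA's actual code. The function name `trimmed_length` is made up, the 35 bp floor is the default mentioned above, keeping the longest read on ties is an assumption, and the real implementation may differ in details.

```python
def trimmed_length(quals, threshold=20, min_len=35):
    """Length kept after 3' quality trimming, per the documented -q formula.

    `quals` holds the Phred scores of one read (plain integers).
    `threshold` plays the role of INT in the manual.
    `min_len` is the 35 bp floor mentioned in the post (assumed default).
    """
    read_len = len(quals)
    # The manual only trims when the last base is below the threshold.
    if read_len == 0 or quals[-1] >= threshold:
        return read_len

    best_sum = 0
    best_x = read_len  # no trimming unless some suffix scores above zero
    running = 0
    # Walk from the 3' end; x is the number of bases that would be kept.
    for x in range(read_len - 1, min_len - 1, -1):
        running += threshold - quals[x]
        if running > best_sum:  # strict: ties keep the longer read (assumption)
            best_sum = running
            best_x = x
    return best_x
```

With a read of 152 cycles whose qualities are all low, this returns 35, that is 117 clipped bases, which matches the higher bar at 117. Nothing in the formula itself is periodic, so any periodicity in the output would have to come from the quality values that go in.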

Thanks guys.

T.

bwa alignment quality • 5.1k views

I have noticed this before. It comes from the Illumina sequencing; see: The Meaning Of B In Illumina 1.5 Pipeline Data?


Thank you for this information.


awesome text plots! ;-) could you please add this as an answer as well?


Is it possible that the qualities in the file have some sort of periodicity?


I do not think so, but now that you raised the point, I will probably do a few checks on this :)


Looking at the link given in the comment below, it seems that you were right!

12.1 years ago

This has been noted before, and Illumina tech support gave this answer:

The cyclic nature is a result of several factors:

The neighborhood analysis aspect of Illumina Q-scoring results in a 5-cycle cycle nature to the scores. This has been noted in the past. The other aspect is the quality of the sample. In certain cases this cyclic pattern may be more pronounced in some samples. The use of the Trim_Seq option may also result in a heightened presentation of the cycles, especially in cases of more aggressive trimming.


I accept this as the right answer, because I don't think we will get a deeper explanation of this "problem". But that is enough to understand the putative origin of this pattern. Thanks again.
