How to explain fastp insert size peak
3.7 years ago
Michael ▴ 270

How do you explain the following insert-size peak? The data are NovaSeq PE 150 bp, with insert size estimated by fastp. I assume it is an artifact of fastp's insert size estimation, but I am not sure how it arises.

It is exactly at the read length of 151 bp (fastp runs combined with MultiQC):

[Figure: insert size distribution from the MultiQC fastp report]

RNAseq trimming • 5.1k views

I am not a fastp user, so I don't know how it calculates insert size. My assumption would be that it overlaps the R1/R2 reads, so that peak may represent reads that overlap by just one bp.

If you are interested in calculating the insert sizes, you could also try the BBMap suite: C: Target fragment size versus final insert size


I don't think it is as you described. fastp does calculate overlaps from R1/R2 pairs, but I think it only goes down to an overlap of about 30 bp. With fewer bases, the risk of a chance overlap increases, especially in repetitive regions.

So I think the peak is where R1 and R2 align pretty much exactly start to end. But I still do not see where the peak comes from.

EDIT: given the low percentages on the Y-axis, the peak is not that extreme. I still want to understand how this happens...
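
To make the "aligning start to end" case concrete, here is a minimal sketch of overlap-based insert size estimation. It is not fastp's actual code: the function names are made up for illustration, it requires exact matches (fastp tolerates some mismatches), and the 30 bp minimum overlap is the assumption discussed above.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def insert_size_from_overlap(r1: str, r2: str, min_overlap: int = 30):
    """Estimate insert size from an exact R1/R2 overlap, or return None.

    Hypothetical helper: 'offset' is where reverse-complemented R2 starts
    relative to the start of R1. offset == 0 means both reads cover the
    same bases, so the insert size equals the read length.
    """
    r2_rc = reverse_complement(r2)
    lowest = -(len(r2_rc) - min_overlap)
    highest = len(r1) - min_overlap
    # Try the longest overlaps first (smallest absolute offset).
    for offset in sorted(range(lowest, highest + 1), key=abs):
        if offset >= 0:
            # Insert >= read length: tail of R1 overlaps head of R2_rc.
            a = r1[offset:]
            b = r2_rc[:len(a)]
        else:
            # Insert < read length: both reads run past the fragment end.
            b = r2_rc[-offset:]
            a = r1[:len(b)]
        n = min(len(a), len(b))
        if n >= min_overlap and a[:n] == b[:n]:
            return offset + len(r2_rc)
    return None
```

With 151 bp reads, a pair whose reads cover exactly the same 151 bases gives offset 0 and therefore an insert size of 151. That is the position of the peak in question, although the sketch does not explain why so many pairs would land there.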


"R1 and R2 are pretty much exactly aligning start to end"

Thinking about this again, that makes sense.

The fastp example report page says this:

This estimation is based on paired-end overlap analysis, and there are 3.771313% reads found not overlapped. The nonoverlapped read pairs may have insert size <30 or >272, or contain too much sequencing errors to be detected as overlapped.
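
The 30 and 272 bounds in that message are consistent with simple overlap arithmetic, assuming 151 bp reads and a 30 bp minimum overlap (both are assumptions on my part, not something the report states). A small sketch, with a hypothetical helper name:

```python
def detectable_insert_range(read_len_r1: int, read_len_r2: int, min_overlap: int):
    """Range of insert sizes that overlap analysis can resolve.

    Assumption: a pair counts as overlapped only if at least `min_overlap`
    bases are shared. The shortest resolvable insert is then `min_overlap`,
    and the longest is reached when the reads share exactly that many bases.
    """
    shortest = min_overlap
    longest = read_len_r1 + read_len_r2 - min_overlap
    return shortest, longest


print(detectable_insert_range(151, 151, 30))  # (30, 272)
```

So 151 + 151 - 30 = 272, which matches the upper bound quoted in the report.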

Even in the fastp example report there appears to be a peak at the same location as yours. MultiQC seems to be exaggerating the Y-scale a bit.


Thanks! I saw that on the fastp example page. But I still do not get how this is happening... :(


Likely an artifact, as you originally said. If you have a dataset with a different read length, see whether the peak shifts accordingly.
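
One way to run that check is to pull the histogram mode out of each fastp JSON report and compare it with the read length of that run. This is a rough sketch: it assumes the report has an `insert_size` object with a `histogram` list indexed by insert size in bp, which you should verify against your own files. The function names are hypothetical.

```python
import json


def load_report(path: str) -> dict:
    """Load a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)


def insert_size_mode(report: dict) -> int:
    """Return the insert size with the highest count.

    Assumption: report["insert_size"]["histogram"][i] is the number of
    read pairs with an estimated insert size of i bp.
    """
    histogram = report["insert_size"]["histogram"]
    return max(range(len(histogram)), key=histogram.__getitem__)


def peak_tracks_read_length(report: dict, read_length: int) -> bool:
    """True if the histogram mode sits exactly at the given read length."""
    return insert_size_mode(report) == read_length
```

If the mode is the main library peak rather than the narrow spike, inspect the count at `histogram[read_length]` relative to its neighbours instead.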
