Question

wrong quality plots in fastqc output

2.1 years ago

poecile.pal ▴ 50

Good morning,

I simulated reads from a reference genome using wgsim (distributed with samtools):

wgsim -N 30000000 -1 151 -2 151 -r 0 -R 0 -X 0 -e 0 genome.fasta Sample_R1.fastq Sample_R2.fastq

and obtained fastq files with such content:

@DQ898156.1_36602_37076_0:0:0_0:0:0_0/1
CTGTAGTCTGGCACTGCAAAAACAGGATACAGGTGTATATATGATATATATATATGTGTGGACATGTTGTGTATAAAGAACGAAAAAATGCGGATATGGTCGAATGGTAAAATTTCTCTTTGCCAAGGAGAAGATGCGGGTTCGATTCCCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@DQ898156.1_147753_148277_0:0:0_0:0:0_1/1
GGGATCCTCGCGGACAGAAAAAGATTGCAGTCAGTTTGATAATGATCGAGTGACATTGCTTCTTCGGCCCGAACCAAGGAATCCCTTAGATATGATGCAAAACGGATCTTGTTCTATCCTTGATCAGAGATTTCTCTATGAAAAAAACGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Then I ran FastQC. Surprisingly, the Per base sequence quality plots are bad:

[Figure: FastQC per base sequence quality plot]

At the same time, the character I corresponds to a high Phred quality score (Q40 in Phred+33 encoding)!

Also, I see 99.1% Dups in the report. But simply scrolling through the fastq file shows me that this is not the case.

Could you please explain the reason for such an unexpected FastQC result? (Maybe the FASTQ encoding was detected incorrectly.) Will other programs, such as bwa-mem2, work correctly with my simulated data?

Best regards, Poecile

samtools fastqc fastq wgsim • 1.0k views

updated 22 months ago by Ram 45k • written 2.1 years ago by poecile.pal ▴ 50

score 2 · Accepted Answer · 2023-03-21

2.1 years ago

GenoMax 151k

Per base sequence quality plots are bad

How so? Your Phred scores are so high that they do not even show up on your FastQC plot, since the Y-axis only goes up to Q34.
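To check the scores by hand, here is a minimal sketch that decodes a quality string, assuming the standard Phred+33 (Sanger/Illumina 1.8+) offset. The helper names `phred33_scores` and `error_probability` are made up for illustration and are not part of FastQC or wgsim.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# The simulated reads carry 151 'I' characters on every quality line.
scores = phred33_scores("I" * 151)
print(min(scores), max(scores))  # 40 40
print(error_probability(scores[0]))  # roughly 1 error in 10,000 bases
```

Every base decodes to Q40, which is consistent with the scores sitting above a Y-axis that stops at Q34.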

But simply scrolling through the fastq file shows me that this is not the case.

FastQC only tracks sequences that first appear in the first 100K reads when it estimates duplication. It also truncates reads over 75 bp down to 50 bp to keep the memory requirement under control. I don't know how wgsim simulates the reads, but if those reads happen to be represented multiple times later in the file, then you will see the result you have.
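A simplified sketch of that counting scheme is below. It is not FastQC's actual code: the function names are hypothetical, and the final percentage is a plain "non-first occurrences over counted reads" ratio rather than FastQC's own calculation. It only illustrates the two rules described above (truncation of long reads, and tracking only sequences first seen early in the file).

```python
def truncate_for_dedup(seq, max_len=75, keep=50):
    """Reads longer than max_len are cut to their first `keep` bases."""
    return seq[:keep] if len(seq) > max_len else seq


def estimate_duplication(seqs, track_limit=100_000):
    """Percent of counted reads that repeat an already tracked sequence.

    New sequences are only added to the table while we are within the
    first `track_limit` reads; later reads can only increment the count
    of a sequence that is already tracked.
    """
    counts = {}
    for i, seq in enumerate(seqs):
        key = truncate_for_dedup(seq)
        if key in counts:
            counts[key] += 1
        elif i < track_limit:
            counts[key] = 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


# Two 81 bp reads that differ only at the last base collapse to the
# same 50 bp key, so they count as duplicates of each other.
reads = ["A" * 80 + "C", "A" * 80 + "G", "ACGT"]
print(round(estimate_duplication(reads), 1))  # 33.3
```

Because of the truncation, two 151 bp reads only need to share their first 50 bases to be counted as duplicates, which is hard to spot by scrolling through the file.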

2.1 years ago by GenoMax 151k

Thank you so much for such a quick response!

Everything is clear with plots now.

As for duplication, it's a pity, I was hoping that it was a fastqc error :) Perhaps this is because I used a plastome as the reference, with its inverted repeats IRA and IRB... But they do not occupy such a large % of the sequence.

2.1 years ago by poecile.pal ▴ 50

I now understand why I got such a high percentage of duplication. The reference is only about 150K bp long, while I asked for 30M read pairs of 151 bp each. Of course, the reads overlap heavily.
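The numbers can be checked with a quick back-of-the-envelope calculation. This sketch assumes the reference is exactly 150,000 bp (the real size is only "about 150K") and that wgsim's -N counts read pairs, so the file pair holds 60M reads in total.

```python
genome_size = 150_000      # assumed reference length in bp
read_pairs = 30_000_000    # wgsim -N
read_length = 151          # wgsim -1 / -2

total_reads = read_pairs * 2
total_bases = total_reads * read_length

# Average depth of coverage over the reference.
coverage = total_bases / genome_size

# Average number of reads per reference position. With no simulated
# errors or mutations, reads starting at the same position on the same
# strand are identical.
reads_per_position = total_reads / genome_size

print(coverage)            # 60400.0
print(reads_per_position)  # 400.0
```

With roughly 60,000x coverage and hundreds of reads per possible start position, almost every read has an identical copy somewhere in the file, so a duplication level near 99% is expected.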

2.1 years ago by poecile.pal ▴ 50