FastQC: "Per base N content" fail
4 months ago
Eddie • 0

Hey, guys! I'm processing some whole-genome sequencing data, and some samples fail the FastQC "Per base N content" check. How does this failure affect SNV calling, SV calling, and downstream analysis? And what should I do to pin down the exact cause? Thank you very much!

[Images: FastQC "Per base N content" and "Per base sequence quality" plots]

FASTQC • 752 views

What's the nature of this data? It looks like it was processed already?

Your N content seems very specific to around 120 bp, with a corresponding decrease in sequence quality at that position. If this isn't expected, you can take a look at the reads to see what is at that position.
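One way to look at the reads directly is to count N calls per cycle yourself, which also avoids the position binning in the FastQC plot. This is only a minimal sketch: the function name `n_fraction_by_position` is made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines), optionally gzipped.

```python
import gzip
from collections import Counter


def n_fraction_by_position(fastq_path):
    """Return {1-based position: fraction of reads with an N at that position}."""
    # Assumption: gzipped files end in ".gz"; everything else is plain text.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    n_counts = Counter()  # reads with an N at each position
    depth = Counter()     # reads long enough to cover each position
    with opener(fastq_path, "rt") as handle:
        for line_no, line in enumerate(handle):
            # The sequence is the 2nd line of every 4-line FASTQ record.
            if line_no % 4 != 1:
                continue
            for pos, base in enumerate(line.strip().upper(), start=1):
                depth[pos] += 1
                if base == "N":
                    n_counts[pos] += 1
    return {pos: n_counts[pos] / depth[pos] for pos in sorted(depth)}
```

Printing the positions where the fraction exceeds some threshold of your choosing should show exactly which cycles are affected in each sample.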

What does the rest of the QC look like?


Thank you very much for your reply! You are right. The data was processed with fastp. But after comparing the FastQC results before and after fastp, I didn't find any differences.

Here are all the FastQC results before fastp. Following the reply below, I am showing the plots with MultiQC.

[Image: MultiQC plots of the FastQC results before fastp]

There are 15 samples from the same batch of sequencing data, and they all have the same issue.


Here are all the FastQC results after fastp.

[Image: MultiQC plots of the FastQC results after fastp]


Not all samples are affected. There are 2 clear groups of affected/unaffected in mean quality scores, and per sequence quality scores.

4 months ago
BioinfGuru ★ 2.1k

This sample looks great to me. I wish all my samples looked as good as that! There is nothing here to indicate a cause for concern. However, this is just one sample, and I'm sure you have many others in the dataset. I suggest running MultiQC, which gathers all the FastQC output files and plots them together in one display, making it easier to see the overall picture of what is going on across the whole dataset. That is, instead of looking at 20 individual "per base sequence quality" graphs, you see them all on one graph.


Thank you very much for your response. I've placed the MultiQC plots above~

4 months ago
GenoMax 147k

It appears that there was some kind of failure (hardware/software) around cycles 110-120 e.g. a bubble in the lane (happens at times). Only a part of the data (~20%) appears to be affected. You could proceed as is and see if the aligner handles those N's. If you have a large amount of data and can lose a fraction of the bases you could trim the reads containing N's, removing sequence starting at N's through the 3'-end of the read.
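The trimming described above (cut at the first N and drop everything 3' of it) could be sketched roughly like this. The function name `trim_at_first_n` and the `min_len` cutoff of 50 are illustrative assumptions, not settings from any particular trimmer, and the sketch treats each read as one contiguous stretch of sequence, so pick a cutoff that suits your read length and platform.

```python
def trim_at_first_n(seq, qual, min_len=50):
    """Cut a read at its first N; return None if what is left is too short.

    min_len is a hypothetical cutoff chosen for illustration only.
    """
    cut = seq.upper().find("N")
    if cut == -1:
        # No N in the read: keep it unchanged.
        return seq, qual
    if cut < min_len:
        # The N sits too close to the 5'-end to leave a useful read.
        return None
    # Keep only the bases (and matching qualities) 5' of the first N.
    return seq[:cut], qual[:cut]
```

For paired-end data you would also need to decide what to do with the mate when one read of a pair is dropped, which this sketch does not handle.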


Thank you very much for your suggestion! How can I ensure that the aligners have correctly handled these N bases?


Align and go through your analysis to see if this affects your SNP calls. As you posted above, fastp appears to have left these N calls in. If this is something critical (e.g. clinical samples) then remove the part of the reads containing N and everything to 3' of that.
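As a first, rough check before looking at the SNP calls themselves, you could compare how often reads with and without N's end up mapped. The sketch below is an assumption-laden illustration, not an established QC step: `mapping_summary` is a made-up name, it reads plain SAM text, and it only looks at the FLAG and SEQ columns of primary records.

```python
def mapping_summary(sam_lines):
    """Tally primary alignments for reads with and without N in SEQ.

    Returns {True: {...}, False: {...}}, keyed by "read contains an N".
    """
    stats = {
        True: {"reads": 0, "mapped": 0},
        False: {"reads": 0, "mapped": 0},
    }
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, seq = int(fields[1]), fields[9]
        # Skip secondary (0x100) and supplementary (0x800) records so
        # that each read is counted once.
        if flag & 0x900:
            continue
        has_n = "N" in seq.upper()
        stats[has_n]["reads"] += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            stats[has_n]["mapped"] += 1
    return stats
```

You could feed it the text output of `samtools view -h` through `sys.stdin`. A clearly lower mapped fraction in the N-containing group would be a hint that trimming is worth the lost bases.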


@Genomax: Just trying to flesh out my understanding (in the context of genome/transcriptome analysis...not SNPs)

Considering the phred score is still in the green... I would have not thought twice about breezing past this, and would not want to trim the reads back to before the Ns-peak.

So how would we know if the aligner handles those N's? Just to keep the answer focused: what is the best-case result from the aligner, and what is the worst case? Surely the aligner won't fail; it will just have fewer mapped reads (or mapped pairs) or more discordant reads in those samples, right?

As regards trimming reads back to, say, 120 bp in affected samples, surely that is more than enough length for mapping pairs. Although I'm guessing transcriptome analysis would be affected if we trim 30 bp off every read in a bunch of samples.


We don't exactly know how many cycles are affected since FastQC output is binned but the length appears to be significant enough (perhaps 10 cycles). I am going on the assumption that this was a hiccup with the sequencer and the remaining data after the N's will align well. There should be enough read left (after removing N's + 3'-part) to allow specific enough alignments so the end result should not change a lot. Something OP should be able to verify.

This also appears to be Complete Genomics data (if the names are an indication), so it would require a different type of trimming than Illumina data.
