Question

FastQC sequence duplication and per base sequence content failed

2.1 years ago

waqaskhokhar999 ▴ 160

I have paired-end RNA-seq data (around 150 bp reads) for 321 samples, with around 30 million reads per sample, and I am interested in quantifying expression for transcriptome-wide association studies (TWAS). I have performed QC using FastQC. Most of the samples fail the per base sequence content check and show a high sequence duplication level.

[Image: MultiQC]

For per base sequence content, the MultiQC report shows that the base composition is skewed at the start of the reads, especially in the first 10 bases. The FastQC tutorial says "Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream". Should I leave it as it is, given that it should not affect the downstream analysis, or is there a way to correct for it?

[Image: Per_base_seq_content]
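To make the bias described above concrete, here is a minimal sketch that tabulates the fraction of A/C/G/T at each of the first few read positions. It is not FastQC's own implementation; the function names and the `max_pos` parameter are hypothetical, and it assumes a plain four-line-per-record FASTQ file.

```python
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each record in a four-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def per_base_content(reads, max_pos=20):
    """Return one dict per position with the fraction of A, C, G and T.

    Other symbols (for example N) count towards the total but are not
    reported, so the four fractions may sum to slightly less than 1.
    """
    counts = [Counter() for _ in range(max_pos)]
    for seq in reads:
        for i, base in enumerate(seq[:max_pos]):
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return result


# Hypothetical usage on an uncompressed FASTQ file:
# with open("sample_R1.fastq") as fh:
#     for pos, fractions in enumerate(per_base_content(read_sequences(fh)), 1):
#         print(pos, fractions)
```

In a random-primed library one would expect the first positions to deviate from the roughly flat profile seen further into the read, which is the pattern the FastQC module flags.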

For the sequence duplication failure: we sequenced to a depth of around 30M reads per sample in order to capture lowly expressed transcripts, so a high sequence duplication level is plausible. Do I need to do anything to control or resolve it?
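For a rough sense of what a duplication level measures, here is a small sketch that counts how many reads share an identical sequence. It is a simplification and not FastQC's exact procedure; the function name and the `prefix_len` cutoff are assumptions made for illustration.

```python
from collections import Counter


def duplication_summary(reads, prefix_len=50):
    """Summarise exact-sequence duplication among reads.

    Reads are compared on their first `prefix_len` bases (a hypothetical
    cutoff) so that errors towards the read ends do not hide duplicates.
    """
    counts = Counter(seq[:prefix_len] for seq in reads)
    total = sum(counts.values())
    distinct = len(counts)
    duplicate_fraction = 1 - distinct / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "duplicate_fraction": duplicate_fraction,
    }


# Toy example: one sequence seen twice among four reads.
summary = duplication_summary(["AAAA", "AAAA", "CCCC", "GGGG"])
print(summary)
```

In RNA-seq, a handful of highly expressed transcripts can contribute many identical reads at this depth, so a high fraction here does not by itself indicate a library problem.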

Am I good to proceed with quantification using Salmon, or do I need to take some action to improve the per base sequence content and duplication level? I have checked

Suggestions will be highly appreciated

RNA-seq fastqc • 866 views

updated 2.1 years ago by GenoMax 151k • written 2.1 years ago by waqaskhokhar999 ▴ 160

score 3 · Answer 1 · 2023-03-21

Yes, you are good to proceed. If you notice any issues later in the analysis, you can backtrack and revisit these.

Please see the following informative blog posts by the authors of FastQC, which should address your concerns:

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
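If you do go ahead with Salmon, a minimal sketch of assembling a paired-end `salmon quant` call from Python might look like the following. The paths, thread count, and function name are hypothetical. The `--seqBias` flag is an optional extra that this answer does not itself recommend: it asks Salmon to model sequence-specific bias such as random hexamer priming, so treat it as something to evaluate on your own data rather than a required step.

```python
import shlex


def salmon_quant_command(index_dir, r1, r2, out_dir, threads=8, seq_bias=True):
    """Build the argument list for a paired-end `salmon quant` run."""
    cmd = [
        "salmon", "quant",
        "-i", index_dir,   # prebuilt Salmon index (hypothetical path)
        "-l", "A",         # let Salmon infer the library type
        "-1", r1,
        "-2", r2,
        "-p", str(threads),
        "-o", out_dir,
    ]
    if seq_bias:
        # Optional: enable Salmon's sequence-specific bias model.
        cmd.append("--seqBias")
    return cmd


# Print the command for one hypothetical sample instead of running it.
cmd = salmon_quant_command(
    "salmon_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "quant/sample"
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.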