I have human bulk RNA-seq paired-end reads (R1, R2), and FastQC reports multiple overrepresented sequences that are not adapters. The per-base sequence content module also shows a warning. I used BLAT to check the overrepresented sequences, and they all map to either chrUn_GL000220v1 or chr14, except the sequence GGGGGG... from R2.
a) I need to trim the last 5 bases from both R1 and R2. I have read that the first 12 bases are fine and do not need to be trimmed for RNA-seq analysis (correct me if I am wrong). b) I also need to trim the overrepresented sequences, since they are contamination, except the GGGG... sequence, which did not align to any sequence in the human genome.
Below is the link to the reports: https://hmaryam0.wixsite.com/fastqc-reps
What should the order of trimming be? Should I trim A) everything in one run, B) 1. ends, then 2. overrepresented sequences, or C) 1. overrepresented sequences, then 2. ends? I have tried all three, and each gives different results.
A) cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq R1.fastq R2.fastq
B) 1. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_ends_R1.fastq -p tr_ends_R2.fastq R1.fastq R2.fastq
   2. cutadapt -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq tr_ends_R1.fastq tr_ends_R2.fastq
C) 1. cutadapt -a (different overreps) -A (different overreps) -o tr_overreps_R1.fastq -p tr_overreps_R2.fastq R1.fastq R2.fastq
   2. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -o tr_R1.fastq -p tr_R2.fastq tr_overreps_R1.fastq tr_overreps_R2.fastq
a) Correct, do not trim the initial 10-15 bases.
b) Do not do anything to overrepresented sequences if they are not adapters. Check whether they are rRNA; if they are not, removing them may throw away good data.
c) The poly-Gs are likely the "no signal = G" issue from 2-color chemistry. You can remove those stretches. See these informative blog posts:
https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
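Before removing the poly-G stretches, it can help to see how many reads actually end in one. The following is only a rough sketch: the function name count_polyg_tails and the default threshold of 10 are illustrative choices, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
#!/usr/bin/env bash
# Count reads whose 3' end is a run of at least MIN_G consecutive G bases.
# Usage: count_polyg_tails reads.fastq [MIN_G]
# Assumes an uncompressed FASTQ with exactly 4 lines per record.
count_polyg_tails() {
    local fastq="$1"
    local min_g="${2:-10}"   # illustrative default, not a recommended cutoff
    awk -v n="$min_g" '
        NR % 4 == 2 {                      # line 2 of each record is the sequence
            trimmed = $0
            sub(/G+$/, "", trimmed)        # strip the trailing run of G
            if (length($0) - length(trimmed) >= n) hits++
        }
        END { print hits + 0 }             # print 0 when no read qualifies
    ' "$fastq"
}
```

For example, `count_polyg_tails R2.fastq 20` prints the number of R2 reads ending in 20 or more Gs. For the trimming itself, cutadapt has a `--nextseq-trim` option intended for this 2-color chemistry artifact; check the documentation of your cutadapt version for how it behaves with paired-end input.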
Thank you very much for your response. The reads are not rRNA, but they do come from human chromosomes. Are they not considered contamination, then?
If they align to the correct genome, then they are not contamination. Some genes may be highly expressed, and sequences from them may show up as "over-represented".
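To check where the overrepresented sequences come from (rRNA, a highly expressed gene, or something else), they can be pulled out of the FastQC report as FASTA and submitted to BLAT or BLAST in one go. This is a minimal sketch: the function name overrep_to_fasta is made up for illustration, and it assumes the tab-separated fastqc_data.txt file found inside the unzipped FastQC output, where the module starts with ">>Overrepresented sequences" and ends with ">>END_MODULE".

```shell
#!/usr/bin/env bash
# Print the "Overrepresented sequences" module of a FastQC data file as FASTA.
# Usage: overrep_to_fasta fastqc_data.txt > overrep.fa
# Assumes the tab-separated layout of fastqc_data.txt (sequence in column 1,
# count in column 2) and ">>END_MODULE" closing each module.
overrep_to_fasta() {
    awk -F '\t' '
        /^>>Overrepresented sequences/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ {           # skip the "#Sequence ..." header row
            i++
            printf ">overrep_%d count=%s\n%s\n", i, $2, $1
        }
    ' "$1"
}
```

Running `overrep_to_fasta R1_fastqc/fastqc_data.txt > R1_overrep.fa` (the directory name here is hypothetical) gives one FASTA record per overrepresented sequence, with the read count kept in the header so the most abundant ones can be checked first.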