After trimming and QC'ing RNAseq for adapters with trim_galore (cutadapt + fastqc), should I remove over-represented sequences that FastQC identifies as a possible adapter/primer source?
[Is there a high chance that over-represented seqs will mess up my downstream gene quantification data; do I even care?]
If so, is there a way to do this automatically, similar to the following cutadapt option?
--action {trim,retain,mask,lowercase,none} (default: trim)
Specify what to do if an adapter match was found
Paired End 1
Sequence: [50 base sequence]
Count: 46619
Percentage: 0.11
Possible Source: TruSeq Adapter, Index 13 (97% over 39bp)
TruSeq Adapter, Index 13 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTG https://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2013/06/illumina-adapter-sequences_1000000002694-00.pdf
Paired End 2
Sequence: [50 base sequence]
Count: 50457
Percentage: 0.12
Possible Source: Illumina Single End PCR Primer 1 (100% over 50bp)
Original Command
trim_galore --illumina --fastqc --paired file_1.fq.gz file_2.fq.gz
- If it's a 100% match then why isn't it removed?
- Both of the identified sequences start with the same 11 bases
ATCGGAAGAGC
- When I tested a different paired-end sample it had no over-represented sequences
- 46K and 50K counts are really high. Is that only possible if these are MT RNA or rRNA?
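To put those counts in context, here is a rough Python sketch that back-calculates the library size implied by each FastQC entry. It assumes the reported Percentage is the count divided by the total number of reads in the file, rounded to two decimals; the function name `implied_total_reads` is made up for illustration and is not part of FastQC.

```python
def implied_total_reads(count, percentage, decimals=2):
    """Return (low, high) bounds on the total read count.

    Assumes percentage = 100 * count / total, rounded to `decimals` places,
    so the true percentage lies within half a rounding step of the report.
    """
    half_step = 0.5 * 10 ** (-decimals)
    low = count / ((percentage + half_step) / 100.0)
    high = count / ((percentage - half_step) / 100.0)
    return low, high


# Values copied from the FastQC report in the question.
r1_low, r1_high = implied_total_reads(46619, 0.11)
r2_low, r2_high = implied_total_reads(50457, 0.12)

print(f"Paired End 1: {r1_low / 1e6:.1f}M to {r1_high / 1e6:.1f}M reads")
print(f"Paired End 2: {r2_low / 1e6:.1f}M to {r2_high / 1e6:.1f}M reads")
```

Under that assumption both files point to a library of roughly the same size, in the tens of millions of reads, so each flagged sequence accounts for only about one read in a thousand.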
There is a core sequence present in all Illumina indexed adapters. Once that sequence is found, you should remove all sequence from that point to the 3' end of the read. I am not sure what you are asking here. I am not a trim_galore user, but if it understands Illumina adapters then the action to use is "trim".

RNAseq data will not necessarily have over-represented sequences identified as a matter of course. FastQC uses the following limit when scanning for these sequences
If the over-represented sequences are not identified as library adapters then leave them in the dataset for further analysis.
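The "find the core sequence, then drop everything 3' of it" idea can be sketched in a few lines of Python. This is only an illustration, not how cutadapt or trim_galore work internally: it uses exact matching, whereas real trimmers also tolerate mismatches and partial adapter matches at the read end. The core string is the 11-base prefix reported in the question, taken here as an assumption, and the function names and the `min_length` parameter are hypothetical.

```python
# Assumption: the 11 bases the question reports as shared by both
# over-represented sequences are used as the adapter "core".
CORE = "ATCGGAAGAGC"


def trim_at_core(read, core=CORE):
    """Return the read truncated at the first exact occurrence of `core`.

    Reads without the core are returned unchanged.
    """
    idx = read.find(core)
    return read if idx == -1 else read[:idx]


def trim_record(seq, qual, core=CORE, min_length=20):
    """Trim a (sequence, quality) pair together.

    Returns None when the trimmed read is shorter than `min_length`
    (a hypothetical cutoff for this sketch), so adapter-only reads drop out.
    """
    trimmed = trim_at_core(seq, core)
    if len(trimmed) < min_length:
        return None
    return trimmed, qual[: len(trimmed)]


insert = "ACGTACGTACGTACGTACGTACGT"              # 24 bases of real insert
read = insert + CORE + "ACACGTCTGAACTCC"          # insert followed by adapter
print(trim_record(read, "I" * len(read)))         # insert kept, adapter removed
print(trim_record(CORE + "ACACGTCTGAACTCC", "I" * 26))  # adapter-only read
```

A read that begins with the core trims down to nothing and is discarded by the length filter, which is the outcome you would want for adapter-derived reads like the ones FastQC flagged.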