Hello everyone, I am analyzing a published dataset. I ran FastQC first to check the data quality. The FastQC report suggests that the overrepresented sequence is a sequencing index adapter, but when I looked up the sequence for the index it indicated, the two didn't match. Here is the overrepresented sequence FastQC reported:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGATCTCGTATG
The possible source it gave is "TruSeq Adapter, Index 22 (97% over 49bp)". I looked up the sequence for index 22 online, and it is:
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGTAATCTCGTATGCCGTCTTCTGCTTG
They don't match. So my question is whether I should trim the sequence I found online or the overrepresented sequence FastQC reported. How can I be sure I trim the right sequence? Thank you very much.
They do match. The FastQC sequence starts with the initial A from the ligation, and the tail end of the sequence you found online simply isn't present in the sequence FastQC reports. Go with the FastQC sequence.
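For anyone who wants to check the two points in this reply for themselves, here is a minimal Python sketch. It is only an illustration, not how FastQC does its matching: it drops the leading A from the FastQC sequence and reports how far the remainder agrees with the index 22 sequence quoted in the question. The variable names are made up for this example.

```python
import os

# Sequences copied from the posts above.
fastqc_seq = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGATCTCGTATG"
index22_ref = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGTAATCTCGTATGCCGTCTTCTGCTTG"

# Is there a single extra A in front of the start of the reference sequence?
has_leading_a = fastqc_seq.startswith("A") and fastqc_seq[1:].startswith(index22_ref[:10])
trimmed = fastqc_seq[1:] if has_leading_a else fastqc_seq

# os.path.commonprefix works on plain strings, character by character.
shared = os.path.commonprefix([trimmed, index22_ref])

print("leading A present:", has_leading_a)
print("shared prefix length:", len(shared))
print("shared prefix:", shared)
print("reference bases not covered by the shared prefix:", len(index22_ref) - len(shared))
```

The sketch reports the length of the shared run rather than asserting a full match, so you can see exactly where the read and the published sequence stop agreeing.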
Thank you for your reply. If I use cutadapt to trim the adapter sequence, should I trim the same sequence from both files of my paired-end data? Like this:
cutadapt -a1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGATCTCGTATG -a2 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGATCTCGTATG SRR4011898_1.fastq SRR4011898_2.fastq -o H9_1.fastq H9_2.fastq
Also, some overrepresented sequences have no hit, so I have no clue where they come from. How should I trim these? Here is an example (sequence, count, percentage, possible source):
GGCTGCGACATCTGTCACCCCATTGATCGCCAGGGTTGATTCGGCTGATC 66551 0.16715454296615964 No Hit
GGCTGGCTAGGCGGGTGTCCCCTTCCTCCCTCACCGCTCCATGTGCGTCC 47823 0.12011587667008239 No Hit
I have one more question. At what percentage should I consider removing overrepresented sequences? One of my datasets has many overrepresented sequences, but each of them accounts for only about 0.1%-0.5% of the reads. Do I really need to remove them?
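To put those percentages in perspective, here is a small Python sketch of the arithmetic, using the two "No Hit" rows quoted above. It assumes the percentage column is simply 100 * count / total reads, and the function name is invented for this example.

```python
# (count, percentage) pairs copied from the FastQC table quoted above.
no_hit_rows = [
    (66551, 0.16715454296615964),
    (47823, 0.12011587667008239),
]

def estimate_total_reads(count, percent):
    """Invert percent = 100 * count / total to recover the total read count."""
    return count / (percent / 100.0)

totals = [estimate_total_reads(c, p) for c, p in no_hit_rows]
combined_percent = sum(p for _, p in no_hit_rows)

for (count, percent), total in zip(no_hit_rows, totals):
    print(f"{count} reads at {percent:.3f}% -> about {total:,.0f} reads in the file")
print(f"combined share of the two sequences: {combined_percent:.3f}%")
```

Both rows should point to roughly the same total, which is a quick sanity check that the counts and percentages belong to the same file.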
I usually ignore overrepresented sequences, especially in RNA-seq. It depends a lot on the library preparation, and at such low levels it's not a concern. You can run FastQC on the R2 reads; they should show the same adapter sequence. Run cutadapt as you intended (isn't it -a for forward and -A for reverse?).
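As a sketch of what that paired-end call could look like, the Python snippet below only assembles and prints the command so it can be inspected before running. It uses cutadapt's -a/-A options for the two reads and -o/-p for the two output files, with the file names from the question. Reusing the R1 adapter for R2 is an assumption taken from this reply; if FastQC on the R2 file reports a different sequence, use that one for -A instead.

```python
import shlex

# Adapter as reported by FastQC for read 1 (copied from the thread).
adapter_r1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGATCTCGTATG"
# ASSUMPTION: read 2 carries the same adapter, as the reply suggests.
# Confirm by running FastQC on the R2 file before trimming.
adapter_r2 = adapter_r1

cmd = [
    "cutadapt",
    "-a", adapter_r1,        # 3' adapter for the first read of each pair
    "-A", adapter_r2,        # 3' adapter for the second read of each pair
    "-o", "H9_1.fastq",      # trimmed R1 output
    "-p", "H9_2.fastq",      # trimmed R2 output
    "SRR4011898_1.fastq",    # R1 input
    "SRR4011898_2.fastq",    # R2 input
]

# Print the command instead of executing it.
print(shlex.join(cmd))
```

To actually run it, paste the printed line into a shell, or pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where cutadapt is installed.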