I have a lot of paired-end RNA-seq data of very good quality, but some of the files have a lot of overrepresented sequences that are not adapters. I ran BLAST on these sequences: some of them didn't match anything, and others seem to be rRNA. I understand that opinions are divided: some people say it is better to remove the overrepresented sequences, and others say there's no need to. This time I decided to remove them with cutadapt, because the overrepresented sequences vary from one file to another. But after removing them, the FastQC basic statistics of these files changed (sequence length 1-150) and NEW overrepresented sequences appeared (I wasn't expecting to get any beyond the initial ones). I'm thinking that maybe I made a mistake with cutadapt and want to try Trimmomatic, but I can't find an option in the manual where I can specify the sequence that I want to remove from a specific file (my impression is that with Trimmomatic I can remove only adapters that are recognized by the software). Can anyone give me advice about what to do in order to proceed with the (de novo) assembly?
Personally, I would not remove these overrepresented sequences for the reasons @h.mon explained below. And the fact that you observed "new" overrepresented sequences after removing the original ones means that some sequences will always be overrepresented with respect to others, again because of the reasons explained below.
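To illustrate why "new" overrepresented sequences can show up after removing the first ones, here is a minimal sketch on toy data. It assumes a FastQC-style rule that flags any sequence making up more than 0.1% of all reads; the reads, the helper names, and the counts are all hypothetical and only serve to show the effect of a shrinking denominator.

```python
from collections import Counter

def index_to_dna(i, length=12):
    """Encode an integer as a distinct DNA string (toy 'unique' reads)."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[i % 4])
        i //= 4
    return "".join(out)

def overrepresented(reads, threshold=0.001):
    """Return {sequence: fraction} for sequences above the threshold.

    The 0.1% default mirrors the FastQC-style cutoff assumed here.
    """
    total = len(reads)
    counts = Counter(reads)
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}

def remove_sequences(reads, to_remove):
    """Drop every read whose sequence is in `to_remove`."""
    return [r for r in reads if r not in to_remove]

# Hypothetical toy library of 10,000 reads:
RRNA_LIKE = "A" * 20      # 2,000 copies (20% of reads)
OTHER = "ACGT" * 5        # 9 copies (0.09%, just under the cutoff)
reads = [RRNA_LIKE] * 2000 + [OTHER] * 9 + [index_to_dna(i) for i in range(7991)]

first_pass = overrepresented(reads)
filtered = remove_sequences(reads, set(first_pass))
second_pass = overrepresented(filtered)

print("flagged before removal:", sorted(first_pass))
print("flagged after removal: ", sorted(second_pass))
```

In this toy case nothing about the second sequence changed; it crosses the cutoff only because the total number of reads dropped once the first sequence was removed.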
Having said that, if you really need to remove known/custom sequences from your fastq files and would like to use Trimmomatic for this, you would need to create a multi-FASTA file and refer to this file when calling Trimmomatic with the ILLUMINACLIP step. This assumes that your file, custom-fasta-file.fa, is placed under the adapters directory, which itself is within the original Trimmomatic-X.XX directory. Please remember that this is a crude workaround and would only work for sequences at the beginning (5') of your reads.
Thank you so much for your help!
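For the Trimmomatic workaround described in the answer above, a rough sketch of the two steps (writing the multi-FASTA file, then pointing ILLUMINACLIP at it) could look like the following. The sequence names, the sequences themselves, the read file names, and the output prefix are hypothetical placeholders; the jar path keeps the X.XX version placeholder and must be adjusted to your installation. The 2:30:10 values are the thresholds commonly shown in the Trimmomatic manual, not something tuned for this data, and the MINLEN step is an optional addition to drop very short reads left after clipping. The command is only built and printed here, not executed.

```python
import tempfile
from pathlib import Path

def write_multi_fasta(sequences, path):
    """Write {name: sequence} pairs as a multi-FASTA file."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        lines.append(seq)
    Path(path).write_text("\n".join(lines) + "\n")

def trimmomatic_pe_command(jar, fasta, r1, r2, out_prefix,
                           seed_mismatches=2, palindrome_threshold=30,
                           simple_threshold=10, min_len=None):
    """Build (but do not run) a paired-end Trimmomatic command line."""
    cmd = [
        "java", "-jar", str(jar), "PE", r1, r2,
        # Output order: forward paired, forward unpaired,
        # reverse paired, reverse unpaired.
        f"{out_prefix}_1P.fastq.gz", f"{out_prefix}_1U.fastq.gz",
        f"{out_prefix}_2P.fastq.gz", f"{out_prefix}_2U.fastq.gz",
        f"ILLUMINACLIP:{fasta}:{seed_mismatches}:"
        f"{palindrome_threshold}:{simple_threshold}",
    ]
    if min_len is not None:
        cmd.append(f"MINLEN:{min_len}")  # drop reads shorter than min_len
    return cmd

# Hypothetical overrepresented sequences copied from a FastQC report.
custom_sequences = {
    "overrep_1": "ACGTACGTACGTACGTACGTACGTACGTACGT",
    "overrep_2": "TTGACCGATTGACCGATTGACCGATTGACCGA",
}

# Written to a temporary directory for the demo; in the answer's setup this
# would be the adapters directory inside Trimmomatic-X.XX.
workdir = Path(tempfile.mkdtemp())
fasta_path = workdir / "custom-fasta-file.fa"
write_multi_fasta(custom_sequences, fasta_path)

cmd = trimmomatic_pe_command(
    jar="Trimmomatic-X.XX/trimmomatic-X.XX.jar",  # placeholder path
    fasta=fasta_path,
    r1="sample_R1.fastq.gz",
    r2="sample_R2.fastq.gz",
    out_prefix="sample_trimmed",
    min_len=36,
)
print(" ".join(cmd))
```

Once the paths are real, the list can be passed to subprocess.run(cmd, check=True). Since the overrepresented sequences differ between files, one FASTA file per sample would be needed.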
As long as you clean adapters (even that is not strictly necessary) you should be able to align your data and move forward. If you do have rRNA contamination (see if it is severe and/or variable among samples) then you would need to check on that to be sure that it is worth going forward with the analysis.
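As a small sketch of the "severe and/or variable among samples" check, assuming you already have per-sample counts of rRNA-matching reads and total reads (for instance from mapping against an rRNA reference), the fractions and their spread can be summarized like this. The sample names and counts below are hypothetical.

```python
def rrna_fractions(counts):
    """Map {sample: (rrna_reads, total_reads)} to {sample: rRNA fraction}."""
    return {sample: rrna / total for sample, (rrna, total) in counts.items()}

# Hypothetical counts: (reads matching rRNA, total reads).
counts = {
    "sample_A": (50_000, 1_000_000),
    "sample_B": (400_000, 1_000_000),
}

fractions = rrna_fractions(counts)
spread = max(fractions.values()) - min(fractions.values())

for sample, frac in fractions.items():
    print(f"{sample}: {frac:.1%} rRNA")
print(f"spread between samples: {spread:.1%}")
```

A large spread, as in this toy case, would suggest the contamination differs enough between samples to be worth a closer look before continuing.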
If you are going to de novo assemble the data, then just make sure it does not contain any extraneous sequence that should not be there in the first place (e.g. adapters).