Question

Removing overrepresented sequences from RNA-seq data

0

5.2 years ago

anna ▴ 10

I have a lot of paired-end RNA-seq data of very good quality, but some of the files contain many overrepresented sequences that are not adapters. I ran BLAST on these sequences: some did not match anything, and others appear to be rRNA. I understand that opinions are divided; some people say it is better to remove overrepresented sequences, and others say there is no need to. This time I decided to remove them with cutadapt, because the overrepresented sequences vary from one file to another. But after removing them, the FastQC basic statistics of these files changed (sequence length is now 1-150) and NEW overrepresented sequences appeared (I was not expecting any beyond the initial ones). I suspect I made a mistake with cutadapt and would like to try Trimmomatic, but I cannot find an option in the manual that lets me specify the sequences I want to remove from a specific file (my impression is that Trimmomatic can only remove adapters that the software already recognizes). Can anyone give me advice on how to proceed with the (de novo) assembly?

cutadapt trimmomatic overrepresented sequences • 12k views

ADD COMMENT • link updated 22 months ago by LayneSadler ▴ 90 • written 5.2 years ago by anna ▴ 10

4

Personally, I would not remove these overrepresented sequences for the reasons @h.mon explained below. And the fact that you observed "new" overrepresented sequences after removing the original ones means that some sequences will always be overrepresented with respect to others, again because of the reasons explained below.
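To illustrate why "new" overrepresented sequences can appear, here is a minimal Python sketch. It assumes a simplified rule in which a sequence is flagged when it makes up more than 0.1% of all reads (the warning level given in the FastQC documentation); it is not FastQC's actual implementation, and the reads, names, and counts are made-up toy data.

```python
from collections import Counter

def overrepresented(reads, threshold=0.001):
    """Return {sequence: fraction} for sequences above `threshold` of all reads.

    Simplified stand-in for a FastQC-style check (assumption: 0.1% cutoff).
    """
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(reads)
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}

# Toy library (hypothetical): one very abundant sequence, one moderately
# abundant sequence, and many distinct background reads.
gene_a = "ACGTACGTACGTACGT"
gene_b = "TTGCAAGGCTAACCGT"
background = [f"unique_read_{i}" for i in range(9900)]
reads = [gene_a] * 200 + [gene_b] * 10 + background

before = overrepresented(reads)
# Remove every copy of the flagged sequence and check again.
after = overrepresented([r for r in reads if r != gene_a])

print(sorted(before))  # only gene_a is above the cutoff
print(sorted(after))   # gene_b now crosses the cutoff
```

Because the fraction is relative to the total number of reads, removing the most abundant sequence shrinks the total and pushes the next most abundant one over the cutoff.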

Having said that, if you really need to remove known/custom sequences from your FASTQ files and would like to use Trimmomatic for this, you would need to create a multi-FASTA file and refer to this file in the ILLUMINACLIP step when calling Trimmomatic:

java -jar trimmomatic-0.35.jar PE -phred33 ... ILLUMINACLIP:custom-fasta-file.fa:X:X:X ...

The example above assumes that your file, custom-fasta-file.fa, is placed under the adapters directory, which itself is within the original Trimmomatic-X.XX directory. Please remember that this is a crude workaround and would only work for sequences at the beginning (5') of your reads.
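A multi-FASTA file is just a series of ">name" header lines, each followed by its sequence. As a rough sketch, it could be generated in Python as below; the function name, record names, and sequences are hypothetical placeholders, to be replaced with the sequences FastQC reported for a given file.

```python
def to_multi_fasta(sequences):
    """Format {name: sequence} pairs as multi-FASTA text, one record each."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")     # header line
        lines.append(seq.upper())    # sequence line
    return "\n".join(lines) + "\n"

# Placeholder sequences (hypothetical), not real contaminants.
custom = {
    "overrep_1": "ACGTACGTACGTACGTACGTACGT",
    "overrep_2": "ttgacctgaggtcaattgacctga",
}

fasta_text = to_multi_fasta(custom)
print(fasta_text, end="")

# To save it under the file name used in the command above:
# from pathlib import Path
# Path("custom-fasta-file.fa").write_text(fasta_text)
```

Building one such file per FASTQ file would cover the case where the overrepresented sequences differ between files.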

ADD REPLY • link 5.2 years ago by Haci ▴ 730

0

Thank you so much for your help!

ADD REPLY • link 5.2 years ago by anna ▴ 10

3

As long as you remove adapters (and even that is not strictly necessary), you should be able to align your data and move forward. If you do have rRNA contamination, check whether it is severe and/or variable among samples to be sure that it is worth going forward with the analysis.

"Can anyone give me advice about what to do in order to proceed with the assembly?"

If you are going to de novo assemble the data, then just make sure it does not contain any extraneous sequence that should not be there in the first place (e.g. adapters).

ADD REPLY • link 5.2 years ago by GenoMax 151k

score 3 · Answer 1 · 2020-03-06

3

5.2 years ago

h.mon 35k

RNA-seq data will always contain overrepresented sequences, because certain genes are highly expressed and their reads will therefore be overrepresented. If you remove these sequences, you will be removing genes, and your assembly will be less complete and/or more fragmented. Except for adapters, one should not remove any sequences before assembly. You may perform digital normalization prior to assembly (this is the default in Trinity, for example) to reduce memory usage and run time.
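For intuition only, here is a toy Python sketch of the idea behind digital normalization: a read is kept only while the median abundance of its k-mers, among the reads kept so far, is below a cutoff. This follows the general median k-mer cutoff approach and is an assumption-laden simplification; Trinity's own in silico normalization is implemented differently, and the function names, `k`, `cutoff`, and reads below are illustrative choices, not values from any tool.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k=5, cutoff=3):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k, nothing to count
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to coverage
    return kept

# Toy data (hypothetical): a highly expressed transcript and a rare one.
abundant = "ACGTTGCAAGGCTA"
rare = "TTTTGGGGCCCCAA"
reads = [abundant] * 20 + [rare]

kept = digital_normalize(reads)
print(len(reads), "->", len(kept))  # 21 -> 4
```

The abundant transcript is thinned down to the cutoff rather than removed, and the rare one is kept untouched, which is why normalization reduces memory and run time without discarding genes.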

ADD COMMENT • link 5.2 years ago by h.mon 35k

0

Thanks a lot! Now I can move forward.

ADD REPLY • link 5.2 years ago by anna ▴ 10

0

Not so? I just ran a paired end sample with no overrepresented sequences flagged by FastQC

ADD REPLY • link 22 months ago by LayneSadler ▴ 90