MultiQC report shows high sequence duplication levels for RNA-seq data: does this indicate poor data quality, and if so, how do I fix it?
3.1 years ago
FantasticAI ▴ 60

Hi

I got the following MultiQC report for sequence duplication levels. As you can see, over 20% of reads are duplicated more than 1k times, and some more than 10k times. I think there might be quality issues with my data. Am I correct?

Thanks in advance

[Image: MultiQC sequence duplication levels plot]

rna-seq • 7.2k views

Not necessarily. You should analyze this data to see if you have a problem with PCR duplicates. Otherwise you may simply have some genes expressed at very high levels. Also see: https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/


How do I see whether I have a problem with PCR duplicates? Using IGV?


Don't remove duplicates in RNA-Seq data.

Duplicate removal is usually performed for variant calling but almost never for RNA-Seq.


That is one way. The very first plot in the link I posted shows what a sample with PCR duplicates would look like in IGV.

3.1 years ago

You should expect to see high levels of duplications in RNA-Seq experiments as the coverages will be highly unequal.

A highly expressed transcript may be present in tens of thousands of times more copies than a lowly expressed one. Transcripts that are in high abundance may produce a large number of identical reads.

There is no need to "fix it". Align the reads, then visualize the coverages in IGV to identify the source of the potentially high coverages.
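To make the point about unequal coverage concrete, here is a small simulation sketch. It is not how FastQC or MultiQC computes duplication; it just samples read start positions at random from hypothetical transcripts (the transcript length, read length, and read counts are all made-up assumptions) and shows that a skewed expression profile alone, with no PCR duplication at all, produces mostly duplicated reads.

```python
import random
from collections import Counter


def duplicated_read_fraction(reads_per_transcript, transcript_len=1000,
                             read_len=50, seed=0):
    """Fraction of simulated reads that share (transcript, start) with another read.

    Reads are drawn uniformly along each transcript, so any duplication here
    comes purely from sampling depth, not from PCR artifacts.
    """
    rng = random.Random(seed)
    n_starts = transcript_len - read_len + 1  # possible start positions
    counts = Counter()
    for transcript_id, n_reads in enumerate(reads_per_transcript):
        for _ in range(n_reads):
            counts[(transcript_id, rng.randrange(n_starts))] += 1
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total if total else 0.0


# Same total number of reads (10,000), two hypothetical expression profiles.
even_profile = [100] * 100            # every transcript equally expressed
skewed_profile = [9901] + [1] * 99    # one transcript dominates the library

print("even:  ", round(duplicated_read_fraction(even_profile), 3))
print("skewed:", round(duplicated_read_fraction(skewed_profile), 3))
```

The skewed profile ends up with a far higher duplicated fraction than the even one, which is the pattern the answer above describes for highly expressed genes.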


Thank you for the response. A follow-up question regarding the FastQC report concerns the failed adapter content check. I know that RNA-seq fragments vary in length, so it is reasonable that some adapter is detected. But as you can see from the following figure, the percentage of adapter content is very high. Should I worry about it?

[Image: FastQC adapter content plot]


"But as you can see from the following figure, the percentage of the adapter content is too high, should I worry about it?"

Yes and no. Most aligners will soft-clip these as they align so in that sense no. But if you are planning to do any de novo assembly work then you should remove them first.


I'm using STAR for the alignment. So, as you said, STAR will soft-clip the adapters during alignment, assuming STAR has such functionality?


By the way, I'm just doing a standard RNA-seq analysis to find differentially expressed genes.


Apply an adapter removal tool such as fastp, trimmomatic, or cutadapt.
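As a rough illustration of what that might look like for paired-end data, the sketch below only assembles a cutadapt command line from Python; it does not run anything. The file names are hypothetical, and the adapter passed in the example is just the first 13 bases of the adapter sequence quoted elsewhere in this thread, used for both mates as an assumption. Check with your sequencing facility or kit documentation for the actual adapters.

```python
def build_cutadapt_command(r1_in, r2_in, r1_out, r2_out,
                           adapter_r1, adapter_r2, min_len=20):
    """Assemble a paired-end cutadapt call as an argument list.

    -a / -A : 3' adapter to remove from read 1 / read 2
    -m      : minimum read length to keep after trimming
    -o / -p : output files for read 1 / read 2
    """
    return [
        "cutadapt",
        "-a", adapter_r1,
        "-A", adapter_r2,
        "-m", str(min_len),
        "-o", r1_out,
        "-p", r2_out,
        r1_in, r2_in,
    ]


# Hypothetical file names; the adapter prefix is an assumption (see above).
cmd = build_cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    adapter_r1="AGATCGGAAGAGC", adapter_r2="AGATCGGAAGAGC",
)
print(" ".join(cmd))

# To actually run it (requires cutadapt to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list keeps the arguments explicit, so it is easy to see exactly what will be trimmed before running it on real data.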


Thank you for the response. So there is indeed some adapter contamination in my data?


Well, the plot clearly shows adapters, so the answer is yes.

More importantly, the read length is quite affected as well. Many of your trimmed reads would be under 50 bp in length. That would be very short (how much this matters depends on the host genome: large genomes usually have many repetitive regions and are more sensitive to read length). Thus it would be prudent to also filter by read length and keep only reads above a certain length.

In general, when data is visibly affected you should preprocess and clean the reads, even if the aligner would technically be able to do so.
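The trim-then-filter idea can be sketched in a few lines of plain Python. This is a simplified illustration only: it matches the adapter exactly, whereas real trimmers such as cutadapt or fastp tolerate sequencing errors, and the 50-base cutoff and the `(name, sequence, quality)` record layout are assumptions made for the example.

```python
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # sequence quoted in this thread


def trim_adapter(seq, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter by exact matching (no mismatches allowed)."""
    # Case 1: the full adapter occurs somewhere in the read.
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]
    # Case 2: the read ends partway through the adapter, so the read's
    # suffix equals a prefix of the adapter. Try the longest overlap first.
    max_overlap = min(len(adapter) - 1, len(seq))
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:-overlap]
    return seq


def trim_and_filter(records, adapter=ADAPTER, min_len=50):
    """Trim each (name, seq, qual) record and drop reads shorter than min_len."""
    kept = []
    for name, seq, qual in records:
        trimmed = trim_adapter(seq, adapter)
        if len(trimmed) >= min_len:
            kept.append((name, trimmed, qual[:len(trimmed)]))
    return kept


# Three made-up reads: partial adapter at the end, short insert, no adapter.
seq1 = "ACGT" * 15 + ADAPTER[:20]
seq2 = "ACGT" * 5 + ADAPTER + "GGGG"
seq3 = "ACGT" * 20
records = [(n, s, "I" * len(s)) for n, s in
           [("read1", seq1), ("read2", seq2), ("read3", seq3)]]

for name, seq, qual in trim_and_filter(records):
    print(name, len(seq))
```

Here `read2` is discarded because only 20 bases of insert remain once the adapter is removed, which is the kind of short read the length filter is meant to catch.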


Thank you so much for the informative answer. I apologize for asking another silly question: cutadapt seems to require specifying the adapter sequence, so how do I know which adapter is in my paired-end RNA-seq data? Should I ask the sequencing facility about that? I also noticed that the detected overrepresented sequences include a TruSeq Adapter. I guess this is one way to know what kind of adapter I have in my dataset, right?


"I also noticed that the detected overrepresented sequence contains TruSeq Adapter, I guess this is one way to know what kind of adapter I have in my dataset right?"

That is correct.

You can also use these programs to identify them: Identify adapter sequences for trimming from Illumina paired end fastq files


The FastQC tool ships with a default list of the adapters it looks for:

https://github.com/s-andrews/FastQC/blob/master/Configuration/adapter_list.txt

To trim the universal adapter, one option is to create a FASTA file that looks like so:

>adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Explore the fastp tool that the post by GenoMax also mentions.

I don't like it when tools automatically cut adapters. It is usually better to understand what gets dropped and when. The fastp tool also has a utility that can identify adapter sequences.
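One way to see what would get dropped before trimming anything is to tally where the adapter first shows up in each read. The sketch below is a rough approximation of that idea, not FastQC's actual implementation: it looks for an exact match to the first 12 bases of the adapter shown above (the choice of a 12-base prefix is an assumption) and reports the cumulative fraction of reads containing it by each position.

```python
from collections import Counter

ADAPTER_PREFIX = "AGATCGGAAGAG"  # first 12 bases of the adapter shown above


def adapter_start_positions(reads, adapter=ADAPTER_PREFIX):
    """Count, per 0-based position, how many reads have their first adapter hit there."""
    positions = Counter()
    for seq in reads:
        idx = seq.find(adapter)
        if idx != -1:
            positions[idx] += 1
    return positions


def cumulative_adapter_fraction(reads, read_len, adapter=ADAPTER_PREFIX):
    """Fraction of reads in which the adapter has started at or before each position."""
    if not reads:
        return [0.0] * read_len
    positions = adapter_start_positions(reads, adapter)
    fractions = []
    running = 0
    for pos in range(read_len):
        running += positions.get(pos, 0)
        fractions.append(running / len(reads))
    return fractions


# Four made-up 60-base reads: two with adapter read-through, two without.
reads = [
    "C" * 30 + ADAPTER_PREFIX + "C" * 18,
    "C" * 48 + ADAPTER_PREFIX,
    "C" * 60,
    "C" * 60,
]
curve = cumulative_adapter_fraction(reads, read_len=60)
print(curve[29], curve[30], curve[48], curve[59])
```

A curve that climbs early in the read means many inserts are short, so trimming will leave many short reads behind.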


Thank you so much, that is really helpful. I wonder whether fastp is currently more powerful and more convenient to use than cutadapt? I know cutadapt is a fairly traditional tool that many people use frequently, whereas fastp seems to be relatively new.


bbduk/fastp/cutadapt do basic trimming (adapter or otherwise) equally well. It is simply a matter of which tool you become familiar with. Pick one and stick with it. That said, there may be special kits where a certain type of trimming/filtering is needed. Then you will need to pick the right (or kit-recommended) tool/options.
