RNA-seq QC - adapter content and high duplication
1
0
Entering edit mode
2.5 years ago
Simon ▴ 10

Hi, I am currently working on data obtained from an RNA-seq experiment, paired-end with 150bp read length. I performed some quality control by running FASTQC (version 0.11.5) on my fastq files.

I apologize that this may become a long post, but I usually want to include as much details as possible.

To begin with, the per base sequence quality is very good, and also the per sequence quality scores (figures below).

enter image description here enter image description here

However, the adapter content plot looks very strange (below). I have never seen this before when analyzing NGS data. As far as I know it is mostly common to find adapter contamination in the 3' end of reads, at least for the Illumina platform. Here on the other hand, given that I understand this plot correctly, most of the adapters is in the 5' end (actually, already at the second position about 80% of reads have seen adapter sequences) while there is a smaller increase in the accumulative adapter content moving towards the 3' end.

Also, the adapters identified by FASTQ are all labeled as "nextera transposase sequence" which is normally used in ATAC-seq experiments (which we also have done in the scope of the same project). I will get back to the guys preparing the libraries and see what they say about this.

enter image description here

These adapter contaminants are probably the reason for the failing of the other FASTQC modules.

Per Base Sequence Content: While this module typically always fails, I have never seen one as bad as this one. First off, there is about 85%-100% content of T at the first position of reads across all samples. Also, the first two samples (let's call them sample1 and sample2) show 85-90% G content across the entire second half of the reads. This is very clear when examining the plot for Sample 1 alone.

GC content: Accordingly, the GC content plot does not look very good, especially the "peak" with the highest points corresponding to sample1 and sample2.

Overrepresented sequences and duplication levels: And here again the worst sample seem to be sample 1 and 2.

enter image description here enter image description here enter image description here enter image description here enter image description here

Next, I decided to remove the adapter sequences with the bbduk tool using the following command (in a loop that takes the read pairs each time):

bbduk.sh in1=sample_R1.fq.gz" in2=sample_R2.fq.gz" out1=sample_R1_clean.fq.gz out2=sample_R2_clean.fq.gz ref=/.../bbmap/resources/adapters.fa threads=8 ktrim=l k=23 mink=11 hdist=1 qtrim=rl trimq=30 minlength=25 tpe tbo

I removed adapters from left side because according to the FASTQC output above most adapter contamination seems to be there. I also trimmed from both ends for quality < 30 and removed reads with a post-trimming length below 25bp. Through this process, between 25% to 50% of reads where removed from the samples.

So long story short, below is the same FASTQ plots as before but now on the trimmed data. No adapter sequences was detected, but the rest of the results still looks wierd. The Per Base Sequence Content is now somehow inverted, with close to 90% "G" the first half of the reads. And the GC content plot is of course not much better.

enter image description here enter image description here enter image description here enter image description here enter image description here enter image description here enter image description here

I hope that someone out there got an idea of what might have gone wrong here, and what I could do to solve it. Is my trimming approach wrong, or does it seem that the problem is on the sequencing?

Best regards / Simon

RNA-seq Control FASTQC Quality • 3.0k views
ADD COMMENT
0
Entering edit mode
2.5 years ago
GenoMax 147k

Don't take this the wrong way but something really strange seems to be happening here. Are these plain RNAseq samples (nothing fancy?). Is this mRNAseq or total RNAseq? You appear to have an incredible amount of adapter contamination, which may be explainable if these are degraded samples like FFPE to begin with. Which sequencer is this data from? Likely a 2 color one like NextSeq/NovaSeq. Most of those poly-G sequences are garbage and should be removed using ployG trim option in bbduk.sh. You probably have a small amount of usable data.

I suggest that you see these blog posts from authors of FastQC as a start:

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/ (though the bias in your case is not the same)
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/

Note: I have a feeling that these are single cell RNAseq libraries and not bulk RNAseq. If that is the case then FastQC is not going to provide much useful information. Libraries for single cell are very different. For 10x type tech: Read 1 only has 26-28 bp of useful info (cell barcodes + UMI). Read 2 is the actual RNA sequence read. It would not be appropriate to trim these files together with `bbduk.sh`.

ADD COMMENT
0
Entering edit mode

Hi GenoMax, thank you for your quick reply.

This is a plain bulk RNA-seq experiment (Novaseq S4 PE150). I got the information spreadsheet together with the data. RNA extracted from cultured cell lines.

My co-worker who did the library preparation is currently on holidays until next week. I will discuss this with him once he is back. The libraries were sent to an external company for sequencing, so we will probably also get back to them to hear their opinion.

I will have a look at the blog posts you linked to, thank you. Actually, I have read the first one before, and the main conclusion I draw from it is that we probably need to redo the library preparation.

I will also look into the polyG trim option in bbduk. Should I do this separately or can I just add it into the same command as the adapter removal and quality trimming?

ADD REPLY
0
Entering edit mode

we probably need to redo the library preparation.

See what additional information you can glean from your co-worker but it is possible that these libraries failed in some way. You should get some idea from the leftover data, if it is any good. You can add polyG trim options to your bbduk.sh command.

ADD REPLY
0
Entering edit mode

Ok, I will add polyG trim and see if downstream analyses makes any sense.

I just have some additional questions:

  1. From what I have understood, you generally expect adapter sequences at the 3'-end of reads. So how does it come that there seem to be high amount of adapter contamination already at the second position in my data?
  2. FASTQC identifies these adapters as "nextera transposase sequence", which I find strange. I run FASTQC on default settings, that is I didn't supply a custom list of adapters but relied on the build-in ones. Could it be that FASTQC have misinterpret or mislabeled the adapters it finds?
  3. When trimming to remove adapters I used ktrim=l in bbduk because much adapters were found at he left (5') side of the reads. According to the bbduk documentation, using ktrim=l, when an adapter sequence is found it is trimmed together with everything to the left of it. Could this be a problem if adapters are also found at the 3'-end (where they should be) because then the whole read will be removed?
ADD REPLY
0
Entering edit mode

You may have very short or no inserts (i.e. adapter dimers) in your data. Normally there should be no adapter on 5' end of reads in normal illumina libraries unless something non standard is being done.

Log from bbduk.sh should be instructive. You must be losing a large % of data post trimming.

ADD REPLY
0
Entering edit mode

Yes, between 30% and 60% of reads are removed from the samples. The worst is when trimming the ones I named "sample1" and "sample2" in the post above. The bbduk log for these looks like this:

Set threads to 8
maskMiddle was disabled because useShortKmers=true
0.035 seconds.
Initial:
Memory: max=6092m, total=6092m, free=6074m, used=18m

Added 217135 kmers; time:   0.081 seconds.
Memory: max=6092m, total=6092m, free=6067m, used=25m

Input is being processed as paired
Started output streams: 0.095 seconds.
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
Processing time:        193.322 seconds.

Input:                      216736642 reads         32510496300 bases.
QTrimmed:                   3841009 reads (1.77%)   135999358 bases (0.42%)
Polymer-trimmed:            177111967 reads (81.72%)    13191962661 bases (40.58%)
KTrimmed:                   202894294 reads (93.61%)    17201646478 bases (52.91%)
Trimmed by overlap:         4552653 reads (2.10%)   56284856 bases (0.17%)
Total Removed:              201373956 reads (92.91%)    30585893353 bases (94.08%)
Result:                     15362686 reads (7.09%)  1924602947 bases (5.92%)

Time:                           193.503 seconds.
Reads Processed:        216m    1120.07k reads/sec
Bases Processed:      32510m    168.01m bases/sec

So, only about 7% of data is left. I will remove these samples from further analysis. If the QC looks ok then I may be able to extract some information from it.

ADD REPLY
0
Entering edit mode

This is not good news. I hope you find an explanation of what went wrong. It would be difficult to trust that 7% data.

ADD REPLY

Login before adding your answer.

Traffic: 1696 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6