Hi, I am currently working on data obtained from an RNA-seq experiment, paired-end with 150bp read length. I performed some quality control by running FASTQC (version 0.11.5) on my fastq files.
I apologize that this may become a long post, but I usually want to include as much details as possible.
To begin with, the per base sequence quality is very good, and also the per sequence quality scores (figures below).
However, the adapter content plot looks very strange (below). I have never seen this before when analyzing NGS data. As far as I know it is mostly common to find adapter contamination in the 3' end of reads, at least for the Illumina platform. Here on the other hand, given that I understand this plot correctly, most of the adapters is in the 5' end (actually, already at the second position about 80% of reads have seen adapter sequences) while there is a smaller increase in the accumulative adapter content moving towards the 3' end.
Also, the adapters identified by FASTQ are all labeled as "nextera transposase sequence" which is normally used in ATAC-seq experiments (which we also have done in the scope of the same project). I will get back to the guys preparing the libraries and see what they say about this.
These adapter contaminants are probably the reason for the failing of the other FASTQC modules.
Per Base Sequence Content: While this module typically always fails, I have never seen one as bad as this one. First off, there is about 85%-100% content of T at the first position of reads across all samples. Also, the first two samples (let's call them sample1 and sample2) show 85-90% G content across the entire second half of the reads. This is very clear when examining the plot for Sample 1 alone.
GC content: Accordingly, the GC content plot does not look very good, especially the "peak" with the highest points corresponding to sample1 and sample2.
Overrepresented sequences and duplication levels: And here again the worst sample seem to be sample 1 and 2.
Next, I decided to remove the adapter sequences with the bbduk tool using the following command (in a loop that takes the read pairs each time):
bbduk.sh in1=sample_R1.fq.gz" in2=sample_R2.fq.gz" out1=sample_R1_clean.fq.gz out2=sample_R2_clean.fq.gz ref=/.../bbmap/resources/adapters.fa threads=8 ktrim=l k=23 mink=11 hdist=1 qtrim=rl trimq=30 minlength=25 tpe tbo
I removed adapters from left side because according to the FASTQC output above most adapter contamination seems to be there. I also trimmed from both ends for quality < 30 and removed reads with a post-trimming length below 25bp. Through this process, between 25% to 50% of reads where removed from the samples.
So long story short, below is the same FASTQ plots as before but now on the trimmed data. No adapter sequences was detected, but the rest of the results still looks wierd. The Per Base Sequence Content is now somehow inverted, with close to 90% "G" the first half of the reads. And the GC content plot is of course not much better.
I hope that someone out there got an idea of what might have gone wrong here, and what I could do to solve it. Is my trimming approach wrong, or does it seem that the problem is on the sequencing?
Best regards / Simon
Hi GenoMax, thank you for your quick reply.
This is a plain bulk RNA-seq experiment (Novaseq S4 PE150). I got the information spreadsheet together with the data. RNA extracted from cultured cell lines.
My co-worker who did the library preparation is currently on holidays until next week. I will discuss this with him once he is back. The libraries were sent to an external company for sequencing, so we will probably also get back to them to hear their opinion.
I will have a look at the blog posts you linked to, thank you. Actually, I have read the first one before, and the main conclusion I draw from it is that we probably need to redo the library preparation.
I will also look into the polyG trim option in bbduk. Should I do this separately or can I just add it into the same command as the adapter removal and quality trimming?
See what additional information you can glean from your co-worker but it is possible that these libraries failed in some way. You should get some idea from the leftover data, if it is any good. You can add polyG trim options to your
bbduk.sh
command.Ok, I will add polyG trim and see if downstream analyses makes any sense.
I just have some additional questions:
You may have very short or no inserts (i.e. adapter dimers) in your data. Normally there should be no adapter on 5' end of reads in normal illumina libraries unless something non standard is being done.
Log from
bbduk.sh
should be instructive. You must be losing a large % of data post trimming.Yes, between 30% and 60% of reads are removed from the samples. The worst is when trimming the ones I named "sample1" and "sample2" in the post above. The bbduk log for these looks like this:
So, only about 7% of data is left. I will remove these samples from further analysis. If the QC looks ok then I may be able to extract some information from it.
This is not good news. I hope you find an explanation of what went wrong. It would be difficult to trust that 7% data.