Fastqc for RNA-Seq data (Illumina 1.9)
4
4
Entering edit mode
8.6 years ago
debitboro ▴ 270

Hi,

I'm new in analyzing RNA-Seq data coming from Illumina plateform. As a first step, I tried to fastqc only one file, and I obtained this result. As you can see, I got failure in 6 modules. I think that the Sequence Duplication Levels module (despite the failure) is ok in RNA-Seq data. My question is to know if I have to trim the first 13 bp from the reads (Per base sequence content show the failure) or not ? And I need some advices from you about the kmer content failure.

Thanks

RNA-Seq Fastqc Kmer content • 12k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
8
Entering edit mode
8.6 years ago

Your data have very high levels of duplication and in the 10,000+ region. That is quite worrisome. In RNA-Seq duplication levels in the tens and hundreds are ok, 10K+ region is not ok. Your most duplicated sequence:

CTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGCGTT

is present 200K times and appears to match ribosomal DNA.

Hence to me it looks like your data has not been rRNA depleted most of the reads will map to rRNA. You may have to make do with just 7% of the data meaning about 2 million reads from the original 39 million. That may or may not be sufficient.

At this point worrying about quality filtering or kmer content is not all that relevant, that won't really make much difference.

ADD COMMENT
0
Entering edit mode

hi Istvan Albert,

thank you for your suggestions. I have performed the mapping of the 1M first reads to the reference human rRNA including 5S, 5.8S, 12S, 16S, 18S, and 28S, I got a result of 83% of mapping. I don't know what I'll do since the data I'm handling come from a NGS experience performed by another person.

thanks

ADD REPLY
0
Entering edit mode

Ouch! This confirms @Istvan's observation (though you have ~4M usable reads).

ADD REPLY
3
Entering edit mode
8.6 years ago
GenoMax 147k

Please do NOT trim the first bases. You may be throwing away perfectly good data.
See this blog post from Dr. Simon Andrews for more background on this. @decosterwouter: You may want to see the post as well.
This must be a stranded RNAseq library (which is indicated by the GC predominance).

Having a red "X" appear on FastQC module does not indicate an automatic failure of data. Simon had to decide on some reasonable intervals for judging the output of various modules and this "observation" becomes a side effect of those choices (I think those limits can be changed by a settings file, if I recall). You would want to consider what kind of data you are dealing with before deciding the actual failures part (other posts here are useful as well).

ADD COMMENT
2
Entering edit mode

The plot in that blog post and the explanation for it are confusing.

Note how the plot is binned after position 10. The first 9 measures are individual measures but the rest are averages binned by some window, not explained properly. The labels don't seem right either. What the heck is 14-15? Of course the line is smoother after it gets binned. That plot (with many others in FastQC) are unscientific IMO.

Random priming

ADD REPLY
0
Entering edit mode

This looks like the FastQC standard plot, never noticed it was binned >10 but it looks it is adding to the confusion.

ADD REPLY
0
Entering edit mode

FastQC does all sorts of weird smoothing. It does it on the GC% plots too. I really hate FastQC for this.

ADD REPLY
0
Entering edit mode

The binning can be turned-off on the command line. FastQC will plot individual cycles in that case. Plots get unwieldy for long runs so the default binning is used.
Can the plots be done better I am sure they can but they serve the basic qualitative purpose now. FastQC code is open source so perhaps someone here can improve that part.

ADD REPLY
0
Entering edit mode

Thanks for the link, will definitely read that. (And change my own trimming strategy!)

ADD REPLY
2
Entering edit mode
8.6 years ago
Michael 55k

You should not trim the first 12-13bp of your data, these are of good quality. I have seen a similar pattern in almost each RNA-seq sample. Per base sequence content seems to fail because of a slight bias in the reagents or protocol for a sequence composition in the first bases, such that are reads having a certain composition in the beginning are slightly enriched. Most of the tests FastQC runs are irrelevant for RNA-seq and mostly cause confusion. As you suggest the sequence duplication level is generally higher in RNA-seq. But the duplication level is quite high in your sample. This could be due to few very highly expressed genes, high content of ribosomal RNA, that is something you should check.

ADD COMMENT
0
Entering edit mode
8.6 years ago

I would suggest to use trimmomatic, which will also take care of the reads running into adapters. I would recommend trimming part of the beginning (~12), perform adapter clipping and then perform QC again. Perhaps removing low quality bases at the end will improve your result. The trimmomatic manual is very helpful. Veel succes met de analyse! http://www.usadellab.org/cms/?page=trimmomatic

ADD COMMENT

Login before adding your answer.

Traffic: 2714 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6