Question

FastQC of small RNA-seq


9.3 years ago

marcon ▴ 10

Hello,

Recently I received some small RNA-seq data to analyze. Since I have never worked with small RNA-seq data, I'm a bit lost with the FastQC quality control results before and after adapter trimming.

First, the per base sequence content is a mess. I'm not sure if that's normal in small RNA-seq, but it fails the test pretty hard.

Second, the GC content is very strange. Before trimming, the graph peaks at the same position as the theoretical distribution. After trimming, it shows two peaks, around 58% (the theoretical peak) and 78%. I have never seen this before.

Third is the sequence length distribution. Before adapter/quality trimming it's normal, with all sequences around 76 bp. After trimming, however, my sequences range from 20 to 76 bp, peaking around 20 and 33. In normal RNA-seq data I usually do not notice such a huge change in length distribution.

Finally, I have a lot of duplication and over-represented sequences even after adapter trimming (around 95% of the sequences had adapters).

From my research, I read that FastQC metrics are not very good for small RNA-seq, so I should not worry too much about them. But my question is: up to what point should I not worry? I would like some insight from people who work with small RNA-seq datasets. Thanks in advance.

fastQC miRNA sRNA rna-seq • 5.9k views

updated 2.7 years ago by Ram 45k • written 9.3 years ago by marcon ▴ 10


Hi Marcon,

Were you able to sort this out? I recently got some small RNA-seq data too. I saw duplication as well as overrepresented primer sequence. I am using miRDeep2 for the analysis, but before that I wanted to make sure that the data looks good.

I am looking at one sample after running cutadapt (and removing sequences shorter than 15 nt). Although there is a good peak around length 22-23, there is a smaller peak at 74-75 (should I remove this?). Per base quality drops after position 36, and the k-mer content has some overrepresented k-mers at positions 47-50 and 62-69.

Does anyone know what this means and what can be done? Please let me know!

Thanks in advance!! Mamta

9.1 years ago by datanerd ▴ 520

Ram · Answer 1 · 2016-01-09

FastQC reports the bases that match adapter sequences, which normally amounts to about 25 to 30 nucleotides of adapter per read. As your reads are 75 bases long, the adapter portion may be even longer. Fetch those adapter sequences and trim them using Trimmomatic (or another tool), and you have your sequences. As this is a small RNA library, you may want to keep only the sequences that are between 18 and 35 nt long, or something around that range; you will have a better idea of the right range if you prepared the library yourself. The problems you mentioned with duplication and GC content are generally common in small RNA libraries.
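
To make the trim-then-filter idea concrete, here is a minimal Python sketch. It is not what Trimmomatic or cutadapt do internally: it only looks for exact adapter matches (no mismatches, no quality trimming), and the adapter string, function names, and the 8-base minimum overlap are all assumptions for illustration. Use the adapter sequence from your own library prep kit.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)

# Placeholder 3' adapter (assumption): replace with the adapter from your kit.
EXAMPLE_ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input or truncated record
            return
        yield record[0][1:], record[1], record[3]


def trim_3p_adapter(seq: str, qual: str, adapter: str,
                    min_overlap: int = 8) -> Tuple[str, str]:
    """Remove an exactly matching 3' adapter and everything after it."""
    idx = seq.find(adapter)
    if idx != -1:  # full adapter found inside the read
        return seq[:idx], qual[:idx]
    # Otherwise look for a partial adapter (a prefix) at the very end of the read.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k], qual[:-k]
    return seq, qual  # no adapter detected, read left untouched


def trim_and_filter(reads: Iterable[Read], adapter: str,
                    min_len: int = 18, max_len: int = 35) -> Iterator[Read]:
    """Trim the adapter, then keep only reads within [min_len, max_len]."""
    for name, seq, qual in reads:
        seq, qual = trim_3p_adapter(seq, qual, adapter)
        if min_len <= len(seq) <= max_len:
            yield name, seq, qual


def length_histogram(reads: Iterable[Read]) -> Counter:
    """Count reads per length, comparable to FastQC's length distribution."""
    return Counter(len(seq) for _, seq, _ in reads)


if __name__ == "__main__":
    insert = "TGAGGTAGTAGGTTGTATAGTT"  # a 22 nt example insert
    seq = insert + EXAMPLE_ADAPTER + "A" * 10
    demo = ["@demo_read", seq, "+", "I" * len(seq)]
    kept = list(trim_and_filter(parse_fastq(demo), EXAMPLE_ADAPTER))
    print(length_histogram(kept))
```

With a filter like this, reads in which no adapter was found stay at full length and fall outside the 18-35 nt window, so they are dropped along with very short fragments. For real data, a dedicated trimmer that tolerates sequencing errors in the adapter is the safer choice.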