Hello,
Recently I received some small RNA-seq data to analyze. Since I have never worked with small RNA-seq data, I'm a bit lost with the FastQC quality control results before and after adaptor trimming.
First, the per base sequence content is a mess. I'm not sure whether this is normal for small RNA-seq, but it fails the test pretty hard.
Second, the GC content is very strange. Before trimming, the graph peaks at the same position as the theoretical distribution. After trimming, it shows two peaks, around 58% (the theoretical peak) and 78%. I have never seen this before.
Third is the sequence length distribution. Before adaptor/quality trimming it is normal, with all sequences around 76 bp. After trimming, however, my sequences range from 20 to 76 bp, peaking around 20 and 33. In normal RNA-seq data I usually do not see such a large change in the length distribution.
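To see which reads make up the second GC peak, one option is to tabulate the mean GC content per read length in the trimmed file. Below is a minimal Python sketch, assuming a standard four-line FASTQ. The function names and the file name `trimmed.fastq.gz` are placeholders for illustration, not part of FastQC or any trimming tool.

```python
import gzip
from collections import defaultdict


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a plain or gzipped FASTQ file.

    Assumes the usual four-line-per-record layout (no wrapped sequences).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def gc_by_length(seqs):
    """Return {read_length: (read_count, mean_gc_percent)}, sorted by length."""
    counts = defaultdict(int)
    gc_sum = defaultdict(float)
    for seq in seqs:
        if not seq:  # skip empty reads left over after trimming
            continue
        n = len(seq)
        gc = sum(1 for base in seq.upper() if base in "GC")
        counts[n] += 1
        gc_sum[n] += 100.0 * gc / n
    return {n: (counts[n], gc_sum[n] / counts[n]) for n in sorted(counts)}


# Example usage (hypothetical file name):
# for length, (count, mean_gc) in gc_by_length(
#         read_fastq_sequences("trimmed.fastq.gz")).items():
#     print(length, count, round(mean_gc, 1))
```

If the reads around one of the length peaks carry most of the high-GC signal, that would suggest the two GC peaks correspond to two different populations of reads rather than to a general quality problem.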
Finally, I have a lot of duplication and over-represented sequences even after adaptor trimming (around 95% of the sequences had adaptors).
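One way to look at what the duplicated reads actually are is to count identical sequences in the trimmed file and list the most frequent ones. Below is a minimal Python sketch, assuming a standard four-line FASTQ. The function names and the example file name are hypothetical.

```python
from collections import Counter


def sequences_from_fastq(handle):
    """Yield sequence lines from an open four-line-per-record FASTQ handle."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def top_sequences(seqs, n=10):
    """Return the n most frequent sequences as (sequence, count, percent_of_reads)."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return [(seq, c, 100.0 * c / total) for seq, c in counts.most_common(n)]


# Example usage (hypothetical file name):
# with open("trimmed.fastq") as fh:
#     for seq, count, pct in top_sequences(sequences_from_fastq(fh)):
#         print(f"{pct:5.1f}%  {count:>8}  {seq}")
```

The top sequences could then be compared against the adaptor/primer sequences and against known small RNA sequences to tell leftover adaptor from genuinely abundant small RNAs.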
From my research, I read that FastQC metrics are not very good for small RNA-seq, so I should not worry too much about them. But my question is: up to what point should I not worry? I would like some insight from people who work with small RNA-seq datasets. Thanks in advance.
Hi Marcon,
Were you able to sort this out? I recently got some small RNA-seq data too. I saw that there was duplication, and that primer sequence was over-represented. I am using miRDeep2 for the analysis, but before that I wanted to make sure that the data look good.
I am looking at one sample, trimmed with cutadapt (removing sequences shorter than 15). Although there is a good peak at a length of 22-23, there is a smaller peak at 74-75 (should I remove this?). Per base quality drops after position 36, and the k-mer content shows some over-represented k-mers at positions 47-50 and 62-69.
If anyone knows what this means and what can be done, please let me know!
Thanks in advance!! Mamta