9.2 years ago
Niek De Klein
I have 8 PRO-seq samples that all have very similar k-mer plots (one of them: http://pasteboard.co/xuO4V7O.png). They show very high enrichment at the start of the reads.
The adapter used is TGGAATTCTCGGGTGCCAAGG, which has been removed using Cutadapt. I don't know how to proceed from here to find where this k-mer enrichment comes from. Are there any tools available to dig deeper into this, or do you have any suggestions as to where these enriched k-mers could come from?
I just realized that I have been reading the plot incorrectly. It seems only the first 4 bases are enriched, not the complete k-mers.
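One way to check this directly, independent of any QC tool, is to tally the first 4 bases of every read. Below is a minimal sketch, assuming a plain 4-line-per-record FASTQ file; the function names `prefix_counts` and `prefix_fractions` are made up for illustration and are not part of FastQC or fastqp.

```python
from collections import Counter
from itertools import islice


def prefix_counts(fastq_lines, k=4):
    """Count the first k bases of each read in an iterable of FASTQ lines.

    Assumes the simple 4-line FASTQ layout (header, sequence, '+', quality),
    so the sequence is the 2nd line of every record.
    """
    counts = Counter()
    for seq in islice(fastq_lines, 1, None, 4):
        seq = seq.strip()
        if len(seq) >= k:  # skip reads trimmed shorter than k
            counts[seq[:k]] += 1
    return counts


def prefix_fractions(counts):
    """Turn raw prefix counts into fractions of all counted reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prefix: n / total for prefix, n in counts.most_common()}
```

To use it, open the trimmed FASTQ file and pass the file handle to `prefix_counts`, then inspect the top entries of `prefix_fractions`. If a handful of 4-mers account for a large share of the reads, the enrichment is real rather than a scaling effect of the plot.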
Look at the frequency counts in the table under the k-mer plots. That will tell you whether the enrichments are actually common in the data. Even low counts will show up in the plot, because it is scaled to 100%, but they can in fact be ignored.
This is exactly one of the reasons I made https://github.com/mdshw5/fastqp. K-mer plots are much more interpretable when you can see the absolute fractions and background distribution on the same graph.
Looks nifty!
Looks more informative than FastQC, I'll try to use this. Thanks.
Most of them have very low p-values. Some have counts of ~200 when 30 is expected, but others have 1000+ when 40 is expected. The obs/exp ratio lies around 30-40, with some at 50. I think this is high and shows that the enrichment is common, but please correct me if I'm wrong.
It all depends on how many reads you start with. For 10 million reads, ten thousand is just 0.1 percent, which is hardly worth looking into.
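To make that arithmetic concrete, here is a tiny sketch. It assumes, as a simplification, that each counted k-mer occurrence comes from a different read; the helper name `kmer_read_fraction` is made up for illustration.

```python
def kmer_read_fraction(kmer_count, total_reads):
    """Fraction of reads carrying the k-mer, assuming one occurrence per read."""
    return kmer_count / total_reads


# 10,000 occurrences in 10 million reads
print(f"{kmer_read_fraction(10_000, 10_000_000):.2%}")  # 0.10%
```

The same count of 1000+ would matter far more in a library of a few hundred thousand reads, so it is worth plugging in the actual read total per sample.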
Is PRO-seq RNA-seq from the 3' end? Is there no nucleic acid fragmentation involved?
No nucleic acid fragmentation. It is from the 5' end, but FastQC is run after removing adapters, taking the reverse complement, and removing rRNA binding reads and reads mapping to repeat regions.