Hello, I would like to know whether the AT content of RNA-Seq data plays any role in determining whether our data was contaminated. If so, how would we detect it? Also, I would like to know a way to calculate it. I am new to RNA-Seq, so any help would be appreciated. Thanks.
Hi, thanks for your answer. The alignment rate is only around 30%. My next step was to run BLAST to check it. Do you know a way to check the AT content, or should I just count the A and T bases in my assembled file?
FastQC reports over-represented sequences, which can include Illumina library sequences (if you ended up sequencing a lot of adapter, for example) as well as polyA/polyT sequences. It also reports overall GC content, from which the AT content follows as 100% minus the GC percentage (ignoring N calls).
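If you do want to count the bases in the assembled file directly, here is a minimal sketch in plain Python. It assumes a FASTA input, ignores N and other ambiguous bases, and counts U together with T; the function name at_content and the two toy contigs are made up for illustration and are not part of FastQC or any other tool.

```python
from io import StringIO


def at_content(fasta_lines):
    """Return the AT fraction over all sequences in an iterable of FASTA lines.

    Header lines (starting with '>') are skipped. Only A, C, G, T (and U)
    are counted, so N and other ambiguity codes do not affect the result.
    """
    at = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        seq = line.upper()
        a_t = seq.count("A") + seq.count("T") + seq.count("U")
        g_c = seq.count("G") + seq.count("C")
        at += a_t
        total += a_t + g_c
    return at / total if total else 0.0


# Toy input standing in for an assembled FASTA file (hypothetical contigs).
example = StringIO(">contig1\nATGCATGC\n>contig2\nAATTNNGC\n")
print(f"AT content: {at_content(example):.1%}")
```

For a real file you could pass an open file handle instead of the StringIO object, for example with open("assembly.fasta") as fh: at_content(fh), where the file name is a placeholder.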
FastQ Screen can also help identify the origin of some sequences.
With a 30% alignment rate, you may have to further reduce the number of unaligned reads before a de novo assembly (or run multiple assemblies on different subsamples of reads, since some subsamples may happen to assemble regions that are more informative about potential contamination). However, that does sound like a low alignment rate.
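As a rough sketch of the subsampling idea, the snippet below draws several random subsets of reads from a FASTQ stream using reservoir sampling, one subset per random seed. It assumes single-end reads stored as standard 4-line FASTQ records; for paired-end data the same reads would have to be selected from both files. The names read_fastq and subsample, the subset size, and the toy reads are all hypothetical.

```python
import random
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def subsample(records, k, seed):
    """Reservoir-sample up to k records in a single pass over the input."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


# Toy FASTQ standing in for the file of unaligned reads (hypothetical data).
toy_fastq = "".join(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n" for i in range(10))

# One subset per seed; each subset could be written out and assembled separately.
for seed in (1, 2, 3):
    subset = subsample(read_fastq(StringIO(toy_fastq)), k=4, seed=seed)
    print(seed, [rec[0] for rec in subset])
```

Reservoir sampling is used here only because it avoids loading the whole file into memory; any other uniform random sampling of reads would serve the same purpose.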
Thanks, I will try that. Also, can you tell me what the optimal AT content percentage would be for good human transcriptome data?
I am not sure about the optimal AT percentage, but I would usually expect an alignment rate greater than 90% (and I think the polyA/polyT over-represented sequences will probably each be below 1%).