Hi
I have Illumina mRNA-seq samples where, seemingly because of low RINs (2-4) in a subset of them, I am getting very widely varying mapping rates (15%-70%) and therefore widely varying counts per sample (e.g. 8,000,000 vs 40,000,000 mapped reads). I also can't really use RIN or mapping rate as a covariate because they are heavily confounded with a group of interest. At the moment I am looking at excluding another 20 low-mapping-rate samples.
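For context, this is roughly the kind of check that shows the confounding (a sketch only; "samples.csv", "group" and "mapping_rate" are placeholder names for my sample sheet and its columns):

```r
## Quick look at how confounded mapping rate is with the group of interest
## (placeholder sample sheet and column names)
coldata <- read.csv("samples.csv")

boxplot(mapping_rate ~ group, data = coldata,
        ylab = "Salmon mapping rate (%)")

## simple one-way test of the association between mapping rate and group
summary(aov(mapping_rate ~ group, data = coldata))
```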
Is there a preferred way of analyzing this type of data? If I do the usual VST through DESeq2, I get a cluster of samples with irregularly high expression of a lot of genes; these are also the samples with low overall counts, presumably because of what I describe above. I was wondering whether quantile normalisation would help, since it uses rankings to make the samples more comparable, and this could be the kind of extreme situation where it helps... Are there any other ideas?
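Concretely, the quantile idea would be something like this (a sketch only; `dds` is the DESeqDataSet built from the Salmon counts, "group" is a placeholder for the condition column, and limma is used here just for its quantile normalisation helper):

```r
library(DESeq2)
library(limma)

vsd <- vst(dds, blind = TRUE)        # the usual variance-stabilising transform
mat <- assay(vsd)                    # genes x samples matrix of VST values

## force every sample onto the same distribution via rank-based quantile
## normalisation
mat_qn <- normalizeBetweenArrays(mat, method = "quantile")

## swap the quantile-normalised values back in and re-check the clustering
assay(vsd) <- mat_qn
plotPCA(vsd, intgroup = "group")
```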
I also used Salmon to quantify the data, with the --gcBias and --validateMappings flags. Reads are 150bp PE. If I do not use GC bias correction and validate mappings, the mapping rates go up by about 10%, but I suspect the quality of those mappings is reduced, so I am currently using the data generated with these flags.
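For completeness, the Salmon quantifications go into DESeq2 via tximport, roughly like this (a sketch; the directory layout, sample sheet and "group" design variable are placeholders):

```r
library(tximport)
library(DESeq2)

coldata <- read.csv("samples.csv")
files <- file.path("salmon_quants", coldata$sample, "quant.sf")
names(files) <- coldata$sample

tx2gene <- read.csv("tx2gene.csv")   # transcript-to-gene map for this annotation

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ group)
```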
Thanks,
Chris
And if you do run gc bias correction?
I'm not sure whether it is --validateMappings or --gcBias that is lowering the mapping rates, but those flags seem to be generally recommended and I have used both. I also tried trimming the reads to 75bp and mapping single-end reads only, but neither increased the mapping rates. For some reason Illumina seem to think shorter (50bp-75bp) single-end reads are somewhat preferable for transcriptome mRNA quantification, because quantification doesn't require reads to span splice junctions (http://emea.support.illumina.com/bulletins/2017/04/considerations-for-rna-seq-read-length-and-coverage-.html).
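In case it helps, this is roughly how I'm pulling out per-sample mapping rates to compare the flag combinations (a sketch; I'm assuming the counts in aux_info/meta_info.json are named num_mapped and num_processed, which may differ between Salmon versions, and the run directory names are placeholders):

```r
library(jsonlite)

## read the mapping rate for each sample out of one set of Salmon output dirs
get_mapping_rates <- function(base_dir, samples) {
  sapply(samples, function(s) {
    info <- fromJSON(file.path(base_dir, s, "aux_info", "meta_info.json"))
    100 * info$num_mapped / info$num_processed
  })
}

samples <- read.csv("samples.csv")$sample
rates <- data.frame(
  gcbias_vm = get_mapping_rates("salmon_gcbias_vm", samples),  # both flags on
  default   = get_mapping_rates("salmon_default", samples)     # neither flag
)
summary(rates$gcbias_vm - rates$default)
```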