Hi Everyone,
I am aligning raw RNA-seq data from mouse samples for downstream analysis of differentially expressed genes. I would like to do this entirely based on UCSC data, to avoid mix-ups in the nomenclature and to be able to view bedGraph files in the UCSC Genome Browser later on. I used HISAT2 for the alignment, with default parameters and the mm10 genome index provided by HISAT2.
So here is my question: which genome annotation should I use for assembling/quantifying the expressed genes (StringTie)?
I have retrieved the RefSeq (all) data, as well as the UCSC Genes (knownGene) data, as GTF files from the UCSC Table Browser, but I get extremely high rates of ambiguous gene mappings with both of these (55-65%, based on analysis with QoRTs). I have previously gone through the same process using the Ensembl GRCm38 genome index provided by HISAT2 and the corresponding GRCm38.91 annotation from the Ensembl website, which yielded only around 10% ambiguous mappings.
Can anyone tell me which annotation people generally use for UCSC genes? Or is this an issue with the UCSC genome index from HISAT2?
I'd appreciate any feedback on this.
Thank you!
Thomas
When you say you previously went through the same process with the Ensembl genome and annotation, do you mean the same dataset or a different one? The problem might be your data rather than the annotations. Did you run this problematic dataset with the Ensembl genome + annotation?
Sorry, I should have been more explicit there. Yes, I have run the same data set with Ensembl and got around 90% unique gene assignments. I should also mention that the actual alignment rate was virtually identical between the Ensembl and UCSC HISAT2 alignments, ranging from 75% to 88%. So I think the problem is probably with the annotation rather than the data set or the alignment, but I'm not sure.
More questions, just to rule out some other possibilities: are you using a stranded library prep protocol? When running QoRTs, are you passing the --stranded flag?
Could you paste some lines of your UCSC GTF here (have you sorted it by position)? Maybe it has redundant annotations, e.g. the same gene under different names. How did you create the GTF from UCSC?
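If it helps, here is a rough way to check for that kind of redundancy before pasting anything. As far as I know, GTF files exported from the Table Browser tend to repeat the transcript ID in the gene_id attribute, so every transcript is counted as its own "gene" and overlapping isoforms look ambiguous to a counting tool. This is only a sketch: the function names and the two example records are made up for illustration, and I am assuming a standard 9-column, tab-separated GTF.

```python
import re

# Matches attribute pairs such as: gene_id "NM_000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def summarize_gtf_ids(lines):
    """Count records, distinct gene/transcript IDs, and records where both IDs match."""
    genes, transcripts = set(), set()
    records = same = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        records += 1
        genes.add(gene_id)
        transcripts.add(transcript_id)
        if gene_id == transcript_id:
            same += 1
    return {
        "records": records,
        "genes": len(genes),
        "transcripts": len(transcripts),
        "gene_id_equals_transcript_id": same,
    }


# Made-up records in the style I would expect from a Table Browser export.
example_lines = [
    'chr1\tmm10_refGene\texon\t100\t200\t0.000000\t+\t.\tgene_id "NM_000001"; transcript_id "NM_000001";\n',
    'chr1\tmm10_refGene\texon\t150\t300\t0.000000\t+\t.\tgene_id "NM_000002"; transcript_id "NM_000002";\n',
]

if __name__ == "__main__":
    print(summarize_gtf_ids(example_lines))
```

To run it on a real file, pass an open file handle instead of the example list, e.g. summarize_gtf_ids(open("your.gtf")). If nearly every record has gene_id equal to transcript_id, or the number of genes equals the number of transcripts, the annotation is probably the source of the ambiguous assignments.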
I get updated annotations from UCSC using the procedures from this GenomeWiki page. In general, I follow the "Use genePredToGtf with a downloaded genePred table" instructions. I have never had this problem of ambiguous reads, but the organisms I work with have more sparsely (meaning worse) annotated genomes.
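Before converting, it can be worth checking that the downloaded table really groups transcripts under gene names. Below is a small sketch for that, under some assumptions on my part: that the table is in extended genePred layout with a leading bin column (as I believe refGene.txt is), that the gene name sits in the name2 column, and that genePredToGtf uses that column for gene_id. The function name and the example rows are hypothetical.

```python
from collections import defaultdict


def transcripts_per_gene(lines, has_bin=True):
    """Group transcript names (name) by gene name (name2) from genePredExt rows.

    Assumed column order after the optional bin column:
    name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
    exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat, exonFrames
    """
    offset = 1 if has_bin else 0
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < offset + 12:
            continue  # row too short to contain name2
        name = cols[offset]         # transcript identifier
        name2 = cols[offset + 11]   # gene name in the extended format
        groups[name2].append(name)
    return dict(groups)


# Made-up rows: two transcripts of one hypothetical gene, one of another.
example_rows = [
    "585\tNM_000001\tchr1\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,\t0\tGeneA\tcmpl\tcmpl\t0,1,\n",
    "585\tNM_000002\tchr1\t+\t100\t600\t150\t550\t2\t100,300,\t200,600,\t0\tGeneA\tcmpl\tcmpl\t0,1,\n",
    "586\tNM_000003\tchr2\t-\t900\t1200\t950\t1150\t1\t900,\t1200,\t0\tGeneB\tcmpl\tcmpl\t0,\n",
]

if __name__ == "__main__":
    for gene, names in transcripts_per_gene(example_rows).items():
        print(gene, len(names), ",".join(names))
```

If several transcripts show up under the same gene name here, the table carries the grouping that the Table Browser GTF seems to lose, and a GTF made from it should give far fewer ambiguous assignments.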