After re-estimating my transcripts using stringtiemerge with -eB option, I want to do differential expression analysis at the gene level. For that, I tried Ballgown as well as DESeq2 (after getting raw counts using prepDE3). I got the results but my problem is that I only got gene names or gene ids. Use of either Ballgown or DESeq2 leads to lose of genomic features of the genes such as chr, start, end, strand. It is particularly challenging to find the correct coordinates from the original gtf file since multiple transcripts from a gene may or may not be merged into a single gene. Is there a way to have differential GENE expression while retaining their genomic information?
Why don't you just run your ID column from the expression table though Ensembl Biomart and (re)download the chromosomal coordinates to combine them? You can of course also extract them from the gtf using only the rows with gene in the third column.
Hi, Thanks for your reply. What you suggested makes complete sense. However there are two issues: 1) Stringtie, while assembling transcripts, gives its own names to "novel" genes; something like MSTRNG.31. Due to this I wont be able to compare this list with Ensembl. 2) I could go back to the gtf but as you see below multiple transcripts from a gene are consolidated into one or more than one "gene". For example, five transcripts in the gtf file belonging to gene "MSTRNG.31" were consolidated to three genes in the gene level expression file:
Original gtf file:
Gene level differential expression file:
Thanks for providing the example, this clarifies a lot. In that case, I would try to obtain the sought after coordinates from the gtf file itself.
Approach 1:
My first attempt would be the tool
gtfToGenePred
from UCSC/Kent:This converts your .gtf to a
.GenePred
and a.infoFile
. The.infoFile
should contain both, your custom IDs and the reference IDs, such that you can replace them with awk in the.GenePred
.The
.GenePred
can be converted back to a.gtf
withGenePredToGtf
. Use this new .gtf for getting the raw counts and repeat your analysis.PS: The
.GenePred
can be further processed withgenePredToBigGenePred
andbedToBigBed
to create an annotation for the UCSC Genome Browser.Approach 2:
Alternatively, have a look at
gffcompare
from CCB:Among the outputs should be a
.loci
.refmap
and.tracking
file. Maybe the.loci
file is already what you are looking for, but the other two should be useable to match and replace the IDs accordingly?