Question

How to retain genomic coordinates (chr, start, end, strand) of genes when going from StringTie to Ballgown or DESeq2?

0

Entering edit mode

2.4 years ago

akainth ▴ 20

After re-estimating my transcripts using stringtiemerge with -eB option, I want to do differential expression analysis at the gene level. For that, I tried Ballgown as well as DESeq2 (after getting raw counts using prepDE3). I got the results but my problem is that I only got gene names or gene ids. Use of either Ballgown or DESeq2 leads to lose of genomic features of the genes such as chr, start, end, strand. It is particularly challenging to find the correct coordinates from the original gtf file since multiple transcripts from a gene may or may not be merged into a single gene. Is there a way to have differential GENE expression while retaining their genomic information?

DESeq2 stringtie ballgown RNA-seq • 1.4k views

ADD COMMENT • link updated 2.4 years ago by Matthias Zepper 5.0k • written 2.4 years ago by akainth ▴ 20

0

Entering edit mode

Why don't you just run your ID column from the expression table though Ensembl Biomart and (re)download the chromosomal coordinates to combine them? You can of course also extract them from the gtf using only the rows with gene in the third column.

ADD REPLY • link 2.4 years ago by Matthias Zepper 5.0k

0

Entering edit mode

Hi, Thanks for your reply. What you suggested makes complete sense. However there are two issues: 1) Stringtie, while assembling transcripts, gives its own names to "novel" genes; something like MSTRNG.31. Due to this I wont be able to compare this list with Ensembl. 2) I could go back to the gtf but as you see below multiple transcripts from a gene are consolidated into one or more than one "gene". For example, five transcripts in the gtf file belonging to gene "MSTRNG.31" were consolidated to three genes in the gene level expression file:

Original gtf file:

chr1    StringTie   transcript  181839  200223  1000    -   .   gene_id "MSTRG.31"; transcript_id "MSTRG.31.1";

chr1    StringTie   transcript  184875  200223  1000    -   .   gene_id "MSTRG.31"; transcript_id "MSTRG.31.2";

chr1    StringTie   transcript  185217  200223  1000    -   .   gene_id "MSTRG.31"; transcript_id "ENST00000623083.4"; gene_name "WASH9P"; ref_gene_id "ENSG00000279457.4";

chr1    StringTie   exon    187891  188129  1000    -   .   gene_id "MSTRG.31"; transcript_id "ENST00000612080.1"; exon_number "1"; gene_name "MIR6859-2"; ref_gene_id "ENSG00000273874.1";

chr1    StringTie   transcript  188411  200223  1000    -   .   gene_id "MSTRG.31"; transcript_id "MSTRG.31.5";

Gene level differential expression file:

"MSTRG.31",4556.83430173884,1.45722313220787,0.0823777844496496,17.6895159531578,5.05074171036024e-70,6.88025580006988e-69
"MSTRG.31|WASH9P",288.56634865369,-2.68597317476915,0.526713824343825,-5.09949245800661,3.40565438691135e-07,8.59425585930002e-07
"MSTRG.31|MIR6859-2",19.7950556378761,2.74898896956633,0.476972714033955,5.76340928670111,8.24315673184026e-09,2.29820843965264e-08

ADD REPLY • link 2.4 years ago by akainth ▴ 20

0

Entering edit mode

Thanks for providing the example, this clarifies a lot. In that case, I would try to obtain the sought after coordinates from the gtf file itself.

Approach 1:

My first attempt would be the tool gtfToGenePred from UCSC/Kent:

gtfToGenePred -genePredExt -infoOut=out.infoFile YourGTF.gtf out.GenePred

This converts your .gtf to a .GenePred and a .infoFile. The .infoFile should contain both, your custom IDs and the reference IDs, such that you can replace them with awk in the .GenePred.

The .GenePred can be converted back to a .gtf with GenePredToGtf. Use this new .gtf for getting the raw counts and repeat your analysis.

PS: The .GenePred can be further processed with genePredToBigGenePred and bedToBigBed to create an annotation for the UCSC Genome Browser.

Approach 2:

Alternatively, have a look at gffcompare from CCB:

gffcompare -R -A -M -r ensembl_reference.gtf -s reference.fa -o out YourGTF.gtf

Among the outputs should be a .loci .refmap and .tracking file. Maybe the .loci file is already what you are looking for, but the other two should be useable to match and replace the IDs accordingly?

ADD REPLY • link 2.4 years ago by Matthias Zepper 5.0k