Question

transcripts missing from tx2gene

18 months ago

dylannicoembros • 0

I am using tximport to prepare quant.sf files generated by Salmon for a DESeq2 differential expression analysis. However, I get a message telling me that some transcripts are missing from tx2gene. I guess the problem is the tx2gene table, which I don't know how to create.

To create the quant files I used a pre-computed index from this link, as suggested in the Salmon documentation, selecting salmon_sa_index:default.

http://refgenomes.databio.org/v3/genomes/splash/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

How can I find out which reference transcriptome was used for the pre-computed index? With that information, I guess I can create my own tx2gene table and hope the missing transcripts disappear.

Currently, I am using this table:

https://github.com/hbctraining/DGE_workshop_salmon/raw/master/data/tx2gene_grch38_ens94.txt

R transcriptome DE DESeq2 • 1.3k views

score 2 · Accepted Answer · 2023-07-31

How can I know the reference transcriptome used in the pre-computed index?

You can download the transcriptome FASTA archive (FASTA, .fai index and chrom.sizes) used for that index here: http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta_txome?tag=default

This should get you the table you need (transcript ID, gene ID, gene symbol):

$ grep "^>E" 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.fa | sed 's/>//' | awk -F '[ :]' '{OFS="\t"}{print $1,$10,$16}' | head -5
ENST00000631435.1       ENSG00000282253.1       TRBD1
ENST00000415118.1       ENSG00000223997.1       TRDD1
ENST00000448914.1       ENSG00000228985.1       TRDD3
ENST00000434970.2       ENSG00000237235.2       TRDD2
ENST00000632684.1       ENSG00000282431.1       TRBD1
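
Before rebuilding anything, it may help to see exactly which IDs tximport is complaining about. A minimal sketch, assuming a tab-separated tx2gene file with the transcript ID in column 1 and no header line (what the awk command above gives you when redirected to a file). The function name missing_from_tx2gene and the file names in the usage comment are made up for illustration:

```shell
# Print transcript IDs that occur in a Salmon quant.sf but not in a tx2gene table.
# Assumptions: tx2gene is tab-separated, transcript ID in column 1, no header;
# quant.sf has Salmon's usual header line with the transcript name in column 1.
missing_from_tx2gene() {
    quant="$1"
    tx2gene="$2"
    awk -F '\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # first file: collect tx2gene IDs
        FNR > 1 && !($1 in known) { print $1 }        # second file: skip header, report unknown IDs
    ' "$tx2gene" "$quant"
}

# Hypothetical usage:
#   missing_from_tx2gene quant.sf tx2gene.tsv | head
```

If the list is empty, the table covers every transcript in the quant file. If nearly everything is listed, look at the ID format (for example version suffixes) rather than at individual missing transcripts.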

score 2 · Accepted Answer · 2023-07-31

There seems to be a GTF file in the index folder (the download listed in Salmon_index.json doesn't seem to exist), which I assume matches the annotation they used for the indexing. You can build tx2gene from that:

gtf <- rtracklayer::import("/Users/atpoint/Downloads/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.gtf")
gtf <- data.frame(gtf)
tx2gene <- unique(gtf[gtf$type=="transcript",c("transcript_id", "gene_id")])
head(tx2gene)
transcript_id         gene_id
ENST00000456328 ENSG00000223972
ENST00000450305 ENSG00000223972
ENST00000488147 ENSG00000227232
ENST00000619216 ENSG00000278267
ENST00000473358 ENSG00000243485
ENST00000469289 ENSG00000243485
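
One thing to watch: the IDs in this GTF-derived table carry no version suffix, while the FASTA headers from the same asset look like ENST00000631435.1, and Salmon takes its transcript names from the FASTA. If your quant.sf has versioned names, the two will not match as-is. tximport has an ignoreTxVersion argument for this case. Alternatively, here is a rough shell sketch (the function name strip_tx_version is hypothetical) that drops the suffix from column 1 of a tab-separated file, which is handy for comparing ID lists by hand:

```shell
# Remove a trailing ".<digits>" version suffix from column 1 of a tab-separated file.
# Other columns are passed through unchanged; lines without a suffix are left alone.
strip_tx_version() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { sub(/\.[0-9]+$/, "", $1); print }' "$1"
}

# Hypothetical usage:
#   strip_tx_version tx2gene_versioned.tsv > tx2gene_unversioned.tsv
```

Whichever route you take, keep it consistent: either both the quant names and the tx2gene table carry versions, or neither does.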

I generally recommend never using annotation files from non-official sources such as this HBC GitHub repository. I'm not saying it's wrong, and they have great tutorials, but it's not a permanent repository and might be gone tomorrow; without the code and source files, your analysis is not reproducible.