transcripts missing from tx2gene
16 months ago

I am using tximport to prepare quant.sf files generated by salmon for DESeq2 differential expression analysis. However, I get a message telling me that some transcripts are missing from tx2gene. I guess the problem is the tx2gene table, which I don't know how to create.

To create the quant files I used a pre-computed index from this link, as suggested in the salmon documentation, selecting salmon_sa_index:default.

http://refgenomes.databio.org/v3/genomes/splash/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4

How can I find out which reference transcriptome was used for the pre-computed index? With that information, I guess I can create my own tx2gene table and hope the missing transcripts disappear.

Currently, I am using this table:

https://github.com/hbctraining/DGE_workshop_salmon/raw/master/data/tx2gene_grch38_ens94.txt

R transcriptome DE DESeq2 • 1.2k views
16 months ago
GenoMax 147k

How can I know the reference transcriptome used in the pre-computed index?

You can download the FASTA transcriptome archive (FASTA, .fai index and chrom.sizes) used for that index here: http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta_txome?tag=default

This should get you the table you need:

$ grep "^>E" 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.fa | sed 's/>//' | awk -F '[ :]' '{OFS="\t"}{print $1,$10,$16}' | head -5
ENST00000631435.1       ENSG00000282253.1       TRBD1
ENST00000415118.1       ENSG00000223997.1       TRDD1
ENST00000448914.1       ENSG00000228985.1       TRDD3
ENST00000434970.2       ENSG00000237235.2       TRDD2
ENST00000632684.1       ENSG00000282431.1       TRBD1
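
The command above prints only five rows (because of `head -5`) and relies on fixed field positions in the header. A minimal sketch of a variant that looks up the `gene:` and `gene_symbol:` tags by name could look like this. The function name `fasta_to_tx2gene` is hypothetical, and the header layout is an assumption inferred from the output shown above:

```shell
# Sketch: build a tx2gene table from Ensembl-style FASTA headers by tag name.
# Assumption: headers look like
#   >ENST... cdna chromosome:... gene:ENSG... ... gene_symbol:NAME ...
# fasta_to_tx2gene is a hypothetical helper name, not part of any tool.
fasta_to_tx2gene() {
  awk 'BEGIN { OFS = "\t" }
       /^>/ {
         tx = substr($1, 2)            # transcript ID without the leading ">"
         gene = ""; symbol = ""
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^gene:/)        gene   = substr($i, 6)
           if ($i ~ /^gene_symbol:/) symbol = substr($i, 13)
         }
         # skip headers that carry no gene: tag
         if (gene != "") print tx, gene, symbol
       }' "$1"
}

# Usage (writes the full table instead of the first five rows):
#   fasta_to_tx2gene 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.fa > tx2gene.tsv
```

Unlike the `grep "^>E"` filter, this keeps every header that has a `gene:` tag, so check the row count against the number of transcripts you expect.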
16 months ago
ATpoint 85k

There seems to be a GTF file in the index folder (the download referenced in Salmon_index.json doesn't seem to exist), which I assume matches the annotation they used for indexing. You can build tx2gene from that.

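# Import the GTF shipped in the index folder
gtf <- rtracklayer::import("/Users/atpoint/Downloads/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.gtf")
gtf <- data.frame(gtf)
# Keep one row per transcript: transcript ID and the gene it belongs to
tx2gene <- unique(gtf[gtf$type=="transcript",c("transcript_id", "gene_id")])
head(tx2gene)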
transcript_id         gene_id
ENST00000456328 ENSG00000223972
ENST00000450305 ENSG00000223972
ENST00000488147 ENSG00000227232
ENST00000619216 ENSG00000278267
ENST00000473358 ENSG00000243485
ENST00000469289 ENSG00000243485
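
Before running tximport again, it can help to check which transcript IDs are still unmatched. The sketch below assumes the tx2gene table has been written to a tab-separated file with transcript IDs in the first column, and that quant.sf has its usual single header line with the transcript name in the first column. The function name `missing_from_tx2gene` is hypothetical:

```shell
# Sketch: list transcript IDs present in a salmon quant.sf but absent from tx2gene.
# Assumptions: both files are tab-separated, IDs are in column 1,
# quant.sf starts with one header line, and the tx2gene file is not empty.
# missing_from_tx2gene is a hypothetical helper name.
missing_from_tx2gene() {
  awk -F '\t' '
    NR == FNR { seen[$1] = 1; next }        # first file: remember tx2gene IDs
    FNR > 1 && !($1 in seen) { print $1 }   # second file: skip header, report unknown IDs
  ' "$1" "$2"
}

# Usage:
#   missing_from_tx2gene tx2gene.tsv quant.sf | head
```

Note that the IDs printed above carry no version suffix (ENST00000456328), whereas the FASTA headers do (ENST00000631435.1). If the reported IDs differ from the table only by that suffix, tximport's `ignoreTxVersion = TRUE` argument may be enough to reconcile them.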

I generally recommend never using annotation files from non-official sources such as this HBC GitHub repository. I'm not saying it's wrong, and they have great tutorials, but it is not a permanent repository and might be gone tomorrow; without the code and source files, your analysis is not reproducible.

Thank you for your time. Everything works fine now.
