Best way to get gene IDs for Salmon transcript output
4.0 years ago
dk0319 ▴ 70

I generated TPM values from FASTQ data using Salmon, which leaves me with NM_ transcript IDs. I would like to get gene symbols from these transcript IDs. BioMart does not recognize the transcripts, and NCBI Datasets produces an error when I run it on the entire transcriptome. I have been exploring tximport and tximeta, but I have run into numerous issues, particularly with tximeta not detecting my reference file. Any advice would be greatly appreciated.

Update: I now have tximport and tximeta running; however, they create S4 objects, and I am unsure how to make these readable.

RNA-Seq R • 2.0k views
3.9 years ago
dk0319 ▴ 70

tximeta was able to compile all my quant.sf files and summarize them to gene level.
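For readers who want to see what gene-level summarization amounts to outside of R, here is a minimal Python sketch. It is not what tximeta does internally; it simply sums the TPM column of one Salmon quant.sf file per gene. The function name `gene_level_tpm` and the `tx2gene` dictionary (transcript accession without version mapped to a gene symbol) are hypothetical and would need to be built from your own annotation. It assumes the standard quant.sf columns `Name` and `TPM`, and that transcript names are plain accessions such as `NM_xxxxxx.N`.

```python
import csv
import io
from collections import defaultdict


def gene_level_tpm(quant_handle, tx2gene):
    """Sum transcript TPMs per gene for one quant.sf file.

    quant_handle: open text handle on a Salmon quant.sf file.
    tx2gene: dict mapping accession WITHOUT version -> gene symbol
             (hypothetical input, built from your own annotation).
    Returns (totals, unmapped): per-gene TPM sums and the transcript
    names that had no entry in tx2gene.
    """
    reader = csv.DictReader(quant_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    totals = defaultdict(float)
    unmapped = []
    for row in reader:
        # Drop the ".N" version suffix so lookups survive version bumps.
        base = row["Name"].split(".")[0]
        gene = tx2gene.get(base)
        if gene is None:
            unmapped.append(row["Name"])
            continue
        totals[gene] += float(row["TPM"])
    return dict(totals), unmapped


# Made-up toy data, for illustration only.
toy_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "NM_TOY1.1\t1000\t800\t1.5\t10\n"
    "NM_TOY2.3\t2000\t1800\t2.5\t40\n"
    "NM_TOY3.1\t500\t300\t7.0\t5\n"
)
toy_map = {"NM_TOY1": "GENE_A", "NM_TOY2": "GENE_A"}
totals, unmapped = gene_level_tpm(toy_quant, toy_map)
print(totals, unmapped)
```

Summing TPM is reasonable for a quick look at expression per gene; for differential expression, the count and length handling in tximport or tximeta is the better-supported route.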

4.0 years ago
vkkodali_ncbi ★ 3.8k

"I would like to generate the gene symbols from the transcript IDs"

If you need just the gene symbols, and not the sequences, you can parse the gene2refseq.gz file. Depending on the age of your input set of accessions, you may not find information for all of them, because the gene2refseq.gz file is regularly updated and any NM_ accessions that are no longer the latest version are absent. You can download gene2refseq.gz from https://ftp.ncbi.nlm.nih.gov/gene/DATA/

As a first pass, you can extract information for as many NM_ accessions as you can from this file and then use Entrez Direct or NCBI Datasets to get the information for the remaining ones.
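The first pass described above could look roughly like the following Python sketch. The function names are hypothetical. It assumes the file is tab-delimited with a header line beginning `#tax_id` and containing the columns `RNA_nucleotide_accession.version`, `GeneID`, and `Symbol`; check these against the header of the copy you download. Accessions are matched without their version suffix, which is a deliberate loosening so that older versions of an NM_ accession still find a gene.

```python
import csv
import gzip
import io


def load_tx2gene(handle, wanted=None):
    """Map RNA accession (version stripped) -> (GeneID, Symbol).

    handle: open text handle on gene2refseq content.
    wanted: optional set of accessions (no version) to keep;
            None keeps everything, which can use a lot of memory.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Column names are looked up from the header rather than hard-coded
    # positions; the names themselves are an assumption to verify.
    header = [name.lstrip("#") for name in next(reader)]
    acc_i = header.index("RNA_nucleotide_accession.version")
    gid_i = header.index("GeneID")
    sym_i = header.index("Symbol")
    mapping = {}
    for row in reader:
        acc = row[acc_i]
        if acc == "-":  # rows without an RNA accession
            continue
        base = acc.split(".")[0]
        if wanted is not None and base not in wanted:
            continue
        mapping[base] = (row[gid_i], row[sym_i])
    return mapping


def load_tx2gene_from_gz(path, wanted=None):
    """Same as load_tx2gene, reading directly from a .gz file on disk."""
    with gzip.open(path, "rt", newline="") as handle:
        return load_tx2gene(handle, wanted)


# Made-up toy rows with a reduced set of columns, for illustration only.
toy = io.StringIO(
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\tSymbol\n"
    "9606\t1\tREVIEWED\tNM_TOY1.2\tGENE_A\n"
    "9606\t2\tREVIEWED\t-\tGENE_B\n"
)
print(load_tx2gene(toy, wanted={"NM_TOY1"}))
```

On the real file you would call something like `load_tx2gene_from_gz("gene2refseq.gz", wanted=my_accessions)`, where `my_accessions` is the set of NM_ accessions from your quant.sf files. Any accession missing from the result is a candidate for the Entrez Direct or NCBI Datasets follow-up.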
