I have tried bioDBnet and BioMart to retrieve the gene name/gene ID from coding transcript IDs, but I could not map about 22,000 transcripts. Is there any other database or resource that I could use to map these transcripts?
Thanks in advance.
Please elaborate on the IDs that you have. There are many different types. Even paste some here, if you can. 'Coding transcript id' does not inform us if you have ENSEMBL transcript IDs, RefSeq IDs, or something else. Also, do you want HGNC / HUGO gene symbols? Thanks, Kevin.
Hi, the IDs are like this: uc001aab.3, uc001aai.1, uc001aam.3, uc001aav.3, uc001aaz.2, uc001aba.1, uc001abc.2. I want HGNC gene symbols.
Thanks all for your replies. I am referring to the link above, posted by toralmanvar: https://webshare.bioinf.unc.edu/public/mRNAseq_TCGA/rsem_ref/unc_knownToLocus.txt There are about 73,000 transcripts. But are they coding or non-coding transcripts?
I think the non-coding RNAs were sequenced by a different TCGA center, so these should be coding, AFAIK.
I downloaded the non-coding RNA list from https://www.genenames.org/cgi-bin/statistics and matched it against the entries in https://webshare.bioinf.unc.edu/public/mRNAseq_TCGA/rsem_ref/unc_knownToLocus.txt Around 2,000 matched. Is there a database with a complete set of non-coding transcripts, so that I can download all of them and remove them from my list?
I checked with some TCGA folks and was told that there were some non-coding sequences present but the list was not comprehensive.
AFAIK GDC redid the entire RNA-seq data analysis and must have used recent IDs. Is there a reason you are still using this old data?
They do appear to have re-analysed all of the RNA-seq data. I have been downloading all TCGA RSEM count data from the GDC Legacy Archive. This was previously not available for all cancers. The gene names in these files are HGNC IDs, which helps.
I have a list of transcript IDs (coding and non-coding) like uc011kvo.1, uc001aaa.3, uc001aab.3, uc001aai.1, uc001aak.2, uc001aal.1, uc001aam.3, uc001aau.2, uc001aav.3, uc001aaz.2 (approx. 27,000 in total). I just want to convert them to gene symbols and separate them into coding and non-coding.
You cannot distinguish coding from non-coding going by HGNC symbols, but you can do this by converting to RefSeq. In RefSeq, an 'NM_' prefix indicates a protein-coding transcript (mRNA), whilst 'NR_' indicates a non-coding RNA. See Table 1 - RefSeq accession numbers and molecule types.
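The prefix rule above could be sketched in Python roughly as follows. This is only an illustration: the function name `classify_refseq` and the labels it returns are made up for this example. It also treats the predicted-model prefixes 'XM_' and 'XR_' like 'NM_' and 'NR_', which is an assumption about what you want, and anything without a RefSeq prefix is simply reported as unknown.

```python
def classify_refseq(accession: str) -> str:
    """Label an accession as coding / non-coding based on its RefSeq prefix.

    Hypothetical helper for illustration; not part of any library.
    """
    prefix = accession.strip().upper()[:3]
    if prefix in ("NM_", "XM_"):   # curated / predicted mRNA
        return "coding"
    if prefix in ("NR_", "XR_"):   # curated / predicted non-coding RNA
        return "non-coding"
    return "unknown"               # not a RefSeq transcript accession


# Accessions below are only examples of the format.
examples = ["NM_000546", "NR_046018", "BC032353", ""]
labels = {acc: classify_refseq(acc) for acc in examples}
for acc, label in labels.items():
    print(repr(acc), label)
```

Accessions that come back as "unknown" would need another source of annotation, since the prefix alone says nothing about them.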
Pierre manages to convert to both HGNC and RefSeq, here: How to convert UCSC ID to gene symbol
An alternative is to convert to HGNC symbol and then look up the gene's biotype in the .
This may be a good case for submitting a ticket to UCSC genome browser support (genome at soe.ucsc.edu, they sometimes participate here but not frequently). They may be able to tell you how to do this classification.
Hi Pierre,
Your code works well on my Linux machine. But I have a list of 27,000 IDs, so how can I do this for all of them?
Download the file from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/kgXrefOld5.txt.gz , sort it, and use the Linux 'join' command.
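If you would rather not sort and 'join' on the command line, the same lookup can be sketched in Python. This is a minimal sketch under some assumptions: that the file is tab-separated and follows the UCSC kgXref column order (kgID first, geneSymbol fifth, refseq sixth), which should be checked against the matching .sql file on the download server. The function names and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed 0-based column positions, following the UCSC kgXref layout.
KG_ID, GENE_SYMBOL, REFSEQ = 0, 4, 5


def load_xref(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Build a dict: UCSC transcript ID -> (gene symbol, RefSeq accession)."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > REFSEQ:          # skip short or malformed rows
            table[fields[KG_ID]] = (fields[GENE_SYMBOL], fields[REFSEQ])
    return table


def map_ids(ids: List[str], table: Dict[str, Tuple[str, str]]
            ) -> Dict[str, Optional[Tuple[str, str]]]:
    """Look up every ID; IDs missing from the table map to None."""
    return {tx_id: table.get(tx_id) for tx_id in ids}


# Made-up rows in the assumed layout, standing in for the real file.
sample = [
    "uc001aaa.3\tNR_000001\t\t\tGENE_A\tNR_000001\t\tmade-up description\n",
    "uc001aab.3\tNM_000002\tP00000\tX_HUMAN\tGENE_B\tNM_000002\tNP_000002\tmade-up\n",
]
table = load_xref(sample)
mapped = map_ids(["uc001aaa.3", "uc001aab.3", "uc999zzz.1"], table)
for tx_id, hit in mapped.items():
    print(tx_id, hit)
```

For the real file, the lines could come from `gzip.open("kgXrefOld5.txt.gz", "rt")` instead of the sample list. IDs that map to None are the ones to follow up on separately.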
Hi, I have extracted all 24,000 transcripts with their gene names, but the data contains coding, non-coding and pseudogene entries. How can I distinguish among them? I want only coding genes. The data is like:
Also, as Kevin said above, an 'NM' prefix indicates a coding gene, whilst 'NR' indicates non-coding, but the data also contains other prefixes: BC, AM, X6, CR, AK, AB, FJ, etc.
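To see how many transcripts fall outside the NM/NR rule, one option is to tally the accession prefixes first. The sketch below is only illustrative: the helper name `accession_prefix` and the accession values are made up, and it assumes the prefix is the leading run of letters plus an optional underscore.

```python
import re
from collections import Counter


def accession_prefix(accession: str) -> str:
    """Return the leading letters (plus '_' if present), upper-cased."""
    match = re.match(r"[A-Za-z]+_?", accession.strip())
    return match.group(0).upper() if match else ""


# Made-up accessions, used only to show the tallying.
accessions = ["NM_000001", "NM_000002", "NR_000003",
              "BC000004", "AK000005", "X60006"]
counts = Counter(accession_prefix(acc) for acc in accessions)
for prefix, n in counts.most_common():
    print(prefix, n)
```

Accessions whose prefix is not 'NM_' or 'NR_' are not RefSeq transcript accessions, so the prefix rule cannot classify them and they would need a separate lookup.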