The Way To Write Script To Validate If The Given Transcript Id Is The Latest Version
12.7 years ago
jessada ▴ 150

My data from VariBench includes transcript IDs along with their version numbers, and I want to pick only the ones that carry the latest version. Is there anywhere I can download a transcript database, or an online web service I can query?

transcript snp ncbi • 3.0k views
12.7 years ago

Those look like RefSeq transcript IDs. You can download the current release of RefSeq here: RefSeq vertebrate mammalian. The *.rna.gbff.gz files in this directory contain a GenBank record for each RefSeq ID and should specify the latest version. You would just need to grab the 'ACCESSION' and 'VERSION' values for each record. For example:

ACCESSION   XM_002714324
VERSION     XM_002714324.1  GI:291395911
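
As a rough illustration of that idea, here is a minimal Python sketch that collects the version from each 'VERSION' line and checks a versioned ID against it. The function names (`read_latest_versions`, `is_latest`, `load_from_gz`) are made up for this example, and it assumes the flat file keeps the 'VERSION' keyword at the start of the line, as in the record above.

```python
import gzip
import io


def read_latest_versions(handle):
    """Map each accession to the version number found on its VERSION line."""
    latest = {}
    for line in handle:
        # VERSION lines look like: "VERSION     XM_002714324.1  GI:291395911"
        if line.startswith("VERSION"):
            fields = line.split()
            if len(fields) < 2:
                continue
            accession, _, version = fields[1].partition(".")
            if version.isdigit():
                latest[accession] = int(version)
    return latest


def is_latest(versioned_id, latest):
    """True/False when the accession is known, None when it is not in the map."""
    accession, _, version = versioned_id.partition(".")
    if accession not in latest or not version.isdigit():
        return None
    return int(version) == latest[accession]


def load_from_gz(path):
    # Hypothetical helper: `path` points at a downloaded *.rna.gbff.gz file.
    with gzip.open(path, "rt") as handle:
        return read_latest_versions(handle)


# Small in-memory sample built from the record shown above.
sample = io.StringIO(
    "ACCESSION   XM_002714324\n"
    "VERSION     XM_002714324.1  GI:291395911\n"
)
latest = read_latest_versions(sample)
print(is_latest("XM_002714324.1", latest))
```

For real data you would call `load_from_gz` on each downloaded file and merge the resulting dictionaries before checking your own IDs.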

Another option would be to use the NCBI E-utilities. For example, use esearch to get the uid for each RefSeq ID, then pass that uid to esummary to get the RefSeq ID with its current version number.

The following returns an XML for 'NM_000014' (note that no version is specified here) containing the uid '66932946':

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=NM_000014

The following returns an XML for the uid '66932946':

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nuccore&id=66932946

This XML contains a line: gi|66932946|ref|NM_000014.4|[66932946]

This tells you that the RefSeq transcript is currently on version 4. Of course, you would need a script to automate this process for the number of records that you have.
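
One way such a script might look is sketched below in Python, using only the standard library. It assumes the esearch XML holds the uid in an `<Id>` element and that the esummary response still contains a `ref|ACCESSION.VERSION|` string like the one quoted above; both are assumptions about the response format, and all function names are invented for this example. Nothing here is executed against NCBI until you call `latest_version_online` yourself.

```python
import re
import urllib.parse
import urllib.request

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_url(tool, **params):
    """Build an E-utilities URL such as .../esearch.fcgi?db=nuccore&term=..."""
    return f"{EUTILS}/{tool}.fcgi?{urllib.parse.urlencode(params)}"


def fetch(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def first_uid(esearch_xml):
    # Assumption: the uid appears as <Id>...</Id>; only the first hit is used.
    match = re.search(r"<Id>(\d+)</Id>", esearch_xml)
    return match.group(1) if match else None


def current_version(esummary_xml, accession):
    # Looks for the "ref|NM_000014.4|" style string described in the answer.
    pattern = r"ref\|" + re.escape(accession) + r"\.(\d+)\|"
    match = re.search(pattern, esummary_xml)
    return int(match.group(1)) if match else None


def latest_version_online(accession):
    """esearch for the uid, then esummary for the current version (or None)."""
    uid = first_uid(fetch(build_url("esearch", db="nuccore", term=accession)))
    if uid is None:
        return None
    summary = fetch(build_url("esummary", db="nuccore", id=uid))
    return current_version(summary, accession)
```

To check an ID such as 'NM_000014.3', compare its version number with `latest_version_online("NM_000014")`. NCBI rate-limits E-utilities requests, so for many records it is worth pausing between calls.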


Can you explain more about the directory structure of the files in RefSeq vertebrate mammalian? I saw about 100 sets of files there, with around 6-10 files in each set. If I only need human mRNA transcripts, which groups of files should I download? FYI, I'm really new to biology but have a very strong background in computer science.


There are 6 data sets in the RefSeq FTP directory, each with a specific file format. Each of these 6 data sets is split into 144 numbered blocks so that no single file gets too large. This is sort of explained here: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/NOTICE_OF_FILE_FORMAT_CHANGE. The six file types are: genomic.fna (genome data in FASTA nucleic acid format), genomic.gbff (genome data in GenBank flat file format), protein.faa (protein data in FASTA amino acid format), protein.gpff (protein data in GenPept flat file format), rna.fna (RNA data in FASTA nucleic acid format), rna.gbff (RNA data in GenBank flat file format).
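
Since only the rna.gbff blocks are needed for checking transcript versions, a directory listing can be filtered by file type. The short Python sketch below shows one way to do that; the file names in `listing` are hypothetical examples of the block naming, not copied from the FTP site, and `select_files` is a name made up here.

```python
from fnmatch import fnmatch


def select_files(names, file_type="rna.gbff"):
    """Keep only the gzipped files of one of the six file types."""
    return [name for name in names if fnmatch(name, f"*.{file_type}.gz")]


# Hypothetical listing; real block names and counts may differ.
listing = [
    "vertebrate_mammalian.1.rna.gbff.gz",
    "vertebrate_mammalian.1.rna.fna.gz",
    "vertebrate_mammalian.1.protein.faa.gz",
    "vertebrate_mammalian.2.rna.gbff.gz",
]
rna_flat_files = select_files(listing)
print(rna_flat_files)
```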


If you go the Refseq FTP route, it might be more convenient to work with the human specific files here: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/
