Hi,
Is there a way to retrieve the longest transcripts for a list of 4000 genes? I want to use the NM_transcript ids from NCBI. I just need the NM_id for the longest transcript.
Well, I am no expert on this, so until someone answers you properly, here is my comment. Go to NCBI, select Nucleotide, then type anything and press Enter to run a nucleotide database search and reach the Advanced options. Then select sequence length and enter 70000 to 999999 (of course you have to set other options to restrict the search to mRNA only). I got all of the above from the following website, where I learned this trick for searching for the longest (or shortest) DNA/mRNA/protein sequences:
http://wiki.bits.vib.be/index.php/Exercises_on_Genbank
(look under Exercise 3).
Regards,
Mohamed
So this information is not just easily retrievable from NCBI or UCSC?
I think it's possible in ENSEMBL via biomart. In filters select refseq id and put your id list in it. And in attributes select transcript length.
It should be easy to do through their API.
Hi NicoBxl,
I am working on a similar project to the one discussed above, and I feel that I am very close to being able to do this by following your directions. However, I am not able to find "transcript length" in attributes, and it also does not seem to be pulling up multiple RefSeq IDs for genes that have many transcripts (e.g. DMD). Any suggestions?
Thank you!
Renee Bend
Start by breaking down the problem into sub-problems, like so:
Solve these, and you've figured out how to automate pretty much any querying using NCBI.
Or, explore UCSC Genome Browser's underlying MySQL tables and check if SQL can help you make your job easier.
What Ram said, though in this particular instance the UCSC database might be annoying to use, since you have to calculate the transcript widths yourself (the most useful coordinates are the exon start/stop positions and those are all comma separated instead of being different entries). You might have better luck using biomart. You'll have to process the query, but it's simple enough to get the length of a large number of transcripts.
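To illustrate the "calculate the transcript widths yourself" step, here is a minimal Python sketch. It assumes you have already pulled the comma-separated exon start and end strings for one transcript (in UCSC's refGene table these are the exonStarts and exonEnds columns, with 0-based starts and exclusive ends). The function name transcript_length is made up for this example, and the coordinates in the usage line are invented.

```python
def transcript_length(exon_starts: str, exon_ends: str) -> int:
    """Sum exon widths from UCSC-style comma-separated coordinate strings.

    Assumes 0-based starts and exclusive ends, so each exon's width is
    simply end - start. A trailing comma (as UCSC writes) is tolerated.
    """
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    if len(starts) != len(ends):
        raise ValueError("exon start and end lists differ in length")
    return sum(end - start for start, end in zip(starts, ends))


# Invented coordinates: two exons of 100 bp and 150 bp.
print(transcript_length("100,300,", "200,450,"))
```

This gives the spliced (exon-only) length, which is usually what people mean by "longest transcript"; the difference between txEnd and txStart would include introns.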
I will use the BioMart tool to get the transcript start and stop, together with the gene names and the NM_id. I will then use my own script to calculate the length and keep only the longest transcript.
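For anyone wanting to do the same, a rough sketch of such a script in Python could look like the one below. It assumes a tab-separated export with four columns in this order: gene name, NM id, transcript start, transcript end, with 1-based inclusive coordinates. The column order, the function name longest_per_gene and the sample rows are all made up for illustration. Note that start-to-stop gives the genomic span (introns included), not the spliced mRNA length, so swap in an exon-based length if that is what you need.

```python
import csv
import io


def longest_per_gene(tsv_text: str) -> dict:
    """Return {gene: (nm_id, length)} keeping the longest transcript per gene.

    Length is computed as end - start + 1, assuming 1-based inclusive
    coordinates. On ties, the first transcript seen is kept.
    """
    best = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for gene, nm_id, start, end in reader:
        length = int(end) - int(start) + 1
        if gene not in best or length > best[gene][1]:
            best[gene] = (nm_id, length)
    return best


# Invented example rows (not real genes or accessions).
sample = (
    "GENE_A\tNM_example_1\t1000\t1999\n"
    "GENE_A\tNM_example_2\t1000\t2499\n"
    "GENE_B\tNM_example_3\t500\t899\n"
)
for gene, (nm_id, length) in longest_per_gene(sample).items():
    print(gene, nm_id, length)
```

If your export has a header line, skip it before passing the text in, or switch to csv.DictReader and look the columns up by name.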