Question

How does RefSeq get their transcript sequences?

10.7 years ago

pwg46 ▴ 540

I have been looking at the rna.fa.gz file in the RefSeq database. For the majority of the NM accessions, the sequence doesn't start with ATG, so I thought perhaps the RNA file contains the entire mRNA sequence and not just the coding segments. To check, I took an NM accession, its chromosomal CDS start position, and the chromosomal start position of its first exon (all from another data file provided by RefSeq) to work out where in the NM sequence the coding region should begin. Even then, there was still no 'ATG' there.

Also, when an NM and an ENST map perfectly to each other, the NM sequence given in the rna.fa file is completely different from the ENST sequence given in Ensembl's own data file. The chromosomal positions of the ENST and the NM match exactly, on the same chromosome and on the same GRCh38 build, yet the sequences in the two data files differ. Could someone please clarify how RefSeq derives its transcript sequences?
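For reference, the coordinate arithmetic described above can be sketched as follows. This is only a minimal illustration, not RefSeq's method. It assumes a UCSC-style annotation row with 0-based, half-open exon intervals and a strand column; the actual layout of the data file used here is not stated, and the function names are made up for this example. Subtracting the first exon start from the CDS start only works when the start codon lies in the first exon of a plus-strand transcript; otherwise the introns and the strand have to be taken into account.

```python
def genomic_to_transcript_offset(exons, strand, pos):
    """Map a 0-based genomic position to a 0-based offset in the spliced transcript.

    exons  : list of (start, end) genomic intervals, 0-based, half-open (assumed convention)
    strand : '+' or '-'
    """
    # Walk the exons in transcription order: ascending for '+', descending for '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos < end:
            # Distance from the 5' end of this exon to pos.
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start  # skip the whole exon; introns contribute nothing
    raise ValueError("position is not inside any exon")


def start_codon_offset(exons, strand, cds_start, cds_end):
    """Offset of the first base of the start codon in the spliced transcript.

    On the minus strand the start codon sits at the right-hand (cds_end) side.
    """
    first_base = cds_start if strand == "+" else cds_end - 1
    return genomic_to_transcript_offset(exons, strand, first_base)


# Toy coordinates (hypothetical, not a real gene).
exons = [(100, 110), (200, 220)]
print(start_codon_offset(exons, "+", 205, 218))  # 15, not 205 - 100 = 105
print(start_codon_offset(exons, "-", 103, 215))  # 5
```

If the transcript sequence from rna.fa.gz is `seq`, then `seq[offset:offset + 3]` would be expected to read `ATG` for a typical coding transcript, provided the annotation and the sequence describe the same transcript version.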

atg sequence refseq nm identifier • 4.0k views

Updated 3.4 years ago by Ram 45k • written 10.7 years ago by pwg46 ▴ 540

Here is a link with the detailed process of curating RefSeq transcripts.

Updated 3.4 years ago by Ram 45k • written 10.7 years ago by roy.granit ▴ 890

There's no reason to expect a transcript sequence to start with ATG; in fact, it usually won't. Unless you're looking at non-coding sequences, though, they should typically contain an ATG. Can you give an example of a mismatch between the RefSeq and corresponding Ensembl sequences for the exact same transcript?
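Before posting an example, a quick check like the one below can help characterize how two sequences differ. It is a rough sketch with hypothetical function names. It only tests for three simple relationships (identical, one contained in the other, reverse complement); these are possibilities to rule out, not a diagnosis of this particular case.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A/C/G/T/N only)."""
    comp = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(comp)[::-1]


def describe_relationship(a, b):
    """Classify how two transcript sequences relate to each other."""
    # Normalize case and treat RNA 'U' as DNA 'T' so the comparison is fair.
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    if a == b:
        return "identical"
    if a in b or b in a:
        # For example, same exons but one record has longer UTRs.
        return "contained"
    rc = revcomp(a)
    if rc == b or rc in b or b in rc:
        # One file may report the genomic strand rather than the transcript.
        return "reverse-complement"
    return "different"


# Toy sequences (hypothetical).
print(describe_relationship("ATGGCC", "ggATGGCCtt"))  # contained
print(describe_relationship("AAGT", "ACTT"))          # reverse-complement
```

A result of "different" would still need a proper alignment to tell a few mismatched bases apart from two unrelated sequences.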

Updated 3.4 years ago by Ram 45k • written 10.7 years ago by Devon Ryan 105k