Question

How Can I Determine The Coding Sequence Of A Gene, Given The Genbank Accession Number?

0

13.4 years ago

Arman • 0

accession number: NM_001129809

Doing a nucleotide BLAST gives me the whole sequence, just wondering how to find the coding sequence after that.

cds genbank data • 17k views

score 2 · Answer 1 · 2011-12-20

2

13.4 years ago

Gjain 5.8k

this should help you:

score 2 · Answer 2 · 2011-12-20

Locus NM_001129809 in GenBank is Strongylocentrotus purpuratus lefty (LOC577374), mRNA. 'mRNA' is an abbreviation for 'messenger RNA'. The GenBank entry is annotated with a CDS feature at bases 112..1311, so the coding sequence is that 1200-base stretch of the mRNA. 'CDS' is an abbreviation for 'coding sequence'.

Is the problem that you do not understand the concepts (mRNA, CDS), or that you didn't realise that you could look up NM_001129809 in Genbank, or something else?

You have almost answered your own question, so I'm not sure how to help!
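Once you have the full sequence and the CDS location, cutting out the coding sequence is a single slice. Here is a minimal sketch in Python, assuming a simple location of the form start..end on the forward strand (no join or complement); the function name and the toy sequence are made up for illustration and are not the real NM_001129809 data:

```python
def extract_region(sequence: str, start: int, end: int) -> str:
    """Return the subsequence for a 1-based, inclusive location such as 112..1311."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("location falls outside the sequence")
    # Feature-table coordinates are 1-based and inclusive; Python slices are
    # 0-based and end-exclusive, so only the start is shifted by one.
    return sequence[start - 1:end]


# Toy sequence (hypothetical, not NM_001129809): the CDS sits at 4..12.
toy_mrna = "GGCATGAAATAGCC"
toy_cds = extract_region(toy_mrna, 4, 12)
print(toy_cds)  # ATGAAATAG
```

For the real entry you would call extract_region(full_sequence, 112, 1311), which returns 1311 - 112 + 1 = 1200 bases.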

Ram · Answer 3 · 2011-12-21

0

13.4 years ago

Manu Prestat 4.1k

OK, here is an old tool, but maybe the most efficient one I know:

Download the queryWin client here (Mac, Linux and Windows supported).
Once launched, open the relevant database (RefSeq RNA in your case).
Type ac=NM_001129809 in the search field.
Select the result in the list.
Click the "extract seq to file" button -> extract feature region.
Choose the "CDS" region.

and that's it!

0

@Manu sadly the link now gives a 404. Still, ACNUC is a great solution for sequence manipulations on a database sequence.

Written 13.2 years ago by Hamish ★ 3.3k • updated 5.6 years ago by Ram 45k

0

I just tested: it works for me... I don't know any better tool to conduct (at least) this specific task.

Written 13.2 years ago by Manu Prestat 4.1k

score 0 · Answer 4 · 2012-02-29

There is a wide range of ways to do this, and the choice depends largely on which software you have access to and which you are most comfortable with. However, as Keith has pointed out, you have to make sure you understand the terminology and how the various biological constructs are represented in the databases of interest.

The identifier NM_001129809 is from RefSeq (RefSeq is not GenBank). RefSeq uses an extended version of the International Nucleotide Sequence Database Collaboration (INSDC) feature table specification (see "The DDBJ/EMBL/GenBank Feature Table") to describe the various features on the sequence.
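To make the feature table idea concrete, here is a rough Python sketch that pulls simple feature locations out of flat-file text. It assumes the usual GenBank-style layout (feature key indented by five spaces, location on the same line) and only handles plain start..end locations; the function name and the excerpt are hypothetical, and a real parser from one of the Bio* libraries is a better choice for anything serious:

```python
import re

# Matches a feature-table line such as "     CDS             112..1311".
# Only plain "start..end" locations are handled; join(...), complement(...)
# and partial (< or >) locations are deliberately out of scope here.
FEATURE_LINE = re.compile(r"^\s{5}(\S+)\s+(\d+)\.\.(\d+)\s*$")


def find_features(flatfile_text: str, key: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based inclusive, for features with this key."""
    spans = []
    for line in flatfile_text.splitlines():
        match = FEATURE_LINE.match(line)
        if match and match.group(1) == key:
            spans.append((int(match.group(2)), int(match.group(3))))
    return spans


# Hypothetical excerpt in GenBank flat-file layout (coordinates are made up).
excerpt = """\
FEATURES             Location/Qualifiers
     source          1..20
                     /mol_type="mRNA"
     gene            1..20
     CDS             4..12
                     /codon_start=1
"""
print(find_features(excerpt, "CDS"))  # [(4, 12)]
```

Qualifier lines (those starting with "/") are indented further than five spaces, so the pattern skips them and only feature keys are matched.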

The RefSeq nucleotide database is available through a wide range of online services, which provide different capabilities. For example:

NCBI Entrez
SRS at EMBL-EBI (and many other SRS servers, see Public SRS Server List)
ACNUC
MRS
DAS servers, see The DAS Registry

In NCBI Entrez and SRS at EMBL-EBI you can get information about a specific feature, including the sequence of the feature, by clicking on the feature key (e.g. 'CDS', 'gene', etc.). DAS clients commonly support extracting the sequence of a feature too. Manu's answer describes the procedure for getting the CDS sequence when using ACNUC.

Given the entry data, there are also tools which can extract feature sequences; for example, the EMBOSS suite includes the extractfeat program for this purpose.

If you want to do it programmatically, libraries such as BioJava, BioPerl, Biopython and BioRuby include modules for this kind of operation. Alternatively, web services (see "Introduction to Web Services") could be used, either individually or combined (see BioCatalogue), to do this.
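As one illustration of the web-service route, NCBI's E-utilities EFetch service can return the annotated CDS of a nucleotide record directly as FASTA. The sketch below uses only the Python standard library; the function names are my own, and it assumes the fasta_cds_na return type is still offered for the nuccore database, so check the current E-utilities documentation before relying on it:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def cds_fasta_url(accession: str) -> str:
    """Build an EFetch URL asking for the CDS features of a nucleotide record."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta_cds_na",  # nucleotide FASTA of each annotated CDS
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_cds_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the CDS FASTA text (needs network access)."""
    with urlopen(cds_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = cds_fasta_url("NM_001129809")
print(url)
```

Calling fetch_cds_fasta("NM_001129809") should then return the coding sequence as FASTA text; it is left uncalled here because it needs a network connection.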