How Can I Determine The Coding Sequence Of A Gene, Given The GenBank Accession Number?
12.9 years ago
Arman • 0

accession number: NM_001129809

Doing a nucleotide BLAST gives me the whole sequence; I am just wondering how to find the coding sequence after that.

cds genbank data • 17k views
12.9 years ago

Locus NM_001129809 in GenBank is Strongylocentrotus purpuratus lefty (LOC577374), mRNA. 'mRNA' is an abbreviation for 'messenger RNA'. The GenBank entry is annotated with a CDS feature at bases 112..1311. 'CDS' is an abbreviation for 'coding sequence'.

Is the problem that you do not understand the concepts (mRNA, CDS), or that you didn't realise that you could look up NM_001129809 in GenBank, or something else?

You have almost answered your own question, so I'm not sure how to help!
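
Once the CDS coordinates are known, pulling out the coding sequence is just a slice of the mRNA. The sketch below assumes the feature-table convention that 112..1311 is 1-based and inclusive at both ends. The function name `extract_region` is made up for illustration, and the `mrna` string is a placeholder of the right shape, not the real NM_001129809 sequence.

```python
def extract_region(sequence: str, start: int, end: int) -> str:
    """Return the bases from start to end, both 1-based and inclusive."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Placeholder mRNA, NOT the real record: 111 bases of 5' UTR,
# a 1200-base CDS (start codon, 398 codons, stop codon), then a 3' UTR.
mrna = "g" * 111 + "atg" + "gct" * 398 + "taa" + "c" * 50

cds = extract_region(mrna, 112, 1311)
print(len(cds), len(cds) % 3, cds[:3], cds[-3:])
```

A quick sanity check on any extracted CDS is that its length (here 1311 - 112 + 1 = 1200) is a multiple of three.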

12.9 years ago

Ok, here is an old tool, but maybe the most efficient I know:

  1. Download the queryWin client here (Mac, Linux and Windows are supported).

  2. Once launched, open the relevant database (RefSeq RNA in your case).

  3. Type ac=NM_001129809 in the search field.

  4. Select the result in the list.

  5. Click the "extract seq to file" button -> extract feature region.

  6. Choose the "CDS" region.

and that's it!


@Manu, sadly the link now gives a 404. Still, ACNUC is a great solution for sequence manipulations on a database sequence.


I just tested: it works for me... I don't know any better tool to conduct (at least) this specific task.

12.7 years ago
Hamish ★ 3.3k

There is a wide range of ways of doing this, and the choice depends largely on which software you have access to and which you are most comfortable with. However, as Keith has pointed out, you have to make sure you understand the terminology and how the various biological constructs are represented in the databases of interest.

The identifier NM_001129809 is from RefSeq (RefSeq is not GenBank). RefSeq uses an extended version of the International Nucleotide Sequence Database Collaboration (INSDC) feature table specification (see "The DDBJ/EMBL/GenBank Feature Table") to describe the various features on the sequence.

The RefSeq nucleotide database is available in a wide range of on-line services, which provide different capabilities. For example:

In NCBI Entrez and SRS at EMBL-EBI you can get information about a specific feature, including the sequence of the feature, by clicking on the feature key (e.g. 'CDS', 'gene', etc.). DAS clients commonly provide support for extracting a sequence for a feature too. Manu's answer describes the procedure for getting the CDS sequence when using ACNUC.

Given the entry data, there are also tools which can extract feature sequences; for example, the EMBOSS suite includes the extractfeat program for this purpose.

If you want to do it programmatically, then libraries such as BioJava, BioPerl, BioPython and BioRuby include modules for performing this kind of operation. Alternatively, web services (see "Introduction to Web Services") could be used, alone or in combination (see BioCatalogue), to do this.
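
To give a feel for what such a module does, here is a minimal standard-library sketch that reads the CDS location from the feature table of a flat-file entry and slices it out of the ORIGIN sequence. It is not how any of the libraries above is implemented. It assumes a simple forward-strand `start..end` location, so `join(...)`, `complement(...)` and partial locations are not handled, and for those a real library is the better choice. The function name `cds_from_flatfile` and the tiny `DEMO_0001` record are made up for illustration.

```python
import re


def cds_from_flatfile(text: str) -> str:
    """Extract the CDS sequence from GenBank/RefSeq flat-file text.

    Assumption: the CDS location is a plain forward-strand 'start..end'.
    """
    match = re.search(r"^ +CDS +(\d+)\.\.(\d+) *$", text, re.MULTILINE)
    if match is None:
        raise ValueError("no simple CDS location found")
    start, end = int(match.group(1)), int(match.group(2))

    _, found, after_origin = text.partition("\nORIGIN")
    if not found:
        raise ValueError("no ORIGIN section found")
    # Skip the remainder of the ORIGIN line, stop at the '//' terminator.
    seq_block = after_origin.partition("\n")[2].partition("\n//")[0]
    # Sequence lines carry position numbers and spaces; keep letters only.
    sequence = re.sub(r"[^A-Za-z]", "", seq_block)

    # Feature coordinates are 1-based and inclusive.
    return sequence[start - 1:end]


# Made-up miniature record, for illustration only.
record = """LOCUS       DEMO_0001                 30 bp    mRNA    linear   INV
FEATURES             Location/Qualifiers
     source          1..30
     CDS             7..18
                     /product="hypothetical demo protein"
ORIGIN
        1 ggggggatgg ctgcttaacc cccccccccc
//
"""

print(cds_from_flatfile(record))  # prints "atggctgcttaa"
```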
