How Can I Get Coding Sequences For A List Of Genes?
4
4
Entering edit mode
14.2 years ago
David B ▴ 70

I am working with bacterial genomes and I have a long list of genes, from different organisms.

The list format is replicon accessions (e.g. NC12345) and locus tag (e.g. Abc1234). I have no other identifiers for the gene.

Most of the genes are coding genes, but not all. I would like to get the coding sequence for each gene that has one.

how can I do that (preferably in Perl)?

Best, Dave

perl sequence • 7.9k views
ADD COMMENT
10
Entering edit mode
14.2 years ago
Neilfws 49k

You have 2 problems to solve here: fetching the sequences and parsing them to extract CDS.

To deal with the second problem: install Bioperl, if you have not already done so. Then, take a look at the SeqIO how-to. If you installed the accessory scripts, there's a handy utility named bp_extract_feature_seq, which you can run like this:

bp_extract_feature_seq -i NC_005213.gb --format genbank --feature=CDS -o NC_005213.fa

It will write a fasta file containing the coding sequences of all CDS features.

You'll want to automate the process of fetching sequences by looping through the replicon accessions. Here's some sample code from the Bioperl tutorial which will fetch a sequence from RefSeq and write it in GenBank format:

#!/usr/bin/perl

use strict;
use Bio::Perl;

my $seq_object = get_sequence('refseq', "NC_005213");
write_sequence(">NC_005213.gb", 'genbank', $seq_object);

It should not be too hard to write a loop into that, using the NC_* accessions from your file.

ADD COMMENT
0
Entering edit mode

+1 Thanks! Just two more things:

  1. Is there an equivalent perl function to bp_extract_feature_seq? I would like to this all from a perl script, and it would be nicer to call a function instead of using system("bp_extract_feature_seq",..)
  2. How do I get the CDS for a specific locus tag (not the entire CDSs in the replicon)?
ADD REPLY
0
Entering edit mode

I would take a look at the code in bp_extract_feature_seq - it's just a perl script. I only mentioned it for the convenience, but it should be easy to build something around SeqIO, if you study the how-to.

ADD REPLY
5
Entering edit mode
14.2 years ago

An addendum to Jorge Amigo's answer. Use BioMart as he describes. Additionally, set the gene type to be 'protein_coding'. Then click the 'Perl' button on the top right. Edit the script that it returns to include a loop over your set of organisms. e.g.

$query->setDataset("<my organism>");
$query->addFilter("refseq_peptide", ["<id 1>","<id 2>"]);
$query->addFilter("biotype", ["protein_coding"]);

You may have to normalise your IDs so that they are all of the same type (e.g. Refseq)

ADD COMMENT
0
Entering edit mode

indeed Keith. thumbs up! editing the Perl code that BioMart definitely increases even more its query capabilities. looping through the set of organisms David has like you describe should be straight-forward.

ADD REPLY
4
Entering edit mode
14.2 years ago

you can use BioMart to retrieve such information. I would select the latest version of the Ensembl Genes database (currently #59) and then select the appropriate organism. you will then only have to enter the list of IDs you have, using the "ID list limit" filter and selecting from the combo box the type of IDs you are going to enter for that particular organism. you will have to do it organism by organism though.

ADD COMMENT
0
Entering edit mode

arrr... this will not work for me as I have dozens of organisms.

ADD REPLY
0
Entering edit mode

I've voted Keith's answer, as I think that it'll surely solve your problems easily. good luck ;)

ADD REPLY
2
Entering edit mode
14.2 years ago

From the NCBI, you can use ESearch and/or EFetch to retrieve an XML description of your record (search biostar for some examples).

Here is an example.

Then, using XSLT/XPATH, you can extract the CDS (again, search biostar for some examples).

XPATH example:

/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='translation']/GBQualifier_value

but I'm afraid the annotation of the NCBI are not homogeneous, so the word 'translation' might not be the only one to match the CDS.

ADD COMMENT
0
Entering edit mode

+1 Thanks, it's great to know this option exists.

ADD REPLY

Login before adding your answer.

Traffic: 2597 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6