Question

Convert AGI Locus to GenBank or EMBL Format

0

14.4 years ago

Gvj ▴ 470

Hi all, I have a list of AGI loci and want to get their gene structures in GenBank or EMBL format. Since TAIR only provides GFF3, I want a method either to convert GFF3 to GenBank/EMBL or to get the NCBI accession numbers of those AGI loci. I found one file, ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_NCBI_GENEID_mapping, under TAIR, but it is not completely correct (or I didn't understand it completely).

format conversion genbank gff • 7.9k views

ADD COMMENT • link updated 14.4 years ago by Ladan • 0 • written 14.4 years ago by Gvj ▴ 470

score 2 · Answer 1 · 2010-11-07

2

14.4 years ago

Neilfws 49k

The file that you describe contains two columns: the first is the NCBI Entrez Gene database ID and the second is the TAIR locus tag. The Gene ID is not the same as a GenBank accession number, but it will get you there.
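As a rough sketch of how that file could be used programmatically, the snippet below builds a locus-to-Gene-ID lookup. It assumes the file is tab-separated with the Gene ID in column 1 and the locus tag in column 2, as described above; the function name `load_locus_to_gene_ids` is made up for illustration, and the handling of header or comment lines is a guess about the file layout.

```python
import csv
from collections import defaultdict


def load_locus_to_gene_ids(lines):
    """Map each AGI locus (column 2) to its Entrez Gene IDs (column 1).

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: columns are tab-separated.
    """
    mapping = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and rows that lack both columns.
        if len(row) < 2:
            continue
        gene_id, locus = row[0].strip(), row[1].strip().upper()
        # Skip header or comment rows: a Gene ID is purely numeric.
        if not gene_id.isdigit():
            continue
        if gene_id not in mapping[locus]:
            mapping[locus].append(gene_id)
    return dict(mapping)


# Example row taken from this thread: Gene ID 2745418 <-> AT2G01175.
sample = ["2745418\tAT2G01175\n"]
print(load_locus_to_gene_ids(sample)["AT2G01175"])
```

A list is kept per locus because one locus could in principle appear with more than one Gene ID; with a real file you would pass `open("TAIR9_NCBI_GENEID_mapping")` instead of `sample`.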

There may well be a file, at the Arabidopsis site or elsewhere, which links Gene ID to GenBank accession. If not, you can use BioMart, something like this:

Click MARTVIEW (top menu)
Choose "ENSEMBL PLANT 6 (EBI UK)" as database
Choose "Arabidopsis thaliana genes (TAIR9)" as dataset
Click "Filters" (left menu); expand GENE; check ID list limit and choose "Entrez Gene ID(s)"
Either paste or upload Gene IDs (column 1 in your file)
Click "Attributes" (left menu); expand EXTERNAL; check "RefSeq DNA ID"
Click "Results" (top left menu)

After some time, this should return results that you can download as plain ASCII text. For example, using Gene ID 2745418 (AT2G01175), I get back "NM_201659".

You can now take your new list of accessions off to Batch Entrez, upload them and retrieve the results in GenBank format.

This is just one solution (relying on both BioMart and Batch Entrez working well); there are plenty of other potential ways to convert between IDs, including programmatic methods.
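One possible programmatic route, sketched here rather than tested against a large list, is NCBI's E-utilities `efetch` endpoint, which can return nucleotide records as GenBank flat files. The helper names `efetch_url` and `fetch_genbank` are made up for this example; the parameter values (`db=nuccore`, `rettype=gb`, `retmode=text`) are the standard ones for GenBank text output.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL asking for GenBank flat files as plain text."""
    params = {
        "db": "nuccore",              # NCBI Nucleotide database
        "id": ",".join(accessions),   # comma-separated accession list
        "rettype": "gb",              # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_genbank(accessions):
    """Download the records (needs network access; keep batches small)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


# Accession from the BioMart example above.
print(efetch_url(["NM_201659"]))
```

Calling `fetch_genbank(["NM_201659"])` should return the same flat file that Batch Entrez gives; for long lists it is polite to split them into batches and pause between requests.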

ADD COMMENT • link 14.4 years ago by Neilfws 49k

0

Thank you very much. Something I found strange is that the NCBI entry contains only the CDS, not intron/exon information (e.g. AT2G32460 (acc. no. NM_128805) has 3 exons, but they are not mentioned in the .gb file). Why is that? Is it because I am downloading from the Nucleotide database? I want all the features of the genes. Any suggestions?

ADD REPLY • link 14.4 years ago by Gvj ▴ 470

0

That is strange. Looks like the record has the complete (i.e. includes non-coding) mRNA, but no "parts". I am not sure why.

ADD REPLY • link 14.4 years ago by Neilfws 49k

0

I guess it's not mandatory to have exon, UTR and similar features in GenBank format. That would be a reason. Nice to know the BioMart way, but I think a programmatic way of converting GFF to GenBank is the only solution for me.

ADD REPLY • link 14.4 years ago by Gvj ▴ 470

Ram · Answer 2 · 2010-11-07

0

14.4 years ago

Lars Juhl Jensen 11k

To make your question a bit more general, what you are asking for is a way to make a Genbank (or EMBL) file based on a GFF file and its associated FASTA sequence file. Solutions to that can be found here
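To give an idea of what such a conversion involves, here is a minimal sketch of one core step: turning the exon rows of a GFF3 file into GenBank/EMBL-style feature locations such as `join(100..200,300..400)`. It is not the linked solution and not a full converter (it writes no sequence and no record header); the function name `exon_locations` and the sample coordinates are invented for illustration.

```python
from collections import defaultdict


def exon_locations(gff3_lines):
    """Return {parent_id: location string} built from GFF3 exon rows.

    GFF3 and GenBank both use 1-based inclusive coordinates, so the
    start/end values can be copied over unchanged.
    """
    exons = defaultdict(list)
    strands = {}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 holds key=value attributes separated by ';'.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # An exon may list several parent transcripts, separated by ','.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4])))
                strands[parent] = cols[6]

    locations = {}
    for parent, spans in exons.items():
        parts = ",".join(f"{start}..{end}" for start, end in sorted(spans))
        location = f"join({parts})" if len(spans) > 1 else parts
        if strands[parent] == "-":
            location = f"complement({location})"
        locations[parent] = location
    return locations


# Hypothetical rows, not real TAIR annotations.
rows = [
    "Chr1\tsrc\texon\t100\t200\t.\t+\t.\tParent=tx1\n",
    "Chr1\tsrc\texon\t300\t400\t.\t+\t.\tParent=tx1\n",
]
print(exon_locations(rows))
```

The coordinates stay relative to the chromosome, so these locations only make sense in a record that carries the whole chromosome sequence; for per-gene records they would need to be shifted to the start of the extracted region.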

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.4 years ago by Lars Juhl Jensen 11k

0

That script only converts GFF3 files that don't specify UTRs explicitly.

ADD REPLY • link 14.4 years ago by Gvj ▴ 470

score 0 · Answer 3 · 2010-11-16

0

14.4 years ago

Ladan • 0

How can I find the GeneID or locus_tag of genes? Sincerely yours, Laleh and Ladan