Hi All,
I have a list of AGI loci and want to get their gene structures in GenBank or EMBL format. Since TAIR only provides GFF3, I am looking for either a way to convert GFF3 to GenBank/EMBL or a way to get the NCBI accession numbers for those AGI loci. I found one file under TAIR, ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_NCBI_GENEID_mapping, but it does not seem to be completely correct (or I didn't understand it completely).
The file that you describe contains 2 columns; the first is the NCBI Entrez Gene database ID and the second is the TAIR locus tag. The Gene ID is not the same as a GenBank accession number, but it will get you there.
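As a rough sketch of how that file could be read, assuming it is plain whitespace-separated text with the Gene ID in column 1 and the locus tag in column 2 (the function name `load_geneid_mapping` is made up for illustration, and the sample line reuses the AT2G01175 example from this thread):

```python
def load_geneid_mapping(lines):
    """Return {TAIR locus tag: Entrez Gene ID} from two-column lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines and anything whose first column is not a numeric Gene ID
        # (for example a header line, if the file has one).
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        gene_id, locus = parts[0], parts[1].upper()
        mapping[locus] = gene_id
    return mapping


# In practice: with open("TAIR9_NCBI_GENEID_mapping") as fh: mapping = load_geneid_mapping(fh)
sample_lines = ["2745418\tAT2G01175\n", "\n"]
mapping = load_geneid_mapping(sample_lines)
wanted = ["AT2G01175"]
gene_ids = [mapping[locus] for locus in wanted if locus in mapping]
```

The resulting `gene_ids` list is what you would paste or upload in the BioMart filter step.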
There may well be a file, at the Arabidopsis site or elsewhere, which links Gene ID to GenBank accession. If not, you can use BioMart, something like this:
Click MARTVIEW (top menu)
Choose "ENSEMBL PLANT 6 (EBI UK)" as database
Choose "Arabidopsis thaliana genes (TAIR9)" as dataset
Click "Filters" (left menu); expand GENE; check ID list limit and choose "Entrez Gene ID(s)"
Either paste or upload Gene IDs (column 1 in your file)
Click "Attributes" (left menu); expand EXTERNAL; check "RefSeq DNA ID"
Click "Results" (top left menu)
After some time, this should return results that you can download as plain ASCII text. For example, using Gene ID 2745418 (AT2G01175), I get back "NM_201659".
You can now take your new list of accessions off to Batch Entrez, upload them and retrieve the results in GenBank format.
This is just one solution (relying on both BioMart and Batch Entrez working well); there are plenty of other potential ways to convert between IDs, including programmatic methods.
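One programmatic route, sketched here with only the Python standard library, is to ask NCBI's E-utilities `efetch` endpoint for the GenBank flat files directly. The helper names `efetch_url` and `fetch_genbank` are made up for this example, and the choice of `db=nuccore` assumes the accessions are nucleotide records such as the RefSeq IDs returned above. NCBI asks users to keep request rates low, so treat this as a starting point rather than a finished tool.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="gb"):
    """Build an efetch URL that requests GenBank flat files as plain text."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_genbank(accessions):
    """Download the records (needs network access; not called below)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


url = efetch_url(["NM_201659", "NM_128805"])
```

Calling `fetch_genbank(["NM_201659"])` should return the same text you would get from Batch Entrez for that accession.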
To make your question a bit more general: what you are asking for is a way to make a GenBank (or EMBL) file from a GFF file and its associated FASTA sequence file. Solutions to that can be found here
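To give an idea of what such a conversion involves, here is a minimal sketch that turns GFF3 rows into GenBank-style feature table lines using only the Python standard library. It is not a full converter: it writes no LOCUS header or sequence, it keeps the GFF3 coordinates as they are (a real record would need them relative to the sequence in the record), and the function name, the small type mapping and the sample coordinates are all invented for illustration.

```python
# Assumed mapping from a few GFF3 feature types to GenBank feature keys.
GFF_TO_GENBANK_KEY = {"five_prime_UTR": "5'UTR", "three_prime_UTR": "3'UTR"}


def gff3_to_feature_lines(gff_lines, seqid):
    """Return GenBank feature table lines for the GFF3 rows on one sequence."""
    out = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[0] != seqid:
            continue
        ftype, start, end, strand, attrs = cols[2], cols[3], cols[4], cols[6], cols[8]
        key = GFF_TO_GENBANK_KEY.get(ftype, ftype)
        location = f"{start}..{end}"
        if strand == "-":
            location = f"complement({location})"
        # Feature key starts in column 6, location in column 22.
        out.append("     " + key.ljust(16) + location)
        attributes = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        label = attributes.get("Parent") or attributes.get("ID")
        if label:
            out.append(" " * 21 + f'/note="{label}"')
    return out


# Made-up coordinates, for illustration only.
sample_gff = [
    "##gff-version 3\n",
    "Chr2\tTAIR9\texon\t100\t200\t.\t-\t.\tParent=AT2G32460.1\n",
]
feature_lines = gff3_to_feature_lines(sample_gff, "Chr2")
```

Multi-exon features such as a CDS would additionally need their segments merged into a single `join(...)` location, which this sketch leaves out.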
Thank you very much. One thing I found strange is that the NCBI entry only contains the CDS, not intron/exon information (e.g. AT2G32460, accession NM_128805, has 3 exons, but they are not mentioned in the .gb file). Why is that? Is it because I am downloading from the nucleotide database? I want all features of the genes. Any suggestions?
That is strange. Looks like the record has the complete (i.e. includes non-coding) mRNA, but no "parts". I am not sure why.
I guess it is not mandatory to have exon, UTR and similar features in GenBank format. That would be a reason. It is nice to know the BioMart way, but I think converting GFF to GenBank programmatically is the only solution for me.