Annotated files that contain the sequence of genes and CDSs?
8.2 years ago
unksci ▴ 180

I could download files annotated with gene sequences through NCBI's web API, but I could not find them on their FTP server.

Similarly, the assemblies in ftp.ncbi.nlm.nih.gov/genomes/refseq appear to be annotated only with protein sequences (in contrast to the rna.gbk.gz files, which exist for /genomes/SOME_SELECTED_MODEL_ORGANISMS).

Do precomputed annotated files with the sequence of genes and CDSs exist, or can they be regenerated from individual files on NCBI's ftp server?

(For me, batch-querying sequences through BioMart leads to time-outs. I have been happily parsing sequence-annotated GenBank files with Biopython, but I want to extend to more species and therefore avoid manual downloads.)

sequences ncbi cds gene • 1.8k views
8.2 years ago
igor 13k

UCSC has RefSeq mRNA sequences from GenBank at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/refMrna.fa.gz (change hg19 to whatever genome you are interested in).

You can also use bedtools if you have gene coordinates:

bedtools getfasta -fi genome.fa -bed genes.gtf -fo out.fa

The -bed parameter can actually take BED/GFF/VCF files. See http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html
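
If the goal is one spliced CDS per transcript rather than one FASTA record per interval, the pieces need to be joined after extraction. Below is a minimal sketch of that step in plain Python, not how bedtools does it internally. It assumes a GTF whose ninth column carries a transcript_id "..." attribute and a genome small enough to hold in memory. The function names (read_fasta, reverse_complement, spliced_sequences) and the demo data are made up for illustration.

```python
import io
import re
from collections import defaultdict

# Complement table for reverse-complementing minus-strand features.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Read FASTA text into a dict of {sequence_id: sequence}."""
    sequences, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence id.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequences(genome, gtf_handle, feature_type="CDS"):
    """Join all `feature_type` segments of each transcript into one sequence."""
    segments = defaultdict(list)
    for line in gtf_handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        segments[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    result = {}
    for transcript_id, parts in segments.items():
        parts.sort(key=lambda part: part[1])  # order segments by genomic start
        # GTF is 1-based and end-inclusive; Python slices are 0-based, end-exclusive.
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in parts)
        if parts[0][3] == "-":
            seq = reverse_complement(seq)
        result[transcript_id] = seq
    return result


# Tiny in-memory demo; real use would pass open file handles instead.
demo_genome = read_fasta(io.StringIO(">chr1 demo\nAAATGCCCTT\nTGGGTAAAA\n"))
demo_gtf = io.StringIO(
    'chr1\tdemo\tCDS\t12\t17\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\tdemo\tCDS\t3\t8\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\tdemo\texon\t1\t19\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\tdemo\tCDS\t7\t11\t.\t-\t0\tgene_id "g2"; transcript_id "t2";\n'
)
demo_cds = spliced_sequences(demo_genome, demo_gtf)
for transcript_id, seq in sorted(demo_cds.items()):
    print(f">{transcript_id}\n{seq}")
```

Passing feature_type="exon" instead of the default should give spliced transcript sequences rather than coding sequences. For large genomes, an indexed FASTA reader would be a better fit than loading everything into a dict.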
