Question

Annotated files that contain the sequence of genes and CDSs?

8.7 years ago

unksci ▴ 180

I was able to download files annotated with gene sequences through NCBI's web API, but I could not find the same files on their ftp server.

Similarly, the assemblies under ftp.ncbi.nlm.nih.gov/genomes/refseq appear to be annotated only with protein sequences (in contrast to the rna.gbk.gz files, which exist for /genomes/SOME_SELECTED_MODEL_ORGANISMS).

Do precomputed annotated files with the sequence of genes and CDSs exist, or can they be regenerated from individual files on NCBI's ftp server?

(For me, batch querying of sequences through biomart leads to time-outs. I have been happily parsing sequence-annotated GenBank files with Biopython, but I want to extend this to more species and therefore avoid manual downloads.)

sequences ncbi cds gene • 1.9k views

updated 8.7 years ago by igor 13k • written 8.7 years ago by unksci ▴ 180

score 0 · Answer 1 · 2016-09-03

UCSC has RefSeq mRNA sequences from GenBank at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/refMrna.fa.gz (change hg19 to whatever genome you are interested in).
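Since the question is about avoiding manual downloads, the "change hg19" step can be scripted. The following is only a minimal sketch: the helper names `ref_mrna_url` and `download_ref_mrna` are made up for illustration, the URL pattern is simply the one quoted above with the assembly name swapped, and it is an assumption that a given assembly (for example `mm10`) actually provides a `refMrna.fa.gz` file.

```python
import urllib.request

# URL pattern taken from the link above, with the assembly name as a placeholder.
UCSC_TEMPLATE = (
    "http://hgdownload.cse.ucsc.edu/goldenPath/{assembly}/bigZips/refMrna.fa.gz"
)


def ref_mrna_url(assembly):
    """Build the refMrna.fa.gz URL for a UCSC assembly name such as 'hg19'."""
    if not assembly or "/" in assembly:
        raise ValueError(f"not a plausible assembly name: {assembly!r}")
    return UCSC_TEMPLATE.format(assembly=assembly)


def download_ref_mrna(assembly, dest=None):
    """Download the file for one assembly and return the local path.

    Raises urllib.error.HTTPError if the file does not exist for that assembly.
    """
    dest = dest or f"{assembly}.refMrna.fa.gz"
    urllib.request.urlretrieve(ref_mrna_url(assembly), dest)
    return dest


if __name__ == "__main__":
    # Only prints the URLs; call download_ref_mrna() to actually fetch them.
    for name in ("hg19", "mm10"):
        print(ref_mrna_url(name))
```

Looping `download_ref_mrna` over a list of assembly names would then replace the manual downloads, as long as each assembly has the file.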

You can also use bedtools if you have gene coordinates:

bedtools getfasta -fi genome.fa -bed genes.gtf -fo out.fa

The -bed parameter can actually take BED/GFF/VCF files. See http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html
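If you would rather stay in Python, the idea behind the command above can be sketched in a few lines: read the genome FASTA, then slice out each annotated interval. This is not how bedtools is implemented, just an illustration under some assumptions: the GTF/GFF coordinates are 1-based and inclusive, the whole genome fits in memory, and minus-strand features are reverse-complemented (which, as far as I know, bedtools only does when you pass -s). The function names and the header labels are made up for this example and do not match bedtools' output format.

```python
import io

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Parse FASTA text into a dict of {sequence_id: sequence}."""
    sequences, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            # Use the first word of the header as the sequence id.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def extract_features(genome, gtf_handle, feature_type="gene"):
    """Yield (label, sequence) for each GTF/GFF line of the given feature type."""
    for line in gtf_handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, _source, ftype, start, end, _score, strand = fields[:7]
        if ftype != feature_type:
            continue
        start, end = int(start), int(end)
        # GTF is 1-based inclusive, Python slices are 0-based half-open.
        seq = genome[chrom][start - 1:end]
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        yield f"{chrom}:{start}-{end}({strand})", seq


if __name__ == "__main__":
    genome = read_fasta(io.StringIO(">chr1 toy example\nACGTACGTAA\n"))
    gtf = io.StringIO(
        'chr1\tdemo\tgene\t2\t5\t.\t+\t.\tgene_id "g1";\n'
        'chr1\tdemo\tgene\t7\t10\t.\t-\t.\tgene_id "g2";\n'
    )
    for label, seq in extract_features(genome, gtf):
        print(f">{label}\n{seq}")
```

Note that each annotation line is extracted separately, so a CDS split over several exons comes out as several pieces; joining them per transcript would be an extra step.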