Question

How do experienced people look for full reference genomes?

16

10.9 years ago

John Smith ▴ 320

I am new to bioinformatics. For the sake of this question, suppose that I am looking for the full genome sequences of Escherichia coli and Saccharomyces cerevisiae in FASTA format. I thought that these would be fairly easy to find online.

For Escherichia coli I was only able to find some links to the NCBI website, but I could not find anywhere on the site a link to download the genome in FASTA format. For Saccharomyces cerevisiae, I found this FTP site linked by Ensembl with the sequence for each chromosome. I suppose that I could concatenate all the files in order to obtain the full genome in one file, but isn't there already one file with the entire sequence? I saw the toplevel files, but those only appear for masked and soft-masked DNA. I think it would be workable to use those if modern sequencers masked the DNA automatically. Do they?

So, in summary, how do experienced people find full reference genomes for those two organisms and possibly other common species? I have only been using Google and the only page that appears to provide FASTA files for some species is Ensembl.

reference-genome fasta • 11k views

updated 2.1 years ago by Ram 45k • written 10.9 years ago by John Smith ▴ 320

2

10.9 years ago

xb ▴ 420

You can start with either NCBI or Ensembl.

Escherichia coli (*.fna files in subfolders hold the nucleotide sequence; *.faa files hold the protein sequences)

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/

Saccharomyces cerevisiae

ftp://ftp.ensembl.org/pub/release-75/fasta/saccharomyces_cerevisiae/dna/
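If a directory like this one only offers one file per chromosome, joining them into a single FASTA is straightforward. The following is a minimal sketch, assuming the per-chromosome files have already been downloaded and decompressed; the function name `concatenate_fasta` and the file names in the usage note are hypothetical.

```python
def concatenate_fasta(parts, out_path):
    """Write the FASTA files in `parts` to `out_path`, in the order given.

    Assumes plain-text (already decompressed) FASTA files. A newline is
    added after any file that lacks a trailing one, so that the next
    header line does not get glued to the previous sequence.
    """
    with open(out_path, "w") as out:
        for part in parts:
            with open(part) as src:
                last = "\n"
                for line in src:
                    out.write(line)
                    last = line
                if not last.endswith("\n"):
                    out.write("\n")
```

For example, `concatenate_fasta(["chrI.fa", "chrII.fa"], "genome.fa")` would write both records into `genome.fa`. The order of `parts` decides the order of chromosomes in the output, so pass the list explicitly rather than relying on alphabetical sorting.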

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by xb ▴ 420

Ram · Accepted Answer · 2014-06-12

In other words, where do I find and download reference genome sequences in fasta format and obtain additional reference genome files and information for my species of interest?

There are many possible sources. Each has its own merits.

Each model species generally has an active community for sequencing, assembly, annotation, and analysis. Often this community forms a consortium of some kind. For many species there will be a website (or more than one) dedicated to genome activities for that species. For example, for yeast you have the Saccharomyces Genome Database, whose genome download section includes the genome in FASTA format. Many different centers are involved in sequencing, assembling, and annotating various genomes, and the associated websites vary considerably in their format, layout, information provided, support, etc.
UCSC annotates the genomes of many species and provides downloads of those genomes. For yeast, for example, you would use UCSC's downloads page for that species. There will often be naming discrepancies between the consortia that created the genome assembly and the centers like UCSC that help to organize gene annotations of these genomes. The UCSC FAQ on Assembly Releases and Versions does a great job of laying out the naming conventions and discrepancies. If you are working with multiple species and all of them have been annotated by UCSC, going to them is a nice option because they make the assemblies available in a consistent and organized fashion. In addition to downloading files, you can perform complex queries of UCSC genome data using their Table Browser or MySQL Server.
Ensembl also annotates the genomes of many species and provides a list of available species. Their FTP download site provides genome FASTA files and many other things for each species. Note that Fungi and Plants are available in different branches of Ensembl. The FASTA file for yeast can be found on that FTP site. Again, an advantage of using Ensembl where possible over the websites of individual species sequencing projects is that they provide a common interface with consistent file formats, etc. Ensembl also has a powerful API for programmatic access to a wealth of data.
NCBI is yet another central repository for genome data. They have an FTP site that allows you to download the complete genome for many organisms.

To summarize, the advantage of using UCSC, Ensembl or NCBI is a common interface and therefore the opportunity to automate updating or processing of multiple species. However, you may find that the most up-to-date assembly and gene annotations are sometimes available through individual model organism consortia websites.
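As a rough illustration of the automation point, here is a minimal sketch that builds the Ensembl FTP directory for a species and release and filters a file listing for "toplevel" files. The directory layout is taken from the release-75 yeast URL quoted elsewhere in this thread. The `.toplevel.fa.gz` suffix and all function names are assumptions for illustration, so check the actual directory listing before relying on them.

```python
from ftplib import FTP

ENSEMBL_HOST = "ftp.ensembl.org"


def ensembl_dna_dir(species, release):
    """Return the FTP directory holding DNA FASTA files for a species.

    Layout assumed from ftp://ftp.ensembl.org/pub/release-75/fasta/
    saccharomyces_cerevisiae/dna/ ; other releases or Ensembl branches
    may be organized differently.
    """
    name = species.strip().lower().replace(" ", "_")
    return f"/pub/release-{release}/fasta/{name}/dna/"


def pick_toplevel(filenames, suffix=".toplevel.fa.gz"):
    """Keep only file names ending in `suffix` (an assumed naming pattern)."""
    return [name for name in filenames if name.endswith(suffix)]


def list_dna_files(species, release):
    """List the DNA directory over anonymous FTP (needs network access)."""
    with FTP(ENSEMBL_HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(ensembl_dna_dir(species, release))
        return ftp.nlst()
```

Calling `pick_toplevel(list_dna_files("Saccharomyces cerevisiae", 75))` would then return the candidate whole-genome files for that release, and the same two calls can be looped over a list of species.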

score 5 · Accepted Answer · 2014-06-12

The first place to check is Ensembl. They have separate pages for metazoa, fungi, etc. as well as the normal vertebrate page (which also contains S. cerevisiae). Each of the sites has a download page (click on "Downloads" at the top and then "Download data via FTP" on the right), where you can easily find FASTA/GTF/etc. files.

BTW, a soft-masked genome is the genome. Soft-masking just means that predicted repeat regions are lower-case.
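To make the soft-masking point concrete, here is a small sketch that reads FASTA records, reports what fraction of a sequence is lower-case (i.e. soft-masked), and recovers the unmasked sequence by upper-casing it. The function names are made up for illustration, and the parser is deliberately minimal.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(seq):
    """Fraction of bases written in lower case (soft-masked)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def unmask(seq):
    """Undo soft-masking: the bases are unchanged, only the case differs."""
    return seq.upper()
```

Note that this only works for soft-masked files. In a hard-masked file the repeat regions have been replaced with N, so the original bases cannot be recovered.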

A second BTW is that you'll usually want the "toplevel" file.

score 5 · Accepted Answer · 2014-06-12

The short answer is:

this is a rare case where simply "Googling" will not help you very much; Google works using "popularity" and most people are not interested in "genome sequences fasta format" :)
"experienced people" know that there are a few key sites where genomic data can be found, as covered in previous answers - NCBI, Ensembl, UCSC, JGI
and then it is simply a case of exploring those sites and bookmarking the most useful areas for future reference - voilà, you are now experienced