Question

Where Can I Download Human Reference Genome In Fasta Format? Hgref.Fa File

50

15.0 years ago

Biomed 5.0k

Is there a better way of downloading the human genome reference sequence in FASTA format than downloading it from the UCSC site? The BWA protocol asks for an index to be created from the human genome reference multi-FASTA, so I want to get this file. Thanks

[Edited for clarification in response to answers and comments:]

human fasta sequence bwa • 128k views

ADD COMMENT • link updated 8.5 years ago by Hajk-Georg Drost ▴ 180 • written 15.0 years ago by Biomed 5.0k

8

Please consider taking minimal effort finding the answer yourself before posting a question.

ADD REPLY • link 15.0 years ago by Michael Schubert ★ 7.1k

10

Please consider doing something more useful than posting answers like this. I just waited a minute but I feel better. Thanks

ADD REPLY • link 8.3 years ago by giominas ▴ 100

0

As a further extension to this question refer to this question.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 14.4 years ago by Higherdefender ▴ 160

0

Relevant post: How do experienced people look for full reference genomes?

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 11.1 years ago by Malachi Griffith 20k

15

15.0 years ago

Pierre Lindenbaum 166k

You can get the fasta sequences for each chromosome here (human genome build 37)

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.0 years ago by Pierre Lindenbaum 166k

3

How about this one?

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 11.9 years ago by skm770 ▴ 150

2

No, you can just concatenate those files into one single file.

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

2

I used

$ cat file1 file2 filen > hg18.mfa

to create the multi-FASTA file, but I was not sure about the ordering of chrX, chrY and chrM. My current order is chr1-22, chrX, chrY and chrM. Will this ordering have any effect downstream in the analysis? Is there a standard order that is different from this?

Thanks
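
A quick way to see which order a concatenated file actually ended up in is to print its header lines. The sketch below assumes a standard FASTA layout, where each record starts with a line beginning with `>`. `fasta_names` is a hypothetical helper name, not part of BWA or any other tool.

```shell
# List the sequence names of a FASTA file in the order they occur.
# Keeps only the first word of each header line, without the leading ">".
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Hypothetical usage on the concatenated file:
#   fasta_names hg18.mfa
```

For a file built in the order described above, this would be expected to print chr1 through chr22 followed by chrX, chrY and chrM, one name per line.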

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 15.0 years ago by Biomed 5.0k

0

Thank you for your help. I elaborated a little on your initial input.

ADD REPLY • link 15.0 years ago by Biomed 5.0k

0

The files come in a one-file-per-chromosome format, and I want to use them as one multi-FASTA file as input to BWA. Do I simply concatenate these chromosome FASTA files into one big FASTA file to get the multi-FASTA file? Or is there something else to it? Any ideas?

ADD REPLY • link 15.0 years ago by Biomed 5.0k

0

Thank you. I guess I will have more questions on this as I go, but this site and people like you are a great help.

ADD REPLY • link 15.0 years ago by Biomed 5.0k

0

"Will this ordering have any effect downstream in the analysis?" No.

ADD REPLY • link 15.0 years ago by Pierre Lindenbaum 166k

0

"Chromosome M" is the mitochondrial DNA sequence. Depending on the analysis you're doing, you may not want to include it.

ADD REPLY • link 15.0 years ago by Paulo Nuin ★ 3.7k

11

15.0 years ago

Biomed 5.0k

Using an rsync command to download the entire directory:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

This directory holds the FASTA files, one per chromosome, in gzip-compressed (.gz) format, plus other useful files for the human reference genome dataset. Original web site:

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/chromosomes/README.txt

On Unix, gunzip the files first. Then

$ cat file1.fa file2.fa etc > multifastafile.fa

will get you the human reference genome as a single multi-FASTA file ("etc" stands for the remaining chromosome files).
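
If you want the order to be explicit rather than whatever the shell hands you, a small loop can help: a glob such as chr*.fa expands in lexical order, which puts chr10 before chr2. The following is only a sketch. It assumes the downloaded files are named chr1.fa.gz through chr22.fa.gz plus chrX.fa.gz, chrY.fa.gz and chrM.fa.gz, and `build_multifasta` is a hypothetical helper name.

```shell
# Concatenate per-chromosome files into one multi-FASTA in a fixed order.
# Assumption: files are named chr<N>.fa.gz inside the directory given as $1.
build_multifasta() {
    dir=$1
    out=$2
    : > "$out"                          # start from an empty output file
    for c in $(seq 1 22) X Y M; do
        f="$dir/chr$c.fa.gz"
        if [ -f "$f" ]; then
            gunzip -c "$f" >> "$out"    # decompress to stdout and append
        else
            echo "warning: $f not found, skipping" >&2
        fi
    done
}

# Hypothetical usage:
#   build_multifasta ./chromosomes multifastafile.fa
```

Missing files are skipped with a warning, so leaving chrM out of the directory is enough to exclude it from the result.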

Also see this discussion about this very same topic.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.0 years ago by Biomed 5.0k

7

14.6 years ago

Jonathan Manning ▴ 630

Just for the record (since I'm always searching for these links myself)...

This is the canonical source for GRCh37, which hg19 is based upon (and should be identical to).

1000 Genomes also has a pre-concatenated multi-fasta reference suitable for use with most next-gen aligners out of the box here.

This file does have an "alternate" chrM, and includes all the "random" contigs. There's a README explaining the method of construction in that folder. YMMV.

For those in Europe (they now have a US mirror, too), try Ensembl for a local snapshot of the reference assembly.

So you can anticipate the download time and storage space required, the total size for each of these variations is ~3GB uncompressed, ~750MB compressed.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 14.6 years ago by Jonathan Manning ▴ 630

2

12.4 years ago

Tulip Nandu ▴ 90

I would recommend downloading from the Ensembl database. Here is the link: http://www.ensembl.org/info/data/ftp/index.html

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 12.4 years ago by Tulip Nandu ▴ 90

1

8.5 years ago

Hajk-Georg Drost ▴ 180

I know that this question is already 6 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve the human reference genome from several database sources one can simply type:

# download human reference genome from NCBI RefSeq
biomartr::getGenome(db  = "refseq", organism = "Homo sapiens")

or

# download human reference genome from NCBI Genbank
biomartr::getGenome(db  = "genbank", organism = "Homo sapiens")

or

# download human reference genome from ENSEMBL
biomartr::getGenome(db  = "ensembl", organism = "Homo sapiens")

This way, users can use the same command to retrieve reference genomes from different databases. Each database has its own custom gene identifier and thus, it should always be clear which reference genome has been used to perform subsequent analyses.

For more detailed information please consult the Genomic Sequence Retrieval vignette.

The getGenome() function will then generate a log file that stores the following information:

File Name: Homo_sapiens_genomic_refseq.fna.gz

Organism Name: Homo_sapiens

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz

Download_Date: Sat Oct 22 12:41:07 2016

refseq_category: reference

genome assembly_accession: GCF_000001405.35

bioproject: PRJNA168

biosample: NA

taxid: 9606

infraspecific_name: NA

version_status: latest

release_type: Patch

genome_rep: Full

seq_rel_date: 2016-09-26

submitter: Genome Reference Consortium

Thus, you will always know with which reference genome and with which genome version you are working.

I hope that this will help to improve the reproducibility of many studies.

Alternatively, the biomartr package also provides functions for retrieving corresponding coding sequence - getCDS(), protein sequence - getProteome(), and annotation files - getGFF().

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 8.5 years ago by Hajk-Georg Drost ▴ 180

Ram · Accepted Answer · 2010-12-16

25

14.6 years ago

lh3 33k

The version used by the 1000 Genomes Project is recommended. The mitochondrial genome in the g1k version is the most widely used one, rCRS. The chromosomes and contigs are already concatenated, so you are less likely to make mistakes (people frequently concatenate all sequences, including different haplotypes from the same region).

We have seen a lot of complications caused by different chromosome names (chr1 vs. 1) or different ordering (chr2 before chr10 or after). It is true that which b37 version to use does not matter too much, but converging on something close to a standard would save everyone a lot of unnecessary work.
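
To illustrate the naming issue, here is a minimal sketch that rewrites UCSC-style headers (">chr1") to bare names (">1"). `strip_chr_prefix` is a hypothetical helper, and the sketch only touches header lines: it does not reorder records or change any sequence. Note that it turns chrM into M, whereas the g1k file, as far as I know, names the mitochondrion MT, so that one name would still need separate handling.

```shell
# Rewrite UCSC-style header names (">chr1") to bare names (">1").
# Only lines starting with ">chr" are changed; sequence lines pass through.
strip_chr_prefix() {
    sed 's/^>chr/>/' "$1"
}

# Hypothetical usage:
#   strip_chr_prefix hg19.fa > hg19.nochr.fa
```

Renaming headers makes two files look compatible, but it does not make their sequences identical, so it is no substitute for agreeing on one reference file.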

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 14.6 years ago by lh3 33k

3

Using the g1k version is highly recommended.

ADD REPLY • link 14.2 years ago by lh3 33k

1

The random and Un contigs are already in the g1k version. Usually you would not want to map to the haplotypes, as you will lose most of the variants.

ADD REPLY • link 14.2 years ago by lh3 33k

0

I'm very interested in this opinion, since we have moved from the hg18 reference that came with our SOLiD sequencer to the hg19 we manually built by concatenating all chromosome chunks from UCSC.

Did I understand you right that we would be less error-prone if we used the g1k reference genome rather than UCSC's? The main problem we see is how to deal efficiently with chrUn, random and haplotype sequences. The chrUn contigs should definitely be kept, but are you saying that random chunks and/or different haplotypes shouldn't be concatenated into that single multi-FASTA file?

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 14.2 years ago by Jorge Amigo 14k

0

I ask because if you download the single hg19 file from UCSC and convert it to FASTA using twoBitToFa, you end up with a multi-FASTA file containing all chromosomes, including the haplotypes, random and chrUn sequences. Since g1k seems to include only the latter unmapped supercontigs, is there any reason or recommendation to leave the rest of the files aside?
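
For reference, dropping the haplotype records from such a converted file could be sketched as below. This assumes that the alternate haplotypes in hg19 are the records whose names end in `_hap` plus a number (for example chr6_apd_hap1), which is a naming assumption worth checking against the actual headers. `drop_haplotypes` is a hypothetical helper name.

```shell
# Print a FASTA file without the records whose name ends in "_hap<N>".
# Assumption: this suffix marks the alternate haplotype sequences.
drop_haplotypes() {
    awk '
        /^>/ { keep = ($1 !~ /_hap[0-9]+$/) }   # decide once per record
        keep                                     # print lines of kept records
    ' "$1"
}

# Hypothetical usage after converting the UCSC 2bit file:
#   twoBitToFa hg19.2bit hg19.fa
#   drop_haplotypes hg19.fa > hg19.nohap.fa
```

The random and chrUn records do not match the pattern, so they stay in the output.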

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 14.2 years ago by Jorge Amigo 14k

0

thanks a lot for the advice

ADD REPLY • link 14.2 years ago by Jorge Amigo 14k