Which one of the following is a good source to download a reference genome to be used for RNA-seq analysis?
- UCSC
- NCBI/GRC
- iGenomes
- Ensembl
What are the things to be kept in mind while downloading one?
Selecting a reference genome is not easy, because you are also choosing an ecosystem of annotations and additional information. First, start with this paper by Zhao & Zhang.
Personally, I prefer UCSC for human, just because of the ENCODE annotations. For other species I prefer Ensembl, because it is the easiest one to use (one page with all downloads, including .fa, .gtf and .gff, and an easy-to-use data warehouse, BioMart).
No matter which source you choose, try genomepy to download your genomes. It will include chromosome sizes, a BED file with gaps and, optionally, gene annotation. It works for Ensembl, UCSC and NCBI. Automated, scriptable and reproducible!
As the reference genome comes from the GRC, it should not matter where you get your genome from. I assume you are working with human. What I do is the following: be sure to download the entire genome, that is the primary chromosomes plus the unplaced and random contigs, but exclude the alternative haplotypes for standard analysis. In the case of human hg38, download hg38.fa.gz and the chrom.sizes file from here, decompress, index with samtools faidx, and then use this command to get your final reference genome.
grep -v '_alt' hg38.chrom.sizes | cut -f1 | xargs samtools faidx hg38.fa > hg38_noALT.fa
This will exclude the alternative haplotypes. From there on, index the fasta with the downstream tool of choice.
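To see which sequence names the filter keeps before running it on the real files, here is a minimal sketch on a made-up chrom.sizes file. The file name `toy.chrom.sizes`, the sizes and the choice of contigs are assumptions for illustration only. It relies on the chrom.sizes file being two tab-separated columns (name, size), so only the first column is kept as the list of names.

```shell
# Toy two-column chrom.sizes file (name <TAB> size); all entries are illustrative.
tmpdir=$(mktemp -d)
printf 'chr1\t1000\nchr1_KI270706v1_random\t200\nchrUn_KI270302v1\t150\nchr1_KI270762v1_alt\t300\n' \
  > "$tmpdir/toy.chrom.sizes"

# Drop the alternative haplotypes and keep only the name column.
kept=$(grep -v '_alt' "$tmpdir/toy.chrom.sizes" | cut -f1)
echo "$kept"

# Count how many sequences the filter excludes.
n_alt=$(grep -c '_alt' "$tmpdir/toy.chrom.sizes")
echo "excluded: $n_alt"
```

The primary chromosome, the random contig and the unplaced contig survive, and only the `_alt` entry is dropped. The same check on the real hg38.chrom.sizes gives you the list of names that will be passed on for extraction.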
Yes, the annotations can differ, and the choice can affect the outcome. Have a look here. Which one is better is an ongoing discussion, probably with the usual answer: "it depends on your task". I use GENCODE, simply because some genes I was interested in were not included in RefSeq.
FYI, Ensembl has ENCODE data, including GENCODE annotations. The advantage of Ensembl over other resources is that the data is better organized and integrated, and the combination of a local MySQL database and the Perl API is very powerful.