I mainly do research on farm animals, but I am currently working on a comparative analysis that includes ENCODE data from mouse and human. I was expecting the genome files for these assemblies to be similar to the farm animal genomes, i.e. I expected the roughly 3 Gb human genome to be about a 3 GB FASTA file. However, it is 54 GB. Similarly, the mouse assembly is 12 GB. It seems like this is due to patches, which add alternate sequences to the assembly. Is that right?
This is causing me to doubt a lot of what I'm doing currently. Will the same analysis pipeline I've been using for farm animals be suitable, or do I need to do something special to account for these patches? How similar are some of these alternate sequences? Will I need to deal with multi-mapped reads differently? Can I download genome assemblies without all these extra sequences, and how bad of an idea is that?
Also, a more technical question: Because of the size of the human assembly I've been having trouble getting bwa to index the genome in a reasonable amount of time. Is there somewhere I can download these index files?
Where did you download the genome files from?
Even with haplotypes etc., that should not be the case. The top-level genome file for human is about 1 GB compressed (from Ensembl). You can use Illumina's iGenomes site to download matched sequence, annotation, and index bundles.
I downloaded them from Ensembl. Uncompressed it becomes 54 GB. Compression is very efficient because of so much repetition of a small alphabet.
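One way to check where the 54 GB actually goes is to tally, per sequence, how many bases there are and how many of them are N. Below is a minimal Python sketch, assuming a plain-text FASTA stream; `fasta_stats` is a hypothetical helper name (not from any library) and the demo records are made up for illustration.

```python
import io
from typing import Dict, Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Return {record_name: (total_bases, n_bases)} for a FASTA stream."""
    stats: Dict[str, Tuple[int, int]] = {}
    name = None
    total = n_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the record we just finished before starting a new one.
            if name is not None:
                stats[name] = (total, n_count)
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word of the header
            total = n_count = 0
        else:
            total += len(line)
            n_count += line.count("N") + line.count("n")
    if name is not None:
        stats[name] = (total, n_count)
    return stats


if __name__ == "__main__":
    # Toy input: one ordinary record and one that is mostly N padding.
    demo = io.StringIO(
        ">1 toy chromosome\nACGTACGT\nACGT\n"
        ">toy_patch\nNNNNNNNN\nNNACGTNN\n"
    )
    for rec, (total, n) in fasta_stats(demo).items():
        print(f"{rec}\t{total}\t{n}")
```

On a real file you would pass an open file handle instead of the toy string, e.g. `fasta_stats(open(path))`, or `gzip.open(path, "rt")` for a compressed download. Records that are almost entirely N would point to padding rather than real extra sequence.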
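Use the primary assembly unless you specifically need the patches and alternate haplotypes. Ensembl distributes it as a separate `primary_assembly` FASTA alongside the `toplevel` one.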
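If a ready-made primary assembly file is not available for some assembly, the same effect can be approximated by dropping the unwanted records from the full FASTA. This is only a rough sketch: `filter_fasta` is a hypothetical helper, and the rule that haplotype/patch records have names starting with `CHR_` is an assumption about the naming in the file at hand, so check the headers before relying on it.

```python
from typing import Callable, Iterable, Iterator


def filter_fasta(lines: Iterable[str], keep: Callable[[str], bool]) -> Iterator[str]:
    """Yield only the lines of records whose name satisfies keep()."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on the first word of the header.
            fields = line[1:].split()
            keeping = keep(fields[0] if fields else "")
        if keeping:
            yield line


def is_primary(name: str) -> bool:
    # ASSUMPTION: alternate/patch records are named "CHR_...".
    return not name.startswith("CHR_")


if __name__ == "__main__":
    toy = [">1 toy\n", "ACGT\n", ">CHR_TOY_PATCH\n", "NNNN\n", ">MT\n", "GG\n"]
    print("".join(filter_fasta(toy, is_primary)), end="")
```

Because it streams line by line, this never holds a whole chromosome in memory; the output can be written to a new file and then indexed as usual.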