Question

Coming from farm animal genomes, how do I deal with the large assemblies for mouse and human?

0

5.3 years ago

colin.kern ★ 1.1k

I mainly do research in farm animals, but I am currently working on a comparative analysis that includes ENCODE data from mouse and human. I was expecting the genome files for these assemblies to be similar to the farm animal genomes, i.e. I expected the 3 GB human genome to be about a 3 GB FASTA file. However, it's 54 GB. Similarly, the mouse assembly is 12 GB. It seems like this is due to patches, which add alternate sequences to the assembly. Is that right?

This is causing me to doubt a lot of what I'm doing currently. Will the same analysis pipeline I've been using for farm animals be suitable, or do I need to do something special to account for these patches? How similar are some of these alternate sequences? Will I need to deal with multi-mapped reads differently? Can I download genome assemblies without all these extra sequences, and how bad of an idea is that?

Also, a more technical question: because of the size of the human assembly, I've been having trouble getting bwa to index the genome in a reasonable amount of time. Is there somewhere I can download these index files?

alignment • 738 views

0

> 3 GB human genome to be about a 3 GB fasta file. However it's 54 GB.

Where? ~~Even with haplotypes etc. that should not be the case.~~ The top-level genome file for human is about 1 GB compressed (from Ensembl).

> Is there somewhere I can download these index files?

You can use Illumina's iGenomes site to download matched sequence, annotation and index bundles.

5.3 years ago by GenoMax 147k

0

I downloaded them from Ensembl. Uncompressed it becomes 54 GB. Compression is very efficient because of so much repetition of a small alphabet.

5.3 years ago by colin.kern ★ 1.1k

3

Use the primary assembly, unless you have a need to worry about the patches etc.

$ du -sh Homo_sapiens.GRCh38.dna.primary_assembly.fa
3.0G    Homo_sapiens.GRCh38.dna.primary_assembly.fa

5.3 years ago by GenoMax 147k
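
To check for yourself where the extra size in a large FASTA file comes from, one option is to tally the length and N content of each record. As far as I know, Ensembl's top-level files include haplotype and patch regions padded with Ns, which would explain both the 54 GB and the very good compression ratio, but treat that as something to verify on your own download. The snippet below is a minimal sketch, not part of anyone's pipeline: the function name `fasta_record_stats` and the command-line usage are made up for illustration, and it assumes a plain (uncompressed) FASTA file.

```python
import sys
from typing import Dict, Iterable, Tuple


def fasta_record_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each FASTA record name to (total bases, number of N/n bases).

    Hypothetical helper for illustration. It streams line by line, so it
    never holds a whole chromosome in memory.
    """
    stats: Dict[str, Tuple[int, int]] = {}
    name = None
    total = 0
    n_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the record we just finished before starting a new one.
            if name is not None:
                stats[name] = (total, n_count)
            # Record name = first whitespace-delimited token after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            total = 0
            n_count = 0
        elif name is not None:
            total += len(line)
            n_count += line.count("N") + line.count("n")
    if name is not None:
        stats[name] = (total, n_count)
    return stats


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python fasta_stats.py genome.fa
    with open(sys.argv[1]) as handle:
        for rec, (length, n_bases) in fasta_record_stats(handle).items():
            frac = n_bases / length if length else 0.0
            print(f"{rec}\t{length}\t{n_bases}\t{frac:.1%}")
```

If most of the non-chromosome records turn out to be almost entirely N, that supports using the primary assembly for alignment, as suggested above, unless your analysis specifically needs the patches or alternate haplotypes.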