Hello,
When I download the release 95 soft-masked human genome file "ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.toplevel.fa.gz", it is ~1 GB, but once I decompress it, it grows to 54 GB.
This is curious, as the equivalent soft-masked mouse genome is only 2.7 GB decompressed. Any idea why the human genome is so large, and is there a tool to reformat the FASTA file into a smaller one?
Thanks, A
That makes a lot of sense! So if I download just the "primary_assembly" file instead, the decompressed file would be smaller, and the only thing I would lose is the haplotypes. Is that a fair assumption? Thanks
Yes, that is absolutely right.
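If you want to check this on your own copy before switching files, here is a minimal Python sketch that streams the gzipped FASTA without decompressing it to disk and adds up the sequence characters per record. The function names (`record_sizes`, `split_totals`) are made up for this example, and the `CHR_` prefix used to separate haplotype/patch records from the primary sequences is an assumption about the Ensembl naming, so check it against the headers in your file.

```python
import gzip
import sys


def record_sizes(path):
    """Return {record_name: sequence_length} for a FASTA file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    sizes = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Record name = first whitespace-delimited token of the header.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                sizes.setdefault(name, 0)
            elif name is not None:
                # Count sequence characters, ignoring the trailing newline.
                sizes[name] += len(line.strip())
    return sizes


def split_totals(sizes, alt_prefix="CHR_"):
    """Sum lengths of primary vs. alternate records.

    ASSUMPTION: alternate (haplotype/patch) records have names starting
    with alt_prefix; adjust this to match the headers in your file.
    """
    alt = sum(n for name, n in sizes.items() if name.startswith(alt_prefix))
    primary = sum(sizes.values()) - alt
    return primary, alt


if __name__ == "__main__" and len(sys.argv) > 1:
    sizes = record_sizes(sys.argv[1])
    primary, alt = split_totals(sizes)
    print(f"records: {len(sizes)}")
    print(f"primary sequence characters:   {primary}")
    print(f"alternate sequence characters: {alt}")
    # Show the ten largest records to see what dominates the file size.
    for name, n in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{name}\t{n}")
```

Save it under any name you like and run it with the path to the .fa.gz file as the only argument. Each sequence character takes one byte once decompressed, so the totals give a rough idea of how much of the 54 GB comes from the non-primary records.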