I am new to working with human genome files. I found three versions of the whole human genome on Ensembl (ftp://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/dna/): toplevel, hard-masked toplevel, and soft-masked toplevel. Their sizes are quite different. Could anyone please briefly describe the differences?
In addition, where can I download the GFF file? The site above has a GFF file with many regulatory features, e.g., histone methylation. I only need exon, intron, and UTR information. Thank you very much!
The sizes differ only because of file compression: masking changes bases in place rather than adding or removing them, so the uncompressed files contain the same number of characters. "Masking" refers to marking a region in a sequence in some way. Typically, this is done with repeat and low-complexity regions so that some aligners (e.g., BLAST) can avoid them.

There are two ways to mask a region in a FASTA file. Firstly, one can write its bases in lower case (e.g., "acgt") rather than upper case (e.g., "ACGT"). This is called soft-masking. Secondly, one can instead replace every base in the repetitive/low-complexity regions with an N, which is termed hard-masking.

For most cases you'll want to use either the soft-masked or the unmasked reference files. If you're using tools like BWA, TopHat, or Bowtie2 (i.e., almost anything meant to handle NGS data), then the results from a soft-masked and an unmasked reference will be identical, since most of these tools simply ignore a base's case. However, should you ever need a tool that accounts for masking, already having a soft-masked genome downloaded can be convenient. For that reason, I personally tend to download the soft-masked versions just so I don't have to bother downloading them later.
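To make the relationship between the three versions concrete, here is a minimal Python sketch. It assumes a toy soft-masked string rather than a real Ensembl file, and the function names `unmask` and `hard_mask` are hypothetical helpers written for illustration. It shows that both the unmasked and the hard-masked forms can be derived from a soft-masked sequence, and that the length never changes.

```python
def unmask(seq: str) -> str:
    """Soft-masked -> unmasked: upper-case every base."""
    return seq.upper()


def hard_mask(seq: str) -> str:
    """Soft-masked -> hard-masked: replace each lower-case base with N."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy soft-masked sequence; the lower-case stretch stands in for a repeat.
soft = "ACGTacgtacgtGGCC"

print(unmask(soft))     # ACGTACGTACGTGGCC
print(hard_mask(soft))  # ACGTNNNNNNNNGGCC

# Masking edits bases in place, so all three forms have the same length.
print(len(soft), len(unmask(soft)), len(hard_mask(soft)))  # 16 16 16
```

The conversion only works in this direction: once bases have been replaced by N, the original sequence cannot be recovered, which is one reason the soft-masked download is the more flexible one to keep.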
Hi Devon,
Your explanations are really helpful. I have just one more question: you mentioned that using a soft-masked or unmasked genome file should have no effect on mapping (with either tophat2 or bwa), so what about a hard-masked reference? What are the side effects of using the hard-masked file? Thanks a lot!
It's generally a bad idea to use hard-masked files. You're not going to get alignments to stretches of N, so any read that actually arose from such a region may incorrectly align elsewhere. Using a hard-masked genome is therefore expected to decrease overall mapping quality. The only benefit is that you can map things a bit faster, but that's often a bad trade-off.
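As a rough illustration of what a hard-masked reference hides from an aligner, the following Python sketch lists the stretches of N in a sequence and the fraction of bases they cover. It uses a toy string, and the helper names `n_runs` and `masked_fraction` are hypothetical; on a real genome you would apply the same idea per chromosome.

```python
import re


def n_runs(seq: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive N bases."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq)]


def masked_fraction(seq: str) -> float:
    """Fraction of the sequence that is N (0.0 for an empty sequence)."""
    return seq.count("N") / len(seq) if seq else 0.0


# Toy hard-masked sequence with two masked stretches.
hard = "ACGTNNNNNNNNGGCCNNAA"

print(n_runs(hard))           # [(4, 12), (16, 18)]
print(masked_fraction(hard))  # 0.5
```

Reads that truly come from the listed intervals have nowhere correct to align on this reference, which is why they can end up placed elsewhere.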
Hi Devon,
Yes, it's bad to use the hard-masked file. Thanks a lot for your detailed answer.
I am very satisfied with the explanation. Thank you for the easy explanation.