What exactly are "unmappable regions"? My understanding from some Google searches is that they are short regions on the gene that are difficult to map. Is this correct? If so, why are there short regions and long regions? Aren't they split randomly?
The genome contains a lot of what is called "repetitive DNA": sequences that appear many times throughout the genome. For example, LINE1 and Alu are two types of repetitive sequence that make up a large fraction of the human genome. Repetitive DNA gets sequenced along with everything else in assays like WGS and ChIP-seq, but aligners have a hard time figuring out where such a read comes from, because the sequence could have originated from many different places. The same thing happens with paralogous genes that have very similar sequences: the aligner can't tell exactly which copy the read originated from. This is why, in short-read sequencing, a lot of reads are discarded from the analysis: we don't know their true genomic origin. Long-read sequencing mostly avoids this problem, because a long read is more likely to extend past the repeat into unique flanking sequence.
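To make the ambiguity concrete, here is a minimal toy sketch (not how a real aligner works, just exact string matching). The reference sequence, the repeat unit, and the reads are all made up for illustration, and `find_all` is a hypothetical helper. It shows a short read inside a repeat matching two places, a read from unique sequence matching one, and a read that reaches from the repeat into its unique flank also matching only one.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further so overlapping matches are found too.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy reference: the same repeat unit occurs twice,
# separated and flanked by sequence that occurs only once.
REPEAT = "ACGTACGA"
reference = "TTGCA" + REPEAT + "GGCTTAAC" + REPEAT + "CCATG"

repeat_read = "CGTACG"       # lies entirely inside the repeat
unique_read = "GGCTTA"       # lies in the unique spacer between the repeats
spanning_read = "ACGAGGCT"   # starts in the first repeat copy, ends in the unique spacer

repeat_hits = find_all(reference, repeat_read)
unique_hits = find_all(reference, unique_read)
spanning_hits = find_all(reference, spanning_read)

print("repeat read:  ", repeat_hits)    # two candidate origins, so ambiguous
print("unique read:  ", unique_hits)    # one candidate origin
print("spanning read:", spanning_hits)  # one candidate origin thanks to the unique flank
```

The spanning read is the toy version of why longer reads help: once a read covers some unique sequence next to the repeat, only one placement is consistent with the whole read.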
So instead of randomly assigning such a read to one of its "predicted" genomic origins, we just discard all of them?
It depends on the alignment parameters you define, but I'd say in most cases a read would be aligned to multiple locations and assigned a lower "mapping quality" score.
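As a rough illustration of what a lower mapping quality means: MAPQ is defined as a Phred-scaled probability that the reported position is wrong, i.e. -10 * log10(P(wrong)). The sketch below is a simplified assumption, not what any particular aligner actually computes (real aligners use their own heuristics based on alignment scores). It assumes a read has `n` equally good placements, so a randomly chosen one is wrong with probability 1 - 1/n. The function name `toy_mapq`, the cap of 60, the read names, and the threshold are all hypothetical.

```python
import math


def toy_mapq(n_equally_good_hits: int, cap: int = 60) -> int:
    """Phred-scaled chance that one of n equally good placements is the wrong one.

    Simplified assumption: all placements are equally likely to be the true origin.
    The cap is an arbitrary illustrative maximum for uniquely placed reads.
    """
    if n_equally_good_hits < 1:
        raise ValueError("a mapped read needs at least one placement")
    p_wrong = 1 - 1 / n_equally_good_hits
    if p_wrong == 0:
        # Unique placement: the formula would give infinity, so report the cap.
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# Hypothetical reads: (name, number of equally good placements).
reads = [("read_unique", 1), ("read_two_copies", 2), ("read_alu_like", 10)]

scores = {name: toy_mapq(n_hits) for name, n_hits in reads}

# Typical downstream step: keep only reads at or above a chosen MAPQ threshold.
MIN_MAPQ = 10  # illustrative threshold, not a recommendation
kept = [name for name, _ in reads if scores[name] >= MIN_MAPQ]

print(scores)
print(kept)
```

So the multi-mapped reads are usually still reported with a position, and it is the MAPQ filter in the downstream analysis that ends up discarding them.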