Hi all,
I think this is a very common issue in exome sequencing. Whenever we do exome sequencing and variant calling, certain genes pop up far more often than others:
MUC (mucins), USP (ubiquitin-specific peptidases), CYP genes, HLAs, TTN, and more.
Most of the time, mucins show up because of paralogous alignment, and some genes show up simply because of their enormous length, but how does the community deal with the rest of these? Do we simply exclude them from further analysis? Is there a list of such messy genes? How does one decide that a call in such a gene is a false positive?
I also found some blogs and a Biostars question that discuss this issue.
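For what it's worth, one rough way to build such a list from your own data is to count how many unrelated samples carry at least one call in each gene and flag the genes that recur in most of them. The sketch below is only an illustration: the function name `flag_recurrent_genes`, the `min_fraction` cutoff, and the toy sample data are all made up, and it assumes you have already annotated each sample's variants with gene symbols.

```python
from collections import Counter


def flag_recurrent_genes(sample_to_genes, min_fraction=0.5):
    """Return {gene: fraction of samples with a call} for genes at or above min_fraction.

    sample_to_genes maps a sample name to the gene symbols hit in that sample
    (hypothetical input layout; the 0.5 default is an arbitrary choice).
    """
    n_samples = len(sample_to_genes)
    if n_samples == 0:
        return {}
    counts = Counter()
    for genes in sample_to_genes.values():
        counts.update(set(genes))  # count each gene at most once per sample
    return {
        gene: count / n_samples
        for gene, count in counts.items()
        if count / n_samples >= min_fraction
    }


# Toy, invented data: four unrelated samples with gene-level hits.
calls = {
    "sample_A": ["MUC2", "TTN", "BRCA2"],
    "sample_B": ["MUC2", "TTN", "TTN"],
    "sample_C": ["MUC2", "CYP2D6"],
    "sample_D": ["MUC2", "TTN"],
}
suspects = flag_recurrent_genes(calls, min_fraction=0.75)
print(sorted(suspects))  # ['MUC2', 'TTN']
```

A gene flagged this way is only a candidate for closer inspection, not a confirmed false positive; long genes such as TTN will recur partly because they offer more sequence in which real variants can occur.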
I've been wondering about this. Aren't the mucins distinct enough that it's not due to paralogy? If I find a variant in MUC2 and BLAT 100 bases around it back to the genome, I find only 1 hit or 1 obvious best hit.
Is it because of incorrect assembly?
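In case anyone wants to repeat the BLAT check, here is a minimal sketch of how the query sequence could be cut out around a variant. It assumes "100 bases around" means 100 bases on each side, that positions are 0-based, and that the reference sequence is already in memory as a string; `flanking_window` is a made-up helper name and the sequence below is a toy, not real MUC2 sequence.

```python
def flanking_window(sequence, variant_index, flank=100):
    """Return (start, end, subsequence) covering `flank` bases on each side.

    Coordinates are 0-based and half-open, clamped to the sequence bounds.
    """
    start = max(0, variant_index - flank)
    end = min(len(sequence), variant_index + flank + 1)
    return start, end, sequence[start:end]


# Toy 400-base sequence standing in for a reference region.
reference = "ACGT" * 100
start, end, query = flanking_window(reference, variant_index=250)
print(start, end, len(query))  # 150 351 201
```

The resulting `query` string is what would be pasted into BLAT; a single clear best hit suggests the region itself is unique, though it does not rule out mismapped reads from elsewhere.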