Does anyone know how to remove repetitive elements from a genome masked with RepeatMasker? I'm trying to estimate the fraction of the reference genome covered by Illumina reads from two BWA mapping runs: the first without accounting for repetitive elements, and the second with repetitive elements removed, so that I get the fraction covered of the 'mappable' regions of the genome.
The masked regions are in lower case while the rest is in upper case. I thought I could just use 'find and replace' to remove the lower-case bases, but I can't open genome-sized files in my text editor.
Hope this makes sense and thanks in advance.
Hi Devon, thanks for your reply. What I'm trying to do is completely remove the repetitive elements from the sequences. For example, the soft-masked sequence GAATCggactTTAC would become GAATCTTAC. With a hard-masked genome, removing all the Ns would also remove missing data, not just the repeats.
Eek, that's a great way to produce a meaningless metric. I strongly encourage you to only hard-mask (and even that's extreme, since you can at least partially align uniquely to repeat regions). So while I could show you how to do what you want, I won't.
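For anyone who wants to try the hard-masking route instead, here is a minimal Python sketch, assuming a soft-masked FASTA where repeats are in lower case. The function name `hard_mask` and the record name `chr_demo` are made up for illustration. It streams the file line by line, so the whole genome never has to fit in a text editor or in memory:

```python
import io

def hard_mask(in_handle, out_handle):
    """Copy a FASTA stream, replacing lower-case (soft-masked) bases with N."""
    for line in in_handle:
        if line.startswith(">"):
            # Header lines are copied unchanged.
            out_handle.write(line)
        else:
            # Newlines and upper-case bases pass through; lower case becomes N.
            out_handle.write("".join("N" if c.islower() else c for c in line))

# Tiny in-memory demo using the sequence from the example above.
demo = io.StringIO(">chr_demo\nGAATCggactTTAC\n")
result = io.StringIO()
hard_mask(demo, result)
print(result.getvalue(), end="")
```

On a real genome you would pass file handles from `open(...)` instead of the `io.StringIO` objects. Sequence coordinates are preserved because every masked base is replaced rather than deleted.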
Hi Devon, I did manage to work it out simply by using find and replace on the Ubuntu Linux system I'm using. Then I came to the same conclusion as you :)
What I'm actually after is the effective genome size (or 'mappability' of the reference genome). I'm going to try out GEM. Unfortunately there are no bioinformaticians in my research group and we rarely work on genome-scale datasets, so it's all pretty new (and complex) to me.
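As a rough first look before running a proper mappability tool, one could simply count how many bases in the soft-masked FASTA are unmasked, soft-masked, or N. This is only a crude proxy, not what GEM computes (reads can still align uniquely inside annotated repeats), and the function name `count_bases` and the demo record are hypothetical:

```python
import io

def count_bases(handle):
    """Tally unmasked (upper-case), soft-masked (lower-case) and N bases in a FASTA stream."""
    counts = {"unmasked": 0, "soft_masked": 0, "n": 0}
    for line in handle:
        if line.startswith(">"):
            continue  # skip header lines
        for c in line.strip():
            if c in "Nn":
                counts["n"] += 1  # missing data, whatever its case
            elif c.islower():
                counts["soft_masked"] += 1
            else:
                counts["unmasked"] += 1
    return counts

# Tiny in-memory demo: 9 unmasked, 5 soft-masked and 2 N bases.
demo = io.StringIO(">chr_demo\nGAATCggactTTACNN\n")
counts = count_bases(demo)
total = sum(counts.values())
print(counts)
print("unmasked fraction:", counts["unmasked"] / total)
```

Keeping Ns in their own bucket separates missing data from repeats, which is exactly the distinction that is lost when lower-case bases or Ns are deleted outright.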
Cheers