Question

Coverage Estimates for Masked Genome

8.1 years ago

Magpie101 • 0

Does anyone know how to remove repetitive elements from a genome masked with RepeatMasker? What I'm trying to do is get an estimate of the fraction of the reference genome covered with Illumina reads from two mapping runs with BWA: the first run not taking into account repetitive elements and the second with repetitive elements removed so that I have an estimate of the fraction covered of the 'mappable' regions of the genome.
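For the "fraction covered" part, a minimal sketch follows. It assumes the BWA alignments have been summarised as per-base depth in three tab-separated columns (chromosome, position, depth), for example with `samtools depth -a`, so that zero-depth positions are included. The function name `covered_fraction` and the `min_depth` threshold are illustrative choices, not part of any tool.

```python
def covered_fraction(depth_lines, min_depth=1):
    """Return the fraction of listed positions with depth >= min_depth.

    Assumes each line is 'chrom<TAB>pos<TAB>depth' and that positions
    with zero depth are present in the input.
    """
    total = 0
    covered = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += 1
        if int(fields[2]) >= min_depth:
            covered += 1
    return covered / total if total else 0.0


if __name__ == "__main__":
    example = ["chr1\t1\t0\n", "chr1\t2\t3\n", "chr1\t3\t1\n", "chr1\t4\t0\n"]
    print(covered_fraction(example))
```

If the depth file omits uncovered positions, the denominator would have to come from the reference length instead.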

The masked regions are in lower case while the rest is in upper case. I thought I could just use 'find and replace' to remove the lower-case bases, but I can't open genome-sized files in my text editor.

Hope this makes sense and thanks in advance.

masked repeats Assembly • 1.9k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 8.1 years ago by Magpie101 • 0

score 1 · Answer 1 · 2016-11-04

8.1 years ago

Devon Ryan 105k

You can likely just download a hardmasked version of the genome from UCSC.

If you need to hardmask the file yourself, use tr 'acgt' 'NNNN' < file.fa > hardmasked.fa or something like that (note that this also rewrites lower-case letters in the header lines, so the chromosome names might get screwed up).
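One way to sidestep the chromosome-name problem is to leave header lines alone and only rewrite sequence lines. The following is a rough sketch of that idea, assuming a standard soft-masked FASTA in which repeats are lower case; the function names are made up for illustration.

```python
import re

LOWERCASE = re.compile(r"[a-z]")


def hardmask_fasta_lines(lines):
    """Yield FASTA lines with lower-case (soft-masked) bases replaced by N.

    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield LOWERCASE.sub("N", line)


def hardmask_fasta(in_path, out_path):
    """Stream a soft-masked FASTA file to a hard-masked copy."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(hardmask_fasta_lines(src))
```

Because the file is processed line by line, it does not need to fit in memory or in a text editor.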

ADD COMMENT • link 8.1 years ago by Devon Ryan 105k

Hi Devon, thanks for your reply. What I'm trying to do is completely remove the repetitive elements from the sequences. So, for example, the soft-masked sequence GAATCggactTTAC becomes GAATCTTAC. With a hardmasked genome, if I remove all the Ns then I'll also remove missing data.

ADD REPLY • link 8.1 years ago by Magpie101 • 0

Eek, that's a great way to produce a meaningless metric. I strongly encourage you to only hard-mask (and even that's extreme, since you can at least partially align uniquely to repeat regions). So while I could show you how to do what you want, I won't.

ADD REPLY • link 8.1 years ago by Devon Ryan 105k

Hi Devon, I did manage to work it out simply using find and replace on the Ubuntu Linux system I'm using. I then came to the same conclusion as you :)

What I'm actually after is the effective genome size (or 'mappability' of the reference genome). I'm going to try out GEM. Unfortunately we have no bioinformaticians in my research group and rarely work on genome-scale datasets, so it's all pretty new (and complex) to me.
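As a crude first look before running a proper mappability tool, the bases in a soft-masked FASTA can be tallied by class without deleting anything. This is only a sketch under the assumption that lower case means repeat-masked and N means missing data; it is not a mappability calculation, and `count_base_classes` is a hypothetical helper name.

```python
def count_base_classes(lines):
    """Count unmasked, soft-masked and N bases in soft-masked FASTA lines."""
    counts = {"unmasked": 0, "soft_masked": 0, "n": 0}
    for line in lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        n_bases = seq.count("N") + seq.count("n")
        # Lower-case letters other than 'n' are treated as soft-masked repeats.
        lower = sum(map(str.islower, seq)) - seq.count("n")
        counts["n"] += n_bases
        counts["soft_masked"] += lower
        counts["unmasked"] += len(seq) - n_bases - lower
    return counts


if __name__ == "__main__":
    with open("genome.fa") as handle:  # placeholder path
        print(count_base_classes(handle))
```

The "unmasked" count gives the non-repeat, non-gap size of the reference, which is at best a lower bound on what reads can map to uniquely.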

Cheers

ADD REPLY • link 8.1 years ago by Magpie101 • 0