Question

How to mask all repeats and low complexity regions using RepeatMasker?

6.0 years ago

zwz110 • 0

I have a genome sequence in FASTA format, and I want a soft-masked version of the genomic DNA.

After searching Google, I found that this means the following: all repeats and low-complexity regions should be replaced with the lower-case versions of their nucleotides. I have installed RepeatMasker on Linux, but I'm new to it. The RepeatMasker manual says "Default settings are for masking all type of repeats in a primate sequence.", but I'm not sure that suits my genome.

I'm confused and don't know what to do. Can anyone tell me how to do this? Thank you!
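To make the goal concrete, here is a minimal Python sketch of what soft-masking means: bases inside repeat intervals are lower-cased and everything else stays upper case. The sequence and the interval coordinates are made up for illustration, and the function name `soft_mask` is hypothetical; a real tool such as RepeatMasker finds the intervals itself.

```python
def soft_mask(sequence, intervals):
    """Lower-case the bases inside each half-open (start, end) interval.

    Coordinates are 0-based; everything outside the intervals is upper-cased.
    """
    bases = list(sequence.upper())
    for start, end in intervals:
        # Clamp the interval to the sequence so bad coordinates cannot fail.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


# Hypothetical sequence with an AT repeat at positions 4..11.
masked = soft_mask("ACGTATATATATGGCC", [(4, 12)])
print(masked)
```

No sequence information is lost this way, which is why soft-masking is often preferred over replacing repeats with N.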

Repeatmasker • 2.8k views

score 0 · Answer 1 · 2019-07-24

6.0 years ago

2nelly ▴ 350

Hi zwz110,

You can download an already-masked genome directly from UCSC (Golden Path) or NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genomes/

Masked regions are shown in lower case.

For instance, the masked human chromosome 1 of the GRCh38 assembly is here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/CHR_01/hs_ref_GRCh38.p12_chr1.mfa.gz

Then see this post: Can I Convert Fasta Lowercase Bases To 'N'?
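As a rough illustration of the conversion that post title refers to, the sketch below turns a soft-masked FASTA (lower-case repeats) into a hard-masked one (repeats as N). It is a minimal example and not taken from the linked post; the function name `hard_mask_fasta` and the demo record are hypothetical.

```python
def hard_mask_fasta(lines):
    """Yield FASTA lines with lower-case (soft-masked) bases replaced by N."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header lines are passed through unchanged.
            yield line
        else:
            yield "".join("N" if ch.islower() else ch for ch in line)


# Hypothetical two-line FASTA record, for illustration only.
demo = [">chr1 demo\n", "ACGTacgtNNac\n"]
for out_line in hard_mask_fasta(demo):
    print(out_line)
```

For a real genome you would pass an open file handle (for example from `gzip.open(path, "rt")`) instead of the `demo` list, so the file is streamed line by line.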

Thank you, I got it! I would like to know more about how it is done, for example how the soft-masking is performed with RepeatMasker and which parameters are used. In other words, I want to understand what happens when the sequence Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz becomes the sequence Oryza_sativa.IRGSP-1.0.dna_rm.toplevel.fa.gz. If you know, could you tell me?

ADD REPLY • link 6.0 years ago by zwz110 • 0
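The thread does not say which parameters were used to produce the Ensembl files, so the following is only a sketch of how one might run RepeatMasker for soft-masking. It relies on RepeatMasker's `-xsmall` option (repeats returned in lower case instead of N), `-species`, `-pa` (parallel jobs) and `-dir` (output directory). The file names, the species string and the helper names are assumptions, and the species must exist in the installed repeat library. Note also that in Ensembl's file naming, `dna_rm` denotes hard-masked sequence (repeats as N) and `dna_sm` denotes soft-masked sequence.

```python
import shlex
import shutil
import subprocess


def build_repeatmasker_cmd(fasta, species, out_dir, threads=4):
    """Build a RepeatMasker command line for soft-masking (not executed here)."""
    return [
        "RepeatMasker",
        "-xsmall",             # lower-case repeats instead of replacing with N
        "-species", species,   # assumed to be present in the repeat library
        "-pa", str(threads),   # number of parallel jobs
        "-dir", out_dir,       # where the .masked and .out files are written
        fasta,
    ]


def run_repeatmasker(fasta, species, out_dir, threads=4):
    """Run RepeatMasker if it is on PATH; raise a clear error otherwise."""
    if shutil.which("RepeatMasker") is None:
        raise RuntimeError("RepeatMasker was not found on PATH")
    cmd = build_repeatmasker_cmd(fasta, species, out_dir, threads)
    subprocess.run(cmd, check=True)


# Hypothetical input file and output directory; this only prints the command.
print(shlex.join(build_repeatmasker_cmd("genome.fa", "oryza sativa", "rm_out")))
```

Low-complexity regions are masked by default unless `-nolow` is given, so no extra flag should be needed for them. The masked sequence is written next to the other outputs with a `.masked` suffix.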