Question

How To Mask Repeats In NGS Data

12.9 years ago

Daniel ▴ 40

How can I mask repeats in next-generation sequencing data? I have several million NGS reads from a mammalian genome that has not been sequenced yet. I would like to filter out the reads that have a significant hit against the RepBase or RepeatMasker databases. I would appreciate it if anybody could give me more specific instructions.

repeats next-gen • 9.0k views

ADD COMMENT • link updated 8.7 years ago by Dattatray Mongad ▴ 380 • written 12.9 years ago by Daniel ▴ 40

score 4 · Answer 1 · 2012-06-22

12.9 years ago

JC 13k

I simply filter reads from repetitive sequences using 2 approaches:

1) Simple repeats and low-complexity sequences can be filtered with DUST, or I just compute the complexity of each sequence using entropy or compression ratio.

2) Interspersed repeats can be filtered by mapping the reads to the RepBase consensi with Bowtie, BWA or Blat (with -fastMap). This step can filter millions of reads in a few minutes.

If you are expecting a lot of repetitive sequences (as in genome sequencing), I strongly suggest filtering before mapping/assembling; otherwise it doesn't give you any advantage.
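
A minimal sketch of the complexity check from approach 1, assuming Python. The function names and the 1.5-bit cutoff are hypothetical choices for illustration, not values from DUST or any published tool; tune the threshold on your own reads. Note that zlib adds a fixed overhead, so the compression ratio of short reads is only meaningful when compared between reads of similar length.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer distribution of seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(seq):
    """Compressed size divided by raw size; lower means more repetitive."""
    raw = seq.upper().encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def is_low_complexity(seq, min_entropy=1.5):
    """Flag a read whose base composition entropy is below min_entropy.

    min_entropy is a hypothetical cutoff (the maximum for DNA with k=1 is 2 bits).
    """
    return shannon_entropy(seq) < min_entropy
```

With k=1 a homopolymer scores 0 bits and an even mix of the four bases scores 2 bits; raising k (for example to 2 or 3) makes the measure sensitive to short tandem repeats such as (AC)n, which have a balanced base composition but few distinct k-mers.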

ADD COMMENT • link 12.9 years ago by JC 13k

Could you clarify #2? Do you mean you'd filter reads that map to multiple places?

ADD REPLY • link 11.2 years ago by brentp 24k

No. You map the reads to the consensus sequences of known repeats, obtained from RepBase or any other source, and filter out the reads that match.
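
To illustrate the filtering step, here is a rough sketch in Python, assuming the aligner (Bowtie, BWA, ...) wrote its alignments against the repeat consensi in SAM format. The function names are hypothetical. It relies only on the SAM FLAG field, where bit 0x4 marks an unmapped read, and it assumes the read names in the FASTQ match the SAM QNAME column (paired-end suffixes such as /1 may need extra handling).

```python
def mapped_read_names(sam_lines):
    """Return the names of reads with at least one alignment in the SAM lines."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, drop_names):
    """Yield 4-line FASTQ records whose read name is not in drop_names."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Header looks like "@name optional description"
            name = record[0][1:].split()[0]
            if name not in drop_names:
                yield tuple(record)
            record = []
```

Both functions take any iterable of lines, so they work on open file handles as well as on lists; the reads that survive `filter_fastq` are the ones with no hit against the repeat library.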

ADD REPLY • link 11.2 years ago by JC 13k

Dear JC,

I have some very basic questions about how to map reads to the RepBase consensi. Could you please give me details on the following?

- What is a RepBase consensus? Is it distinct for each repeat family? Is it distinct across species?
- Where can I find it/them for human?
- Do I build a regular bowtie2 index from this consensus file?

Many thanks,

ADD REPLY • link 7.8 years ago by pyKey ▴ 70

score 2 · Answer 2 · 2012-06-22

12.9 years ago

Leonor Palmeira 3.9k

You could either:

- directly repeat-mask your data with RepeatMasker: http://www.repeatmasker.org/
- map your data against the genome of a closely related mammal and intersect the mapped positions with the RepeatMasker annotations for that genome
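
The second option boils down to an interval-overlap test. A small Python sketch under stated assumptions: the repeat annotations are available as (chrom, start, end) tuples in 0-based half-open coordinates (as in BED), and each mapped read has been reduced to the same form. The function names are hypothetical; for real data a tool built for interval intersection will likely be faster.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(intervals):
    """Merge overlapping repeat intervals and index them per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = [list(ivs[0])]
        for start, end in ivs[1:]:
            if start <= merged[-1][1]:  # overlaps or touches the previous one
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_repeat(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps a repeat on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that begins before the query ends
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start
```

Because the intervals are merged first, they are disjoint and sorted, so a single binary search per read is enough to decide whether it falls in an annotated repeat.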

ADD COMMENT • link 12.9 years ago by Leonor Palmeira 3.9k

score 2 · Answer 3 · 2012-06-22

12.9 years ago

Ian 6.1k

Just a thought (i.e. not sure it is practical/possible). But if you could obtain the repetitive sequences from RepBase you could use them as the reference sequences for an NGS sequence aligner, e.g. Bowtie. Any uniquely mapping reads could be excluded from your sample.

ADD COMMENT • link 12.9 years ago by Ian 6.1k

score 1 · Answer 4 · 2014-02-21

11.2 years ago

Biojl ★ 1.7k

A very simple and easy-to-use tool is SEG. It will replace the repeats and/or low-complexity regions (LCRs) in your sequences with 'XXXX'.

ADD COMMENT • link 11.2 years ago by Biojl ★ 1.7k

This is for protein sequences. It is also for masking mathematical repeats, not the type OP is interested in (which is identifying sequences using a reference library).

ADD REPLY • link 11.2 years ago by SES 8.6k

score 1 · Answer 5 · 2016-08-29

8.7 years ago

Dattatray Mongad ▴ 380

You can use "tantan", which is used by LAST to mask genomes before comparing them.
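
Once reads are soft-masked, they still need to be filtered. A short Python sketch, assuming the masker marks repetitive bases in lowercase (as far as I know this is tantan's default output, but check its documentation). The function names and the 0.5 cutoff are hypothetical.

```python
def masked_fraction(seq):
    """Fraction of bases in seq that are lowercase (soft-masked)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def keep_read(seq, max_masked=0.5):
    """Keep a read unless more than max_masked of its bases are masked.

    max_masked is a hypothetical cutoff; choose it for your own data.
    """
    return masked_fraction(seq) <= max_masked
```

For example, `keep_read("ACGTacgtacgt")` is False because two thirds of the read is masked, while a read with only a short masked stretch is kept.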

ADD COMMENT • link 8.7 years ago by Dattatray Mongad ▴ 380