How can I mask repeats in next-generation sequencing data? I have several million NGS reads from a mammalian genome that has not been sequenced yet. I would like to filter out those that have a significant hit against the Repbase or RepeatMasker databases. I would appreciate it if anybody could give me more specific instructions.
Could you clarify #2? Do you mean you'd filter reads that map to multiple places?
No, you can map the reads to the consensus sequences of known repeats, obtained from Repbase or any other source, and filter out the reads that match.
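To make that concrete, here is a minimal sketch of the filtering step in Python. It assumes you have already aligned the reads against a FASTA file of repeat consensus sequences and have the result as SAM; the bowtie2 commands in the comments are one way to get there, and all file names (repeats.fa, reads.fq, hits.sam) and function names are hypothetical. The only format fact it relies on is that bit 0x4 of the SAM FLAG field marks an unmapped read.

```python
# Assumed upstream steps (file names are placeholders):
#   bowtie2-build repeats.fa repeats_idx
#   bowtie2 -x repeats_idx -U reads.fq -S hits.sam


def mapped_read_ids(sam_lines):
    """Return the names of reads that have at least one alignment."""
    hits = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        name, flag = fields[0], int(fields[1])
        if not flag & 0x4:  # FLAG bit 0x4 set means "read unmapped"
            hits.add(name)
    return hits


def filter_fastq(fastq_lines, repeat_ids):
    """Yield 4-line FASTQ records whose read name is not in repeat_ids."""
    record = []
    for line in fastq_lines:
        line = line.rstrip("\n")
        if not line and not record:  # tolerate blank lines between records
            continue
        record.append(line)
        if len(record) == 4:
            name = record[0][1:].split()[0]  # drop '@' and any description
            if name not in repeat_ids:
                yield record
            record = []
```

With real files you would pass open file handles, e.g. `filter_fastq(open("reads.fq"), mapped_read_ids(open("hits.sam")))`, and write the surviving records out. If you only need the non-repeat reads, bowtie2's `--un` option writes the unpaired reads that fail to align to a file directly, which may make the script unnecessary.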
Dear JC,
I have some very basic questions about how to map reads to the Repbase consensus sequences. Could you please give me details on the following?
- What is a Repbase consensus? Is it distinct for each repeat family? Does it differ between species?
- Where can I find it/them for human?
- Do I build a regular bowtie2 index from this consensus file?
Many thanks,