I have a single FASTA file containing a genome of about 40 Mbp spread across ~30,000 separate sequences (contigs). About half is expected to be repetitive DNA. I am looking for a tool to do one of the following:
1) cut repeats from the original file and paste them into a new FASTA file
2) delete repeat regions from the file
3) mask repeat regions (replace all repetitive sequence with N)
The first option is ideal, but for any of the three choices I want to be as liberal as possible with the definition of "repetitive DNA". I want to avoid any potential repeat at all costs. Losing good data is better than keeping repeat data in this scenario.
Note that I don't want to merely reduce the number of copies of a repeated sequence. I want to delete or mask every instance of that repeat so that it does not occur even once in my genome file.
Any suggestions for tools that will perform any of these tasks? Thanks!
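The three options above can be sketched in a few lines once the repeats have been marked. This is a minimal sketch, not a repeat finder: it assumes the genome has already been soft-masked (repeats in lowercase, unique sequence in uppercase), which is what RepeatMasker produces with its `-xsmall` option. How liberal the repeat definition is depends entirely on that upstream masking step. The function names and output file paths here are hypothetical, chosen only for illustration.

```python
import re

# Any run of lowercase letters is treated as one soft-masked repeat region
# (assumption: the input was soft-masked by an upstream repeat finder).
LOWER_RUN = re.compile(r"[a-z]+")


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else "unnamed"), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_soft_masked(seq):
    """Return (repeats, deleted, hard_masked) for one soft-masked sequence.

    repeats     -- list of (start, end, repeat_sequence), 0-based, end-exclusive
    deleted     -- sequence with every repeat region removed (option 2)
    hard_masked -- sequence with every repeat base replaced by N (option 3)
    """
    repeats = [(m.start(), m.end(), m.group()) for m in LOWER_RUN.finditer(seq)]
    deleted = LOWER_RUN.sub("", seq)
    hard_masked = LOWER_RUN.sub(lambda m: "N" * len(m.group()), seq)
    return repeats, deleted, hard_masked


def process(in_path, repeats_path, deleted_path, masked_path):
    """Write all three outputs for a soft-masked FASTA file (hypothetical paths)."""
    with open(in_path) as src, \
         open(repeats_path, "w") as rep, \
         open(deleted_path, "w") as dele, \
         open(masked_path, "w") as mask:
        for name, seq in read_fasta(src):
            repeats, deleted, hard_masked = split_soft_masked(seq)
            # Option 1: each repeat region becomes its own record in a new file.
            for start, end, rep_seq in repeats:
                rep.write(f">{name}:{start}-{end}\n{rep_seq}\n")
            # Option 2: skip contigs that consisted only of repeats.
            if deleted:
                dele.write(f">{name}\n{deleted}\n")
            # Option 3: same length as the input, repeats replaced by N.
            mask.write(f">{name}\n{hard_masked}\n")
```

Note that deleting repeats (option 2) joins the flanking unique sequence together, which creates junctions that do not exist in the real genome. Hard-masking with N keeps the coordinates intact, which may matter for downstream tools.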
Thanks, this looks good.