I'm working on assemblies and want to create a mock metagenome assembly. My research uses biochemical approaches to enrich eukaryotic DNA, followed by computational enrichment to recover the genome.
Question: I would like to take a small, complete eukaryotic genome (approximately 20-40Mb), along with several complete bacterial genomes (~5Mb), and shred these all into 100-250bp (random) fragments.
That means each genome would be shredded 10-20 times, randomly and independently, so that overlaps are available. All separate files would be merged into one mock fasta file simulating an NGS library that has been cleaned and is ready for assembly.
I've tried searching for "genome shredding" and other derivatives for several weeks. Can anyone suggest software that would have this partially done, or some kind of framework for me to code this? This is the process I have thought of so far:
- Input file is one line of complete, assembled genome
- Each shredding pass repeatedly selects a number between 100 and 250, takes that number of nucleotides, and writes them into a new file in fasta format:
>random1
ATATATATA (sequence)
>random2
GCGCGCG (sequence)
- The 10 separate fasta files for each organism are all concatenated with cat into mocksequencing.fasta
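
For reference, the process listed above could be sketched in plain Python roughly like this. It is only a minimal sketch of the steps as I described them, not an existing tool: the names `shred_once`, `shred`, and `write_fasta` are made up for illustration, it assumes the genome is already loaded as a single string, and it does not simulate sequencing errors, reverse-complement reads, or random start offsets (every pass starts cutting at position 0).

```python
import random
import sys


def shred_once(sequence, min_len=100, max_len=250, rng=random):
    """Cut one sequence end to end into consecutive fragments of random length."""
    fragments = []
    pos = 0
    while pos < len(sequence):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        # The final fragment may be shorter than min_len.
        fragments.append(sequence[pos:pos + size])
        pos += size
    return fragments


def shred(sequence, passes=10, min_len=100, max_len=250, seed=None):
    """Run several independent passes so fragments from different passes overlap."""
    rng = random.Random(seed)  # a fixed seed makes the mock library reproducible
    reads = []
    for _ in range(passes):
        reads.extend(shred_once(sequence, min_len, max_len, rng))
    return reads


def write_fasta(reads, handle, prefix="random"):
    """Write reads as two-line fasta records named prefix1, prefix2, ..."""
    for i, read in enumerate(reads, start=1):
        handle.write(f">{prefix}{i}\n{read}\n")


if __name__ == "__main__":
    # Toy stand-in for a real genome string.
    toy_genome = "".join(random.Random(0).choice("ACGT") for _ in range(2000))
    write_fasta(shred(toy_genome, passes=2, seed=1)[:3], sys.stdout)
```

Calling `write_fasta` once per organism on the same open file handle, with a different `prefix` for each, would give unique record names and make the separate cat step unnecessary.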
I feel like this isn't too complicated or out of the norm for a lot of studies, so writing this myself would be a bit redundant. Is this somewhere in the Biopython documentation? Thanks!
Thank you! Knowing what these are commonly referred to as is a huge help. I will look into that.