Hello Biostars community,
I would like to simulate structural variant calling (i.e. genomic inversions, deletions and insertions) in order to understand some experimental results I am getting.
My idea is:
- Generate a random DNA sequence of defined length (e.g. 1 Mbp) with equal probability of A/T/C/G and store it as fasta1
- Manually create genomic inversion/deletion/insertion/duplication etc. and store as fasta2
- Tricky part: Use the sequence from fasta2 to generate random paired-end data in FASTQ format (i.e. for each read, a random but unique header, a sequence of defined length derived from fasta2, and the highest quality score at every base). These paired-end "reads" would also need to have a defined insert length (let's say 500 bp with some standard deviation).
Since my knowledge in coding is basic-next-to-nothing, I am not sure if this is actually possible and have no idea if I should use R, Python or...? Any help or existing scripts would be highly appreciated.
Thank you in advance.
I am already dealing with bacterial genomes. They frequently have inversions/duplications etc. so I want to generate a random sequence which will (hopefully) be free of such structures. Thanks for No.3 :)
If you need a random sequence, then use: "Generate Random Dna Sequence Data With Equal Base Frequencies"
Or two online sites:
http://users-birc.au.dk/biopv/php/fabox/random_sequence_generator.php
http://www.faculty.ucr.edu/~mmaduro/random.htm
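If you would rather script it, here is a minimal Python sketch of the same idea: draw each base uniformly from A/C/G/T and wrap the result as a FASTA record. The function names `random_dna` and `fasta_record` and the 60-column line width are my own choices for illustration, not taken from any of the tools linked above.

```python
import random


def random_dna(length, seed=None):
    """Return a random DNA string in which A, C, G and T are equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    return "".join(rng.choices("ACGT", k=length))


def fasta_record(name, seq, width=60):
    """Format a sequence as a FASTA record with lines of at most `width` bases."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For a 1 Mbp reference you could then write `fasta_record("random_1Mbp", random_dna(1_000_000, seed=1))` to a file such as fasta1.fa with an ordinary `open(..., "w")`.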
Everything works well, I didn't think I would pull this off so easily. All the tools are already there :) BTW, any chance there is an automated generator of inversions/duplications and such (point 2)? I am doing it manually, and it's a little bit time-consuming.
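I can't point to a specific tool for this, but the edits themselves are simple string operations, so a small script may be enough. Below is a rough Python sketch, assuming 0-based, half-open coordinates (`start` included, `end` excluded) and an uppercase A/C/G/T sequence. The function names are hypothetical, and the duplication is assumed to be tandem (the copy is placed directly after the original segment).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement (an inversion)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def delete(seq, start, end):
    """Remove seq[start:end]."""
    return seq[:start] + seq[end:]


def insert(seq, pos, fragment):
    """Insert `fragment` so that it starts at position `pos`."""
    return seq[:pos] + fragment + seq[pos:]


def duplicate(seq, start, end):
    """Tandem duplication: place a second copy of seq[start:end] right after it."""
    return seq[:end] + seq[start:end] + seq[end:]
```

If you apply several events to the same sequence, working from the rightmost coordinate to the leftmost keeps the positions of the remaining events valid, because deletions, insertions and duplications shift everything downstream of them.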
BBMap has a recent addition called mutate.sh, which I made for testing the sensitivity of contaminant removal when the contaminants are bacterial strains of the same species. It creates a mutant variant of a genome. For example:
This will create a mutant version of the original genome with 95% identity to the original. The mutations are random, with no conserved locations (though I may add that option later), so any duplications or inversions in the original will (probabilistically) not be present in the mutant, since each copy would have received different mutations. However, the general structure will still be similar to that of a real bacterium. If you want to generate synthetic reads from a bacteria-like genome with no repeats or inversions, I suggest you run mutate.sh on a real bacterial genome, then use randomreads.sh on the mutant genome. 95% identity should be sufficiently low (averaging one mutation every 20 bp), though it depends on your specific needs.
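For anyone who wants to see what the read-simulation step from the original question involves, here is a minimal Python sketch. It is not how randomreads.sh works internally, just an illustration under my own assumptions: "insert length" is taken to mean the full fragment length (outer distance between read ends), fragment lengths are drawn from a normal distribution, reads are error-free, and every base gets the quality character `I` (Q40 in Phred+33 encoding). All names and defaults are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(genome, n_pairs, read_len=100,
                   insert_mean=500, insert_sd=50, seed=None):
    """Yield (name, read1, read2) tuples of error-free paired-end reads."""
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)
    for i in range(n_pairs):
        # Draw a fragment length and clamp it to [read_len, len(genome)].
        frag_len = int(round(rng.gauss(insert_mean, insert_sd)))
        frag_len = max(read_len, min(frag_len, len(genome)))
        start = rng.randint(0, len(genome) - frag_len)
        fragment = genome[start:start + frag_len]
        # Sample fragments from either strand with equal probability.
        if rng.random() < 0.5:
            fragment = reverse_complement(fragment)
        read1 = fragment[:read_len]
        # Read 2 comes from the opposite end, on the opposite strand.
        read2 = reverse_complement(fragment[-read_len:])
        yield "pair%d" % i, read1, read2


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant quality character."""
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual_char * len(seq))
```

To get two FASTQ files, loop over `simulate_pairs(...)` and write `fastq_record(name + "/1", read1)` to one file and `fastq_record(name + "/2", read2)` to the other. The running index in the name keeps the headers unique.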