Generating random DNA sequence and paired-end alignment
8.3 years ago
ThePresident ▴ 180

Hello Biostars community,

I would like to simulate structural variant calling (i.e. for genomic inversions, deletions and insertions) in order to understand some experimental results I am getting.

My idea is:

  1. Generate a random DNA sequence of defined length (e.g. 1 Mbp) with equal probability of A/T/C/G and store it as fasta1.
  2. Manually create a genomic inversion/deletion/insertion/duplication etc. and store the result as fasta2.
  3. Tricky part: use the sequence from fasta2 to generate random paired-end reads in FASTQ format (i.e. a random but unique header, and a sequence of defined length derived from fasta2 with the highest quality score). These paired-end "reads" would also need a defined insert length (let's say 500 bp with some standard deviation).

Since my knowledge of coding is basic-to-nothing, I am not sure whether this is actually possible, and I have no idea whether I should use R, Python or something else. Any help or existing scripts would be highly appreciated.

Thank you in advance.

R Python Simulation • 3.2k views
8.3 years ago
GenoMax 147k
  1. You could grab a bacterial genome from GenBank.
  2. You will have to do this manually.
  3. Use randomreads.sh from BBMap (see the guide thread).
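
For anyone curious what a paired-end read simulator does in principle, here is a minimal Python sketch of step 3. It is not how randomreads.sh works internally. The function name `simulate_pairs`, the `sim_N` read names, and the constant quality character are all made up for illustration, and "insert length" is assumed to mean the full fragment length (outer distance between read ends).

```python
import random

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def simulate_pairs(genome, n_pairs, read_len=100,
                   insert_mean=500, insert_sd=50, seed=None):
    """Yield (read1, read2) FASTQ records sampled from `genome`.

    Assumption: the insert length is the whole fragment length, drawn
    from a normal distribution and clamped to [read_len, len(genome)].
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)
    qual = "I" * read_len  # 'I' is Phred 40 in Phred+33 encoding
    for i in range(n_pairs):
        frag_len = int(round(rng.gauss(insert_mean, insert_sd)))
        frag_len = max(read_len, min(frag_len, len(genome)))
        start = rng.randint(0, len(genome) - frag_len)
        fragment = genome[start:start + frag_len]
        # Sample the fragment from either strand with equal probability.
        if rng.random() < 0.5:
            fragment = reverse_complement(fragment)
        read1 = fragment[:read_len]                       # 5' end of the fragment
        read2 = reverse_complement(fragment)[:read_len]   # opposite end, other strand
        name = f"sim_{i}"  # hypothetical but unique read name
        yield (f"@{name}/1\n{read1}\n+\n{qual}\n",
               f"@{name}/2\n{read2}\n+\n{qual}\n")

if __name__ == "__main__":
    demo_genome = "".join(random.Random(0).choices("ACGT", k=2000))
    for r1, r2 in simulate_pairs(demo_genome, 2, read_len=50,
                                 insert_mean=300, insert_sd=30, seed=1):
        print(r1 + r2, end="")
```

Writing the first element of each pair to one file and the second to another gives the usual R1/R2 FASTQ pair. The reads are error-free, so this only tests the aligner and variant caller under ideal conditions.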

I am already dealing with bacterial genomes. They frequently have inversions/duplications etc. so I want to generate a random sequence which will (hopefully) be free of such structures. Thanks for No.3 :)
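
A random sequence like this takes only a few lines of Python. The following is a minimal sketch, assuming a uniform, independent draw of A/C/G/T at every position; the names `random_dna` and `to_fasta` are invented for this example.

```python
import random

def random_dna(length, seed=None):
    """Return a random DNA string with equal probability of A, C, G and T."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))

def to_fasta(name, seq, width=60):
    """Format one sequence as FASTA text, wrapped at `width` bases per line."""
    lines = [f">{name}"]
    for i in range(0, len(seq), width):
        lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    genome = random_dna(1_000_000, seed=42)  # 1 Mbp, reproducible via the seed
    record = to_fasta("random_1Mbp", genome)
    print(len(genome), record.splitlines()[0])
```

Writing `record` to a file gives fasta1. Note that a purely random sequence can still contain short repeats by chance, although long ones are very unlikely at 1 Mbp.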


Everything works well; I didn't think I would pull this off so easily. All the tools are already there :) BTW, is there any chance there is an automated generator of inversions/duplications and such (point 2)? I am doing it manually, and it's a bit time-consuming.
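
In case it helps, point 2 can be scripted with plain string slicing. This is only a sketch: the helper names below are invented, coordinates are assumed to be 0-based and half-open (`start` included, `end` excluded), and the duplication is assumed to be a tandem one.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement (inversion)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]

def delete(seq, start, end):
    """Remove seq[start:end] (deletion)."""
    return seq[:start] + seq[end:]

def insert(seq, pos, fragment):
    """Insert `fragment` before position `pos` (insertion)."""
    return seq[:pos] + fragment + seq[pos:]

def duplicate(seq, start, end):
    """Copy seq[start:end] directly after itself (tandem duplication)."""
    return seq[:end] + seq[start:end] + seq[end:]

if __name__ == "__main__":
    ref = "AAAACCCC"
    print(invert(ref, 2, 6))     # inversion of the middle four bases
    print(delete(ref, 2, 6))     # deletion of the same interval
    print(duplicate(ref, 2, 6))  # tandem duplication of the same interval
```

When applying several events to one sequence, it is easiest to apply them from the rightmost coordinate to the leftmost, so that earlier edits do not shift the positions of later ones.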


BBMap has a recent addition called mutate.sh, which I made for testing the sensitivity of contaminant removal when the contaminants are bacterial strains of the same species. It creates a mutant variant of a genome. For example:

mutate.sh in=ecoli.fasta out=mutant.fasta id=0.95

This will create a mutant version of the original genome with 95% identity to the original. The mutations are random, with no conserved locations (though I may add that option later), so any duplications or inversions in the original will (probabilistically) not be present in the mutant, since the copies would have received different mutations. However, the general structure will still be similar to that of a real bacterium. If you want to generate synthetic reads from a bacteria-like sequence with no repeats or inversions, I suggest you run mutate.sh on a real bacterial genome, then use randomreads.sh on the mutant genome. 95% identity should be sufficiently low (averaging one mutation every 20 bp), though it depends on your specific needs.
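
To illustrate the relationship between identity and mutation spacing, here is a small Python sketch. It is not mutate.sh's algorithm: it assumes substitutions only, applied independently at each base with probability 1 - identity, and the function names are made up. At 95% identity the expected spacing is 1 / (1 - 0.95) = 20 bp.

```python
import random

def mutate_substitutions(seq, identity, seed=None):
    """Return a copy of `seq` with random substitutions.

    Assumption: each base is replaced by a different base with
    probability (1 - identity); no insertions or deletions are made.
    """
    rng = random.Random(seed)
    rate = 1.0 - identity
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def fraction_identical(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    original = "".join(random.Random(0).choices("ACGT", k=100_000))
    mutant = mutate_substitutions(original, identity=0.95, seed=1)
    print(f"observed identity: {fraction_identical(original, mutant):.3f}")
    print(f"expected spacing: {round(1 / (1 - 0.95))} bp")
```

Because two copies of a repeat are mutated independently, they end up differing from each other at roughly twice that rate, which is why the repeats effectively disappear.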

6.3 years ago
Johan Zicola ▴ 70

I wrote a Python script with the functions you would need to test structural variation calling on FASTQ files that are either randomly generated or derived from a specified FASTA file. The script and documentation are at https://github.com/johanzi/fastq_generator

8.3 years ago
Aerval ▴ 290

A review on various tools: http://www.nature.com/doifinder/10.1038/nrg.2016.57


superawesome! Thanks
