Question

Short Read Simulator For CNV Indel?

0

13.4 years ago

Pascal ★ 1.5k

Hi

Is there a way to simulate short reads with CNV indel (1-50kb)?

I've read the wgsim manual, for instance, but it appears to generate small indels only.

Regards.

simulation cnv next-gen sequencing • 5.9k views

ADD COMMENT • link updated 13.4 years ago by Stefano Berri 4.4k • written 13.4 years ago by Pascal ★ 1.5k

0

Some questions for you:

- Copy number variation (CNV) or indel?
- Does it make sense genetically to have an insertion of 50 kb?
- How many cases of large indels have been described, and in which regions of the genome?
- Large indels are most likely not neutral; how many of these could exist per genome at the same time (possibly max. 1)?

ADD REPLY • link 13.4 years ago by Michael 55k

0

Further questions: Diploid or haploid genome/simulation?

ADD REPLY • link 13.4 years ago by Michael 55k

0

So is "CNV indel" supposed to mean simulating a gene duplication or deletion?

ADD REPLY • link 13.4 years ago by Michael 55k

score 3 · Answer 1 · 2011-11-28

3

13.4 years ago

Stefano Berri 4.4k

I second Michael. Produce an "altered" genome and use wgsim from there. However, to make it realistic, you will have to make two copies of each chromosome and introduce a CNV in one of them (*). Do NOT insert random sequences; if you need to make an amplification/duplication, insert a sequence copied from somewhere else, so that you will be able to find which regions have been duplicated.

When I did some simulations, I found that version 2.6 is better than 3.0, as 3.0 seems to have some sort of "chromosome-specific" bias: I was getting uneven coverage...

(*) Be careful though. wgsim will produce mutations and small indels from each chromosome copy, so their frequency will be twice as high (because you have twice as many chromosomes), but each mutation will appear either as heterozygous or, apparently, in 25% of your reads (as opposed to 100% or 50%).
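The two-copies idea above can be sketched in a few lines of Python. This is only a minimal illustration, not anything taken from wgsim: the function names (`insert_copy`, `delete_segment`, `diploid_records`, `write_fasta`) and the haplotype naming scheme are made up for the example, and coordinates are assumed to be 0-based with an exclusive end.

```python
import io


def insert_copy(seq, src_start, src_end, dest=None):
    """Copy seq[src_start:src_end] and insert it at dest (default: tandem, right after the source)."""
    if dest is None:
        dest = src_end
    segment = seq[src_start:src_end]
    return seq[:dest] + segment + seq[dest:]


def delete_segment(seq, start, end):
    """Remove seq[start:end]."""
    return seq[:start] + seq[end:]


def diploid_records(name, seq, start, end, kind="dup"):
    """Return two haplotypes: hapA is the untouched reference, hapB carries the CNV."""
    if kind == "dup":
        alt = insert_copy(seq, start, end)
    elif kind == "del":
        alt = delete_segment(seq, start, end)
    else:
        raise ValueError(f"unknown CNV kind: {kind}")
    # The event is encoded in the record name so it can be recovered later.
    return [(f"{name}_hapA", seq), (f"{name}_hapB_{kind}_{start}_{end}", alt)]


def write_fasta(records, handle, width=60):
    """Write (name, sequence) pairs as FASTA with fixed-width sequence lines."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Toy usage: a 16 bp "chromosome" with a tandem duplication of bases 4..8.
toy = "AAAACCCCGGGGTTTT"
out = io.StringIO()
write_fasta(diploid_records("chr20", toy, 4, 8, kind="dup"), out)
print(out.getvalue())
```

The resulting single FASTA file (both haplotypes together) is what would then be handed to the read simulator.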

ADD COMMENT • link 13.4 years ago by Stefano Berri 4.4k

0

Btw: to simulate realistic loci on the X/Y chromosomes as well, these should not be duplicated, only the autosomes. If variation on the sex chromosomes is required, they should be treated separately.

ADD REPLY • link 13.4 years ago by Michael 55k

0

First, sorry for coming back to this so late. If I understood you correctly, Stefano, to introduce a CNV in chromosome 20 (for instance) I should: 1) take the reference FASTA file for chr20, 2) copy it, 3) introduce my CNV in one of the two files, 4) process both FASTA files with wgsim to create reads corresponding to the diploid genome. Is that what you mean?

ADD REPLY • link 13.3 years ago by Pascal ★ 1.5k

0

Yes. I would recommend keeping all the sequences together in a single FASTA file. That way the number of reads will be proportional to the length of the new chromosomes. Otherwise you will get relatively fewer reads from the chr20+insert (unless you specify the right number of reads for each FASTA file).
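If the haplotypes do end up in separate FASTA files, the number of read pairs per file has to be scaled by sequence length to keep the depth even. A small sketch of that arithmetic, assuming paired-end reads of equal length; the helper name `read_pairs_for_coverage` and the toy lengths are invented for the example, and the result is the kind of value one would pass to wgsim's `-N` option (number of read pairs):

```python
def read_pairs_for_coverage(total_bases, coverage, read_len):
    """Read pairs needed so that 2 * read_len * pairs / total_bases == coverage."""
    return round(coverage * total_bases / (2 * read_len))


# Toy numbers: a 1 Mb haplotype and the same haplotype carrying a 50 kb insert,
# each simulated at 30x with 2 x 100 bp reads.
for label, length in [("hapA", 1_000_000), ("hapB_plus_50kb", 1_050_000)]:
    pairs = read_pairs_for_coverage(length, coverage=30, read_len=100)
    print(label, pairs)
```

With everything in one FASTA file this bookkeeping is not needed, which is the point of the recommendation above.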

Unless you are planning to find reads across the breakpoint, you can just add the fasta of the amplified segment.

Hope this helps.

ADD REPLY • link 13.3 years ago by Stefano Berri 4.4k

0

Hi all.

I'm trying to test tools to generate such genomes with SVs. I'm already testing SCNVSim, but I'd like to try other tools. Any other options now in 2015?

Thanks!

ADD REPLY • link 9.7 years ago by Leandro Lima ▴ 970

score 1 · Answer 2 · 2011-11-28

1

13.4 years ago

Michael 55k

A read simulator with this feature is not required (maybe one exists anyway, but who cares?). You can simply modify the input sequence, i.e. the reference genome. Draw N random chromosome locations (chromosome, position); not much more than N = 1 makes sense to me. Draw the desired indel length from, e.g., a Poisson distribution. For a deletion, delete that many bases from your FASTA sequence at that point; for an insertion, insert a random sequence of that length there. Give the resulting file as input to your read simulator. Your answers to my comments should give you the right parameters for a little script that does this.
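The recipe above could look roughly like this in Python. It is a sketch under stated assumptions, not a tested tool: all function names are hypothetical, the genome is held in memory as a dict of strings, and because the standard library has no Poisson sampler the length is drawn from a normal approximation to the Poisson (reasonable only for large means).

```python
import math
import random


def draw_length(mean_len, rng):
    """Normal approximation to a Poisson draw with the given mean (at least 1)."""
    return max(1, round(rng.gauss(mean_len, math.sqrt(mean_len))))


def random_dna(n, rng):
    """Random sequence of n bases."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def footprint(length, kind):
    """Reference bases covered by an event: deletions span `length`, insertions span 0."""
    return length if kind == "deletion" else 0


def plan_events(genome, n_events, mean_len, rng, max_attempts=1000):
    """Draw non-overlapping events as (chrom, start, length, kind) in reference coordinates."""
    events = []
    for _ in range(max_attempts):
        if len(events) == n_events:
            break
        chrom = rng.choice(sorted(genome))
        length = draw_length(mean_len, rng)
        if length >= len(genome[chrom]):
            continue
        start = rng.randrange(0, len(genome[chrom]) - length)
        kind = rng.choice(["deletion", "insertion"])
        end = start + footprint(length, kind)
        # Reject anything touching an event already planned on the same chromosome.
        if any(c == chrom and start <= s + footprint(l, k) and s <= end
               for c, s, l, k in events):
            continue
        events.append((chrom, start, length, kind))
    return events


def apply_events(genome, events, rng):
    """Apply events right to left so earlier reference coordinates stay valid."""
    altered = dict(genome)
    for chrom, start, length, kind in sorted(events, key=lambda e: e[1], reverse=True):
        seq = altered[chrom]
        if kind == "deletion":
            altered[chrom] = seq[:start] + seq[start + length:]
        else:
            altered[chrom] = seq[:start] + random_dna(length, rng) + seq[start:]
    return altered


# Toy usage on a random two-chromosome "genome".
rng = random.Random(42)
genome = {"chr1": random_dna(20000, rng), "chr2": random_dna(15000, rng)}
events = plan_events(genome, n_events=2, mean_len=1000, rng=rng)
altered = apply_events(genome, events, rng)
for event in events:
    print(*event)
```

The returned event list is the record of where each change was made, expressed in coordinates of the unmodified reference.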

What do you want to do with it, btw? These variations will be very easy to detect by lack of coverage in the region anyway, given your coverage is high enough.
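To illustrate the coverage argument: a heterozygous deletion should show up as a stretch at roughly half the usual depth. A rough sketch, assuming per-base depth is already available as a plain list; the function name, window size and threshold are arbitrary choices for the example, not a recommended caller.

```python
from statistics import mean, median


def low_coverage_windows(depth, window, fraction=0.75):
    """Return (start, end) of fixed-size windows whose mean depth is below fraction * median window mean."""
    means = [mean(depth[i:i + window])
             for i in range(0, len(depth) - window + 1, window)]
    baseline = median(means)
    return [(i * window, (i + 1) * window)
            for i, m in enumerate(means) if m < fraction * baseline]


# Synthetic depth track: 30x everywhere except a 200 bp stretch at 15x.
depth = [30] * 1000 + [15] * 200 + [30] * 800
print(low_coverage_windows(depth, window=100))
```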

ADD COMMENT • link 13.4 years ago by Michael 55k

1

No, your question is not pointless; it is just that it would be very easy to write a script that does this, provided you need more than 1-2 inserts. For fewer, I guess opening a FASTA file and editing a position (and recording it exactly) would still be faster for me than writing a small script. Maybe I'll write an example for this application in R, though, just for fun.

So, do you need help with such a script?

ADD REPLY • link 13.4 years ago by Michael 55k

0

Thanks Michael. I understand from your answer and comment that my question is pointless. However, I aim to compare several SV detection algorithms, from small to large indels, in terms of accuracy, speed, memory/CPU consumption, etc. I thought that simulating a dataset with different events (not only small ones) could be useful for that purpose. I can of course insert events manually as you described, but this is not very convenient and is prone to errors. Thanks for your answer, it helps me a lot to understand what I am doing :-)

ADD REPLY • link 13.4 years ago by Pascal ★ 1.5k

0

No, please!! Don't spend more time on this: you have already helped me a lot, really! I would feel embarrassed if you dedicated more of your time to it. So, if I understand correctly, the idea is to edit the FASTA file of a reference genome, copy a portion of the sequence n times (this portion being shorter than the read length, I guess) and then generate reads with a read simulator. Right?

ADD REPLY • link 13.4 years ago by Pascal ★ 1.5k

0

Yes, that's what one could do (and I personally would do that only if the number n was <= 3; otherwise I would write a script). It doesn't sound like rocket science, and I agree it is not a very clean solution. Actually it's quick and dirty, and you need to record exactly where you put that sequence. For producing heterozygous loci, please refer to Stefano's answer.
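For the "record exactly where you put it" part, even a tiny tab-separated truth file helps when comparing callers later. A minimal sketch, assuming events are kept as (chrom, start, end, kind) tuples with 0-based, end-exclusive coordinates (a BED-like layout chosen for the example, not a format required by any tool):

```python
import csv
import io


def write_truth_table(events, handle):
    """Write events sorted by chromosome and start as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chrom, start, end, kind in sorted(events):
        writer.writerow([chrom, start, end, kind])


# Toy usage with two hand-made events on chr20.
events = [("chr20", 5000, 5001, "insertion"), ("chr20", 100, 600, "deletion")]
out = io.StringIO()
write_truth_table(events, out)
print(out.getvalue(), end="")
```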

ADD REPLY • link 13.4 years ago by Michael 55k