Question

NGS Sample contamination simulation

8.3 years ago

MAPK ★ 2.1k

I am trying to simulate sample contamination at different dilution levels for NGS samples. Suppose I have two BAM files, one for SampleA and one for SampleB. I want to generate five contaminated samples from them, at dilutions of 10%, 20%, 30%, 40% and 50%. I understand that I should extract reads from one of the two BAM files at the given dilution percentage and reassign them to the other BAM file, but I don't know exactly how to do this. Can someone please explain the procedure to me? Thanks

NGS • 1.9k views

updated 8.3 years ago by Devon Ryan 104k • written 8.3 years ago by MAPK ★ 2.1k

score 2 · Accepted Answer · 2016-06-14

8.3 years ago

Devon Ryan 104k

I'm not sure at what level of complexity to lay out the procedures, so let me know if the following doesn't suffice.

1. Generate a large amount of sequence from both samples A and B.
2. Shuffle the order of both files, since typically the read generators generate reads in sorted order.
3. Take the first 90% of one sample and concatenate the first 10% of the other onto it (assuming you generated equal numbers of reads).
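The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, assuming the reads from each sample are already in FASTQ files with equal read counts; the function names `read_fastq` and `contaminate` and the `seed` parameter are hypothetical, not part of any existing tool.

```python
import random


def read_fastq(path):
    """Return a list of 4-line FASTQ records (hypothetical helper)."""
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    usable = len(lines) - len(lines) % 4  # ignore a trailing partial record
    return [tuple(lines[i:i + 4]) for i in range(0, usable, 4)]


def contaminate(reads_a, reads_b, fraction_b, seed=0):
    """Mix two read sets so that `fraction_b` of the output comes from B.

    Assumes equal numbers of reads in A and B, as in step 3.
    """
    if len(reads_a) != len(reads_b):
        raise ValueError("expected equal numbers of reads in A and B")
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must be between 0 and 1")
    rng = random.Random(seed)
    a = list(reads_a)
    b = list(reads_b)
    rng.shuffle(a)  # step 2: shuffle both read sets
    rng.shuffle(b)
    n_b = round(len(b) * fraction_b)  # reads taken from the contaminant
    n_a = len(a) - n_b                # remainder comes from the main sample
    return a[:n_a] + b[:n_b]          # step 3: concatenate the two slices
```

Calling `contaminate(reads_a, reads_b, f)` for `f` in 0.1, 0.2, 0.3, 0.4 and 0.5 would give the five mixtures asked about in the question.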

You could do this using a random number generator too, but this simpler procedure will likely suffice. For shuffling reads, have a look here: "Randomize Read Order In Multigbp Fastq File?". Note that handling paired-end reads is a bit more complicated, though only due to the shuffling (you can find commands for that here as well).
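To illustrate why paired-end data needs a little more care: both mate files have to be shuffled with the same permutation so that each read stays matched to its mate. A minimal sketch, assuming the two mate lists are already in memory and in the same order (the function name `shuffle_pairs` is hypothetical):

```python
import random


def shuffle_pairs(mates1, mates2, seed=0):
    """Shuffle two mate lists together so that pairs stay in sync."""
    if len(mates1) != len(mates2):
        raise ValueError("mate lists must have the same length")
    pairs = list(zip(mates1, mates2))
    random.Random(seed).shuffle(pairs)  # one permutation applied to both mates
    if not pairs:
        return [], []
    shuffled1, shuffled2 = zip(*pairs)
    return list(shuffled1), list(shuffled2)
```

The first 90% / first 10% slices can then be taken from the shuffled mate lists exactly as for single-end reads, using the same counts for both mates.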

Thanks Devon. Could you please explain point 2 a bit more (Shuffle the order of both files, since typically the read generators generate reads in sorted order)? How would I select reads based on chromosome position (or do I even need to consider chromosome positions)? Say I have reads from chr2:220333-chr2:24444432 of SampleA and want to shuffle them into SampleB; how can I do this the right way?

ADD REPLY • link 8.3 years ago by MAPK ★ 2.1k

You don't need to perform any selection. If you just want to look at a specific region then restrict the reads generated to only arise from that region (if nothing else, make a fasta file from only that region and generate reads from it).
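As a rough illustration of the "make a fasta file from only that region" idea, the following sketch cuts a region out of a sequence using 1-based inclusive coordinates. The helper names `read_fasta` and `extract_region` are hypothetical, and in practice an existing tool such as `samtools faidx` is probably the more convenient way to do this.

```python
def read_fasta(path):
    """Load a FASTA file into a dict of name -> sequence (hypothetical helper)."""
    sequences = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # keep the ID, drop the description
                sequences[name] = []
            elif name is not None and line:
                sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


def extract_region(sequences, name, start, end):
    """Return sequences[name][start..end], 1-based and inclusive."""
    seq = sequences[name]
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    return seq[start - 1:end]
```

The extracted sequence can be written to a new FASTA file and passed to the read generator, so that all simulated reads arise from that region only.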

ADD REPLY • link 8.3 years ago by Devon Ryan 104k