I have generated synthetic reads using wgsim as follows:
wgsim -1 100 -2 100 -r 0 -R 0 -X 0 -e 0 GCA_900186885_1.fa reads_1.fastq reads_2.fastq
This created paired-end reads of length 100 bp with no errors, mutations, or indels. I would now like to subset those reads so as to produce reads with variable levels of coverage (i.e., depth of coverage). In other words, I would like some reads to have 1x coverage, some to have 2x, some 3x and so on until 30x coverage. I would also like to randomly choose which reads get which level of coverage.
My ultimately goal is to produce a synthetic set of reads with variable levels of coverage that should in theory map perfectly to the reference genome from which they were derived. I have searched far and wide, to no avail, for a program that can do such subsetting as I described. Does anyone know of some simple bash, R or python coding that could pull this off? Any insight would be much appreciated.
map your reads to a bam file , get the coverage for the RAW bam, and the run
samtools view -s
according to your needs: http://www.htslib.org/doc/samtools-view.htmlI'm not sure mapping before subsetting will work if unmapped reads get thrown out. I'd like to avoid that step if at all possible. Surely there's an easier way to produce synthetic reads with variable (known) known levels of coverage.