I'm attempting to simulate reads from the human transcriptome to test an RNAseq pipeline using Heng Li's wgsim. This seems to be the most widely used read simulator and is pretty easy to use, however I've run into a problem which may be related to some of the reference sequences (human transcripts in this case) being quite short.
The problem is that wgsim does not produce the number of reads specified by the -N
flag. I'm using the following command line to produce single-end error and mutation free reads
wgsim -e0 -N 40000 -1 250 -d1 -s1 -r 0 -R 0 -X 0 in.fa out.fq /dev/null
However in wgsim does not produce the specified 40,000 reads. Sometimes it outputs a number close to the specified total, while others it outputs much fewer.
To work around this I've been overproducing reads then subsampling using seqtk, but I was wondering if anyone had seen this behavior before and had any kind of solution?
Thanks.
I see, that makes complete sense. However, I have one set of transcripts that consistently produces less than half the number of reads that I specify. However, other references made similarly, only output a few less than specified. Any idea what could be causing that type of behavior?
One situation that could lead to strange results would occur when the number of transcripts is of the same order of magnitude as N.