I'm trying to simulate metagenomic reads using Grinder. I want to simulate paired-end reads, so I need to select arguments for insert size and distribution. This is what they say:
-id <insert_dist>... | -insert_dist <insert_dist>... Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: 0
I understand their definition, but I don't quite understand how to decide the appropriate insert size. I'm trying to generate reads that I can "spike" into existing metagenomic reads for some in silico testing. I know what library prep kit was used to generate those metagenomic reads, and I know they were sequenced on a NextSeq generating 150-bp, paired-end reads. According to the Grinder definition, I need to choose an insert size that encompasses both reads (so, 300 bp) and a stretch of DNA between them. How can I know an appropriate value for the stretch of DNA between the reads?
Also, as a follow-up question, how does one decide whether to choose a uniform or normal distribution, and if you choose a normal distribution what's an appropriate standard deviation? I understand these distributions, just not how to choose between them or appropriate parameters. I'm missing biological/technical info related to sequencing, I think.
Thanks!
Just use the existing data to infer the insert size distribution of real data plus SD and use that. That's what I'd do.
But how do you infer the insert size when the data "in between the inserts" isn't in your dataset? I'm imagining mapping the metatranscriptomic reads to a reference database and then consolidating the number of bases between each mate pair. Then I could just get the mean and standard deviation of these numbers. For this method, I guess the first thing is I have no idea how to get that list of # of bases between each read 1 and read 2 pair...from the vcf file, I guess? And the second question would be: does it matter that it's metatranscriptomic reads mapping to a bunch of different things (bacteria, viruses, fungi) but I'm generating reads for one specific strain of a bacterial species? I guess I could "bucket" the mapped reads by phylum and look at the mean & std for each...and answer this question myself. I have no idea how to do these things...but is this along the lines of what you'd do, or am I over-complicating it?
TLEN is a field in the bam file. Don't overthink, it's a standard metric returned by all paired-end aligners.