Hello,
I am having trouble calculating the number of reads I should expect for a target sequence (for instance a transposon) integrated into the human genome. That is: how many reads should I expect to map to my target sequence and confirm its presence?
Assuming 1) a pre-calculated coverage of 20x, 2) a target region of 1000 bp and 3) a fixed read length of 150 bp, and using the formula C=NL/G (C = coverage, N = number of reads, L = read length, G = length of the region), I get:
N = CG/L = 20 x 1000 / 150 ≈ 133 reads
This looks like a few too many reads. Or should I calculate using the whole human genome, since the target is integrated into it? In that case, I get:
N = 20 x 3 000 000 000 / 150 = 400 000 000 reads
That is clearly wrong.
My question is, therefore: how do I calculate the coverage in general and for integrated sequences in particular? Thank you
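The two calculations in the question can be reproduced with a small sketch like the one below. It only rearranges C=NL/G to solve for N; the function name `reads_for_coverage` is made up for illustration, and the 3 000 000 000 bp genome length is the rounded figure used in the question, not an exact assembly size.

```python
def reads_for_coverage(coverage: float, region_length: int, read_length: int) -> float:
    """Solve C = N * L / G for N, the number of reads.

    Assumes reads of one fixed length spread uniformly over the region.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return coverage * region_length / read_length


# Values taken from the question: 20x coverage, 150 bp reads.
target_reads = reads_for_coverage(20, 1_000, 150)          # 1000 bp transposon
genome_reads = reads_for_coverage(20, 3_000_000_000, 150)  # rounded human genome

print(f"target: about {target_reads:.0f} reads")
print(f"genome: {genome_reads:,.0f} reads")
```

The first number is what G = 1000 bp gives, the second is what G = whole genome gives; they answer two different questions (reads on the target versus reads in the whole run).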
But where do you get this pre-calculated coverage of 20 from?
Probably shooting for a 20x coverage.
The data was given with this coverage, but based on the human genome. I would like to estimate how many reads I should expect for the transposon.
Is the sequence for that transposon so specific that you don't expect to get any alignments outside that 1kb?
Well, the sequence is not human as such, but otherwise there is nothing special about the target; reads should align at more or less the same average depth to both the human genome and the transposon. So should I also expect 20x coverage for the transposon?
If you are sure the transposon is only in that one location (which seems a bit implausible), then that may be a reasonable assumption, as long as there is no strange bias in the transposon sequence compared to the human genome.
OK. Otherwise, is the formula correct? Should I expect 133 reads covering the transposon?
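One way to sanity-check the 133 figure is to start from the whole-genome read count and take the fraction of the genome that the transposon occupies. The sketch below assumes uniform coverage with no sequence bias and a single integration site, as discussed above; the function name and the `copies` parameter are hypothetical, added only for illustration.

```python
def expected_target_reads(total_reads: float, genome_length: int,
                          target_length: int, copies: int = 1) -> float:
    """Expected reads on a target, as its share of all reads.

    Assumes reads are spread uniformly over the genome and that the
    target is present `copies` times (1 = a single integration site).
    """
    return total_reads * target_length * copies / genome_length


# 400 000 000 reads is the whole-genome figure from the question
# (20x coverage, 150 bp reads, genome rounded to 3 000 000 000 bp).
single_copy = expected_target_reads(400_000_000, 3_000_000_000, 1_000)
two_copies = expected_target_reads(400_000_000, 3_000_000_000, 1_000, copies=2)

print(f"single integration site: about {single_copy:.0f} reads")
print(f"two integration sites:   about {two_copies:.0f} reads")
```

Under these assumptions the result is again about 133 reads for a single copy, so the two calculations in the question agree with each other: the 400 000 000 figure is the size of the whole run, and about 133 of those reads are expected to fall on the 1 kb target. If the transposon is present at more than one site, the expected count scales with the copy number.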