6.1 years ago
pegeot.henri
Hello, I am performing exome comparisons between different technologies and different sequencing conditions.
For a fair comparison, I have downsampled my datasets to 5M, 10M, ..., 60M reads (M = million reads).
Among other things, I compare exomes sequenced at 2 x 150 bp with exomes sequenced at 2 x 75 bp. In both cases I have normalized to 5M reads. But for a fair comparison, should I not compare 2 x 150 bp downsampled to 5M reads with 2 x 75 bp downsampled to 10M reads?
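To make the arithmetic behind this question explicit, here is a minimal Python sketch. It assumes "5M" counts sequences of the stated length (whether that means reads or pairs does not change the ratio), and the function names are made up for illustration.

```python
def sequenced_bases(n_reads: int, read_length: int) -> int:
    """Total bases sequenced for n_reads reads of read_length bp each."""
    return n_reads * read_length


def reads_for_equal_bases(ref_reads: int, ref_length: int, other_length: int) -> int:
    """Number of reads of other_length bp that yield the same number of
    bases as ref_reads reads of ref_length bp."""
    return round(ref_reads * ref_length / other_length)


# 5M reads at 150 bp versus the matching read count at 75 bp.
matched = reads_for_equal_bases(5_000_000, 150, 75)
print(matched)                               # 10000000
print(sequenced_bases(5_000_000, 150))       # 750000000
print(sequenced_bases(matched, 75))          # 750000000
```

So at equal read counts the 2 x 150 bp data contain twice as many bases, and matching the base yield means doubling the number of 2 x 75 bp reads.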
What do you think?
Thank you! Henri
Hello pegeot.henri ,
why do you think downsampling is necessary for a comparison? What are you trying to compare?
fin swimmer
I am interested in comparing standard QC metrics. In particular, I am interested in the target coverage efficiency as a function of the number of reads. Downsampling is usually done in such cases because it allows a comparison at the same amount of sequencing.
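As a rough sketch of what such a metric could look like, the Python below normalizes by sequenced bases rather than by read count. The definitions (on-target fraction, mean target depth per gigabase sequenced) are assumptions for illustration, not a standard formula from this thread, and the input numbers are made-up placeholders, not real results.

```python
def on_target_fraction(on_target_bases: int, sequenced_bases: int) -> float:
    """Share of all sequenced bases that align inside the target regions."""
    return on_target_bases / sequenced_bases


def mean_target_depth(on_target_bases: int, target_size: int) -> float:
    """Average depth over the target, in x."""
    return on_target_bases / target_size


def depth_per_gigabase(on_target_bases: int, target_size: int,
                       sequenced_bases: int) -> float:
    """Mean target depth obtained per gigabase of sequencing (assumed metric)."""
    return mean_target_depth(on_target_bases, target_size) / (sequenced_bases / 1e9)


# Placeholder values: 750 Mb sequenced, 450 Mb on target, 35 Mb target.
print(on_target_fraction(450_000_000, 750_000_000))  # 0.6
print(depth_per_gigabase(450_000_000, 35_000_000, 750_000_000))
```

Computed this way, kits run at different read lengths can be compared at the same sequencing effort in bases.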
Downsampling to roughly the same number of bases is probably not how this needs to be done. A 150 bp read (or a large part of it) is likely to map more specifically than a 75 bp read.
What sort of difference did you see in the alignments before you did any downsampling?
To give more context, I work in a hospital and I am looking for the best-performing sequencing kit. This will lead to the choice of a technology for routine patient analysis. One of the key metrics I want to investigate is the target coverage efficiency for the same sequencing effort.
Concerning 2 x 150 bp and 2 x 75 bp: after downsampling, without even looking at the alignment, I see that the FASTQ files for 2 x 150 bp sequencing are twice as large as those for 2 x 75 bp, which is expected. The comparison will be biased if I go further like this. Should I double the number of reads for 2 x 75 bp? I am fine with differences in the alignment between 2 x 75 bp and 2 x 150 bp. For me, this is part of the technological comparison.
Hello pegeot.henri ,
comparing file sizes is never a good idea, because they tell you nothing. Doubling the file size doesn't automatically mean you have twice as much information.
If you are looking for the best technology/sequencing kit, you should look at how even the coverage distribution is across your target regions, to get a sense of how many samples you can sequence in parallel.
But the most important thing you should do is compare the results of your variant calling pipeline.
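One way to put a number on coverage evenness is sketched below in Python. It assumes you already have per-base depths over the target (for example from a depth tool's output). The two statistics are simplified: the fold-80 value follows the idea of the fold-80 base penalty reported by Picard's hybrid-selection metrics (mean depth divided by the 20th-percentile depth), but uses a crude nearest-rank percentile and does not exclude zero-coverage targets. The depth lists are toy values.

```python
from statistics import mean


def fraction_at_least(depths: list[int], threshold: float) -> float:
    """Fraction of target bases covered at or above the given depth."""
    return sum(d >= threshold for d in depths) / len(depths)


def fold80_penalty(depths: list[int]) -> float:
    """Mean depth divided by the 20th-percentile depth (simplified).
    1.0 means perfectly even coverage; larger means less even."""
    ordered = sorted(depths)
    p20 = ordered[int(0.2 * len(ordered))]  # nearest-rank style percentile
    if p20 == 0:
        return float("inf")
    return mean(depths) / p20


even = [30] * 10
uneven = [5, 5, 10, 10, 30, 30, 40, 50, 60, 60]  # same mean depth of 30x

print(fold80_penalty(even))    # 1.0
print(fold80_penalty(uneven))  # 3.0
print(fraction_at_least(uneven, 0.2 * mean(uneven)))  # 0.8
```

Both toy samples have the same mean depth, but the uneven one would need about three times more sequencing to bring 80% of bases up to that mean, which directly limits how many samples fit on one run.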
fin swimmer