Hi, is there a tool that can measure how similar two paired-end FASTQ file sets are? Any metric would do, such as the number of identical reads or anything else relevant in this case.
I need this to diagnose the reason behind low agreement of variant calls between two identical runs: if the fastqs are quite similar to each other, then it was the variant-calling pipeline and not the upstream bench-work.
Thanks!
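For the "number of identical reads" idea, here is a minimal Python sketch. It is not an established tool, and it assumes standard 4-line FASTQ records and files small enough that all read sequences fit in memory. The function names `read_sequences` and `shared_read_fraction` are made up for illustration. It reports the fraction of read sequences shared between two files, counting duplicates (a multiset Jaccard index).

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record."""
    # Assumption: plain or gzip-compressed FASTQ with exactly 4 lines per read.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def shared_read_fraction(fastq_a, fastq_b):
    """Multiset Jaccard index of the read sequences in two FASTQ files."""
    counts_a = Counter(read_sequences(fastq_a))
    counts_b = Counter(read_sequences(fastq_b))
    shared = sum((counts_a & counts_b).values())  # min count per sequence
    total = sum((counts_a | counts_b).values())   # max count per sequence
    return shared / total if total else 0.0
```

For a paired-end set you would run this once on the R1 files and once on the R2 files. Note that two sequencing runs of the same library are not expected to produce many identical reads, so a low value here does not by itself indicate a problem.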
Maybe bamCorrelate from deepTools would be useful? You can also inspect the BAM files in a genome browser at the positions where the variants do not agree. Why do you want to compare the FASTQ files instead of the BAM files?
I thought that comparing the FASTQ files would narrow the problem down to the sequencing steps (without any downstream step having an effect). But I can compare the BAMs as well. Thanks!
That is a tough question. If the variant calls are in low agreement, then you know there is a problem with the data. I don't know whether it can be sorted out by finding how "similar" the two files are. Has the variant calling been repeated to rule out an issue with that step?
I thought that by comparing the FASTQ files I could rule out any effect of the downstream steps (QC, alignment, variant calling, etc.), which differ between the two sets, and so narrow the problem down to the sequencing steps. Thanks!
Do you mean that the same library has been sequenced on two different lanes of the same flow cell, or on two different flow cells? Either way, the variability between lanes or flow cells should be very small. Is the number of reads in the two runs comparable (if one has many more reads, you could pick up more SNPs)? I agree with John: in this case FastQC seems to be the easiest way to check how similar the FASTQ files are.
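To check quickly whether the read counts are comparable before opening the full FastQC reports, something like the following rough sketch could be used. It is not part of FastQC; the function name `fastq_summary` is hypothetical, and it assumes standard 4-line FASTQ records. It returns the read count, mean read length, and GC fraction for one file.

```python
import gzip


def fastq_summary(path):
    """Return read count, mean read length and GC fraction for one FASTQ file."""
    # Assumption: plain or gzip-compressed FASTQ with exactly 4 lines per read.
    opener = gzip.open if str(path).endswith(".gz") else open
    n_reads = 0
    n_bases = 0
    n_gc = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # 2nd line of each record is the sequence
                seq = line.strip().upper()
                n_reads += 1
                n_bases += len(seq)
                n_gc += seq.count("G") + seq.count("C")
    return {
        "reads": n_reads,
        "mean_length": n_bases / n_reads if n_reads else 0.0,
        "gc_fraction": n_gc / n_bases if n_bases else 0.0,
    }
```

Calling it on each file of both runs and comparing the resulting dictionaries side by side would show whether one run is much deeper than the other.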