Similarity between two sets of paired-end fastq files
8.7 years ago
Floydian_slip ▴ 170

Hi, is there a tool that can determine how similar two sets of paired-end fastq files are? Any metric would do, such as the number of identical reads or anything else relevant in this case.

I need this to diagnose the reason behind the low agreement of variant calls between two identical runs: if the fastqs are quite similar to each other, then the cause is the variant-calling pipeline and not the upstream bench work.

Thanks!

fastq paired-end RNA-Seq • 4.4k views

Maybe bamCorrelate from deepTools would be useful? You can also inspect the BAM files in a genome browser at the positions where the variants do not agree. Why do you want to compare the fastq files instead of the BAM files?


I thought that comparing the fastq files would narrow the problem down to the sequencing steps (without any downstream step having an effect). But I can compare the BAMs as well. Thanks!
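For the "number of identical reads" idea from the question, here is a minimal Python sketch. It assumes standard 4-line FASTQ records (no wrapped sequence lines) and that all sequences from both files fit in memory; the function names are made up for illustration and do not come from any existing tool. Keep in mind that two independent sequencing runs sample different fragments, so the fraction of exactly identical reads may be low even when nothing is wrong. Treat it as a rough indicator only.

```python
from collections import Counter
import gzip


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def shared_read_fraction(path_a, path_b):
    """Fraction of reads (by exact sequence) found in both files.

    Duplicates are counted as a multiset intersection, and the result is
    normalised by the size of the larger file.
    """
    counts_a = Counter(read_sequences(path_a))
    counts_b = Counter(read_sequences(path_b))
    shared = sum((counts_a & counts_b).values())
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / total if total else 0.0
```

For paired-end data you could call `shared_read_fraction` once on the two R1 files and once on the two R2 files (the file names are whatever your runs produced).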


That is a tough question. If the variant calls are in low agreement then you know there is a problem with the data. I don't know if it can be sorted out by finding how "similar" the two files are. Has the variant calling been repeated to rule out an issue with that step?


I thought that by comparing the fastq files I could rule out any effect of the downstream steps (QC, alignment, variant calling, etc.), which differ between the two sets, and so narrow the problem down to the sequencing steps. Thanks!


"two identical runs"

Do you mean that the same library has been sequenced on two different lanes of the same flow cell, or on two different flow cells? Either way, the variability between lanes or flow cells should be very small. Is the number of reads comparable between the two runs (if one has many more reads, you could pick up more SNPs)? I agree with John; in this case FastQC seems to be the easiest way to check how similar the fastqs are.
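A quick way to check whether the read counts are comparable is sketched below in Python. It assumes 4-line FASTQ records, and `count_reads` and `depth_ratio` are hypothetical helper names, not part of FastQC or any other tool mentioned here.

```python
import gzip


def count_reads(path):
    """Count records in a FASTQ file (plain or .gz), assuming 4 lines per read."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def depth_ratio(path_a, path_b):
    """Ratio of the smaller read count to the larger one (1.0 means equal depth)."""
    a, b = count_reads(path_a), count_reads(path_b)
    larger = max(a, b)
    return min(a, b) / larger if larger else 1.0
```

A ratio far below 1.0 would suggest that a difference in depth alone could explain some of the disagreement in the variant calls.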

8.7 years ago

Maybe you could try commet? It was designed for metagenomics, but it allows you to compute a distance between two fastq files.
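To give a feel for what a reference-free distance between two fastq files can look like, here is a small Python sketch that computes a Jaccard distance on k-mer sets. This is not how commet works internally; it is a swapped-in, simpler technique for illustration. It assumes 4-line FASTQ records, does not merge reverse-complement k-mers, and keeps every k-mer in memory, so for real data you would probably run it on a subsample of reads. The function names and the default `k=21` are my own choices.

```python
import gzip


def kmer_set(path, k=21):
    """Collect all k-mers (without N) from the reads of a FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    kmers = set()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 1:  # only the sequence line of each record
                continue
            seq = line.strip().upper()
            for j in range(len(seq) - k + 1):
                kmer = seq[j:j + k]
                if "N" not in kmer:
                    kmers.add(kmer)
    return kmers


def jaccard_distance(kmers_a, kmers_b):
    """1 - |A & B| / |A | B|; 0.0 means identical k-mer content."""
    union = len(kmers_a | kmers_b)
    if union == 0:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / union
```

A distance close to 0 between the two runs would point towards the downstream pipeline rather than the sequencing.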

8.7 years ago
John 13k

Have you run FastQC on both to see if there are contamination/sequence quality issues?

8.7 years ago
apelin20 ▴ 480

discoSNP lets you compare SNPs between read sets without a reference, using k-mers.
