Question

Similarity between two sets of paired-end fastq files

1

9.1 years ago

Floydian_slip ▴ 170

Hi, is there a tool that can determine how similar two sets of paired-end fastq files are? Any metric would do, such as the number of identical reads, or anything else that is relevant in this case.

I need this to diagnose the low agreement of variant calls between two identical runs: if the fastqs are quite similar to each other, then the cause is the variant-calling pipeline and not the upstream bench work.

Thanks!

fastq paired-end RNA-Seq • 4.8k views

ADD COMMENT • link updated 9.1 years ago by apelin20 ▴ 490 • written 9.1 years ago by Floydian_slip ▴ 170

1

Maybe bamCorrelate from deepTools would be useful? You can also inspect the bam files in a genome browser at the positions where the variants do not agree. Why do you want to compare the fastq files instead of the bam files?

ADD REPLY • link 9.1 years ago by GouthamAtla 12k

0

I thought that comparing the fastq files would narrow the problem down to the steps up to and including sequencing (without any downstream step having an effect). But I can compare the bams as well. Thanks!

ADD REPLY • link 9.1 years ago by Floydian_slip ▴ 170

0

That is a tough question. If the variant calls are in low agreement, then you know there is a problem with the data. I don't know whether it can be sorted out by measuring how "similar" the two files are. Has the variant calling been repeated to rule out an issue with that step?

ADD REPLY • link 9.1 years ago by GenoMax 150k

0

I thought that by comparing the fastq files I could rule out any effect of the downstream steps (QC, alignment, variant calling, etc.), which differ between the two sets, and so narrow the problem down to the steps up to and including sequencing. Thanks!

ADD REPLY • link 9.1 years ago by Floydian_slip ▴ 170

0

two identical runs

Do you mean that the same library has been sequenced on two different lanes of the same flow cell, or on two different flow cells? Either way, the variability between lanes or flow cells should be very small. Is the number of reads comparable between the two runs (if one has many more reads, you could pick up more SNPs)? I agree with John: in this case FastQC seems to be the easiest way to check how similar the fastqs are.

ADD REPLY • link 9.1 years ago by dariober 15k
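
For the read-count check and the "number of identical reads" metric mentioned in the question, here is a minimal Python sketch. It is not taken from any of the tools named in this thread; the function names (`open_fastq`, `sequence_counts`, `compare_fastq`) are made up for illustration. It assumes standard 4-line FASTQ records and keeps every read sequence in memory, so for full-size files you would probably want to subsample first. For paired-end data, run it once on the R1 files and once on the R2 files.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def sequence_counts(path):
    """Count how often each read sequence occurs (assumes 4-line records)."""
    with open_fastq(path) as handle:
        # Line 2 of every 4-line record holds the bases.
        return Counter(line.strip() for line in islice(handle, 1, None, 4))


def compare_fastq(path_a, path_b):
    """Return read totals and the overlap of read sequences between two files."""
    a = sequence_counts(path_a)
    b = sequence_counts(path_b)
    shared = sum((a & b).values())  # multiset intersection: min count per sequence
    total = sum((a | b).values())   # multiset union: max count per sequence
    return {
        "reads_a": sum(a.values()),
        "reads_b": sum(b.values()),
        "shared_reads": shared,
        "jaccard": shared / total if total else 1.0,
    }
```

Calling `compare_fastq("run1_R1.fastq.gz", "run2_R1.fastq.gz")` (hypothetical file names) returns a small dictionary with both read totals and the overlap. Keep in mind that two independent runs of the same library sample different fragments, so a low exact-read overlap does not by itself mean the sequencing went wrong; the read totals are the more directly interpretable numbers here.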

2

9.1 years ago

John 13k

Have you run FastQC on both to see if there are contamination/sequence quality issues?

ADD COMMENT • link 9.1 years ago by John 13k

0

9.1 years ago

apelin20 ▴ 490

discoSNP... it lets you compare SNPs between read sets without a reference, using k-mers.

ADD COMMENT • link 9.1 years ago by apelin20 ▴ 490

score 3 · Accepted Answer · 2016-03-17

3

9.1 years ago

Frédéric Mahé ★ 3.2k

Maybe you could try commet? It was designed for metagenomics, but it allows you to compute a distance between two fastq files.

ADD COMMENT • link 9.1 years ago by Frédéric Mahé ★ 3.2k
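
To give a rough idea of what a "distance between two fastq files" can look like, here is a small Python sketch based on the Jaccard distance between the k-mer sets of the two files. This is not commet's algorithm or its command line, only an illustration of a k-mer based distance; the function names and the default `k=21` are assumptions. It expects uncompressed 4-line FASTQ records, does not merge reverse-complement k-mers, and stores all distinct k-mers in memory, so it is only practical on subsampled files.

```python
from itertools import islice


def kmer_set(path, k=21):
    """Collect the distinct k-mers of all reads in an uncompressed FASTQ file."""
    kmers = set()
    with open(path) as handle:
        # Line 2 of every 4-line record holds the bases.
        for line in islice(handle, 1, None, 4):
            seq = line.strip().upper()
            for i in range(len(seq) - k + 1):
                kmers.add(seq[i:i + k])
    return kmers


def jaccard_distance(path_a, path_b, k=21):
    """Return 1 - |A & B| / |A | B| for the k-mer sets A and B of two files."""
    a = kmer_set(path_a, k)
    b = kmer_set(path_b, k)
    union = a | b
    if not union:
        return 0.0  # two files without any k-mers are treated as identical
    return 1.0 - len(a & b) / len(union)
```

A distance of 0 means both files contain exactly the same k-mers, and 1 means they share none. Sequencing errors create many unique k-mers, so even two good runs of the same library will not reach 0; the value is most useful when compared against a pair of runs you already trust.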