3.8 years ago
biohacker_tobe
Hello Community,
Is there existing software or an algorithm to compare genome files and determine whether they are the same or not?
Thanks :)
Do a hash (md5sum) and compare the hashes, or post an example of how you want to compare. Please note that I am aware of the FASTQ format, so do not share a link to it.
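For anyone who wants to do the hash comparison from a script rather than with md5sum on the command line, here is a minimal sketch. The function names `file_md5` and `same_bytes` are made up for illustration, and the check only says whether two files are byte-for-byte identical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True only if both files have identical content, byte for byte."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling `same_bytes("file1.fastq", "file2.fastq")` (hypothetical file names) gives the positive or negative answer. Note that a gzipped and an uncompressed copy of the same data will hash differently.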
That's an interesting take on this problem. Here is an example. As you can see, both are the same; I would just like a negative or positive reply depending on whether they are the same or not. FASTQ file 1:
FASTQ file 2:
If you have visual evidence like this then using a hash may be fine.
Let me illustrate a variation. Suppose there is even a single difference, e.g. the order of the sequences is switched.
Here is file2
This will then produce a different sum even though the data is the same.
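One way around the ordering problem is to hash the records in a fixed order instead of hashing the raw file. The sketch below is only an illustration: `read_fastq_records` and `order_independent_digest` are hypothetical names, it assumes strict 4-line FASTQ records with no blank lines, and it sorts everything in memory, so it is only suitable for files that fit in RAM.

```python
import hashlib


def read_fastq_records(path):
    """Yield (header, sequence, quality) tuples, assuming strict 4-line records."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a blank line, which this sketch treats as the end)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def order_independent_digest(path):
    """MD5 over the sorted records, so record order does not affect the result."""
    digest = hashlib.md5()
    for record in sorted(read_fastq_records(path)):
        digest.update("\n".join(record).encode() + b"\n")
    return digest.hexdigest()
```

Two files with the same reads in a different order then give the same digest, while any change to a header, sequence or quality string still changes it.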
This looks awesome, I will definitely try this out :)
How do you want to compare them? Exactly the same? The same sequences but in a different order?
I'm not sure if these files are the same. I have a directory with different FASTQ files, and basically I want to see whether they are exactly the same. I was thinking of comparing sequence/quality lengths and labels...
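A record-by-record comparison along those lines could look roughly like this. It is only a sketch: `fastq_records` and `first_difference` are made-up names, it compares the full label, sequence and quality strings rather than just their lengths, and it assumes strict 4-line records in the same order in both files.

```python
from itertools import zip_longest


def fastq_records(path):
    """Yield (label, sequence, quality) tuples, assuming strict 4-line records."""
    with open(path) as handle:
        while True:
            label = handle.readline().rstrip("\n")
            if not label:
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield label, sequence, quality


def first_difference(path_a, path_b):
    """Return None if all records match, else (record_index, what_differs)."""
    pairs = zip_longest(fastq_records(path_a), fastq_records(path_b))
    for index, (rec_a, rec_b) in enumerate(pairs):
        if rec_a is None or rec_b is None:
            return index, "record count"  # one file ran out of records first
        for name, a, b in zip(("label", "sequence", "quality"), rec_a, rec_b):
            if a != b:
                return index, name
    return None
```

A return value of `None` is the positive answer; anything else points at the first record (0-based) and field where the two files disagree.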
You need to be very specific in defining your requirement. Do you think there are identical copies of the data under different file names, or do you think the same sample(s) were sequenced again?
Sorry for the lack of clarification on my part... I believe it's possible that I have samples that have been re-sequenced.
So basically you want to see if these are technical sequencing replicates or not.
You could align the data independently to a reference and see if you are able to call identical SNPs for the data files. Short of knowing the real experimental provenance, this is likely to be the closest you can informatically get to deciding if the data came from the same sample.
I think you can still use the hash approach, but look at mash distances instead. I think you can do it with fastqs, but not 100% sure. This will tell you to some level of accuracy that the genomes are very similar or the same. An actual md5sum will only work if the files are identical, as others pointed out, so a resequencing of the same sample/genome will not necessarily give you an identical md5, but a mash distance should be instructive. If you can't use fastqs, you can definitely use contigs, so you can just assemble your data first.
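To give an idea of what a MinHash-style comparison does, here is a toy sketch in plain Python. It is not mash itself and does not call it: the names `kmers`, `kmer_hash`, `sketch` and `jaccard_estimate`, the k-mer size and the sketch size are all assumptions for illustration, and unlike a real tool it ignores reverse complements and sequencing errors. It keeps the smallest k-mer hashes of each dataset and estimates how much of the k-mer content is shared.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer hash."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(sequences, k=21, size=1000):
    """Bottom sketch: the `size` smallest distinct k-mer hashes of the input."""
    hashes = set()
    for seq in sequences:
        hashes.update(kmer_hash(km) for km in kmers(seq.upper(), k))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index of the two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the smallest hashes of the union that occur in both sketches.
    return sum(1 for h in merged if h in shared) / len(merged)
```

An estimate near 1 suggests the two read sets or assemblies share almost all of their k-mers, regardless of read order or file layout; a value near 0 suggests unrelated data. For real data the actual mash tool is the better choice, since it is built for large inputs.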