I have downloaded a couple hundred genome reads from the SRA and want to compute the genetic distances between each pair of reads. So far, the only way that I've been able to do so is to convert the SRA files to FASTQ format and use Dashing2 to compute the genetic distance matrix. Is this an appropriate method to obtain genetic distances between each pair of reads, or is there a better approach that you recommend?
Not sure why you're messing around with read (fastq) files. Just use Mash on the genome FASTA files: https://github.com/marbl/Mash.
Likely because OP has short read files that are from genomes, i.e. not actual genomes.
mikazon, are you looking to compute this for individual reads or for the full genomes represented by those reads?
Thank you for clarifying that. I misunderstood "sequenced genomes" in their statement to imply assembled genomes.
mikazon will still need to clarify.
There are two separate statements in the original post that are inconsistent.
It does not make sense to do this for every short read.
Thank you for your replies. Sorry if my original post did not make sense. Hopefully, I can clarify here. I am downloading reads such as this one; I am downloading these in fastq format, and I'm hoping to compute genetic dissimilarity between reads. So, with n reads, I should have n(n-1)/2 dissimilarity measures. I'm hoping that these dissimilarity measures are analogous to genetic dissimilarities between individuals from which these reads were obtained.
I changed my language for clarity. Thanks for pointing out the confusion!
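To make the n(n-1)/2 idea concrete, here is a minimal pure-Python sketch of a Mash-style distance between read sets, where each downloaded run is treated as one sample. It is only an illustration, not how Mash or Dashing2 are implemented: the function names, the blake2b hash, and the default k and sketch size are assumptions chosen for the example, and it does no filtering of low-count k-mers, which matters for raw reads because sequencing errors create many spurious k-mers. The distance formula D = -ln(2j/(1+j))/k is the one described for Mash.

```python
import hashlib
import math
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical (strand-independent) k-mers of one read."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # skip k-mers containing ambiguous bases
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        yield min(kmer, revcomp)


def sketch(reads, k=21, size=1000):
    """Bottom-`size` MinHash sketch of all reads from one sample."""
    hashes = set()
    for read in reads:
        for kmer in canonical_kmers(read, k):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
            hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:size]


def mash_distance(sketch_a, sketch_b, k=21, size=1000):
    """Estimate Jaccard from the merged sketch, then convert to a distance."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    union = sorted(set_a | set_b)[:size]
    if not union:
        return 1.0
    shared = sum(1 for h in union if h in set_a and h in set_b)
    j = shared / len(union)
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def pairwise_distances(samples, k=21, size=1000):
    """Return {(name_a, name_b): distance} for every unordered pair."""
    sketches = {name: sketch(reads, k, size) for name, reads in samples.items()}
    return {
        (a, b): mash_distance(sketches[a], sketches[b], k, size)
        for a, b in combinations(sorted(sketches), 2)
    }
```

With n samples this returns exactly n(n-1)/2 entries, one per unordered pair. For a couple hundred real runs, a dedicated tool will be far faster than this sketch.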
To my knowledge, I can only download the reads in SRA format or FASTQ format from the SRA. However, if you know how I can download fasta files, I would greatly appreciate that.
For this purpose it doesn't matter much whether you get fasta or fastq from SRA. When he said "Use FASTAs" he meant to use assemblies instead of reads.
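To illustrate why the file format makes little difference here: a FASTQ record carries the same sequence as the corresponding FASTA record, plus a quality line. A minimal sketch of the conversion, assuming standard four-line FASTQ records with unwrapped sequences (the function names are made up for this example):

```python
def fastq_records(lines):
    """Yield (name, sequence) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = next(it, None)
        plus = next(it, None)   # separator line, starts with '+'
        qual = next(it, None)   # quality string, dropped in FASTA
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        yield header[1:], seq.strip()


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The output is still a set of reads, not an assembly, so converting the format does not change what is being compared.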