Hi,
I am very new to genomics, and to DNA clustering specifically. I have around 100 samples, each with millions of DNA reads.
I used a sequence clustering method to build clusters for a single sample. But I think clustering the millions of DNA reads of a single sample does not make much sense, even though the clusters look fine.
I want to find a way to cluster all 100 samples together instead of going sample by sample. For that, I need to select some reads from each sample rather than clustering all the billions of reads. I need some way to intelligently select a subset of reads from each sample, such that the subset represents the whole sample. Once I have a subset that fairly represents each sample, I will be able to cluster all the samples together.
Any thoughts on how to tackle this problem?
One solution I thought of is using PCA. Is there any other genomics-specific approach to tackle this problem?
Try to map the reads first to the genome. Then focus on differences (mutations or variants), and cluster based on these differences.
Let's say I map my reads from two samples to a reference genome (either exact or approximate mapping). Assume that 60% of the reads from one sample (Sample-A) map to the reference while 40% do not. Are you suggesting that I take these 40% unmatched reads, do the same for Sample-B, and then cluster the unmatched reads from both samples? Please correct me if my understanding is wrong.
My suggestion is to map your reads to a reference genome, and then call the variants. Use those variants for further clustering analysis. I am talking about human genome sequencing, but I am not sure if you are talking about that too. Please explain more in your question about what kind of data you are working with, and what your research question is.
Yes, I am working with the human genome (as a reference). I have DNA reads from, let's say, two samples (two different individuals), and my goal is to cluster the DNA reads from those two samples. Ideally, I should get two clusters using k-means or any other algorithm. The first challenge is that reads are mostly (99%) similar across all human beings. The second challenge is that the number of reads for each individual is huge.
It doesn't make sense to cluster your raw reads. Why don't you want to focus on the variants like I suggested (and as the rest of the world does)? Please explain why.
If you did whole-genome sequencing, I think the approach that Benn is suggesting is the best way: map the reads against a reference, call the SNPs, and compare the SNPs. I am not sure, but I don't think making a subsample is needed (or recommended) in this case.
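To make "compare the SNPs" a bit more concrete, here is a minimal sketch in plain Python. It assumes the variants have already been called and reduced to alternate-allele counts (0, 1 or 2) per sample at a shared set of sites. The genotype values, the sample names, and the helper functions `genotype_distance`, `average_linkage` and `cluster_samples` are all made up for illustration; a real analysis would read a VCF and would more likely use an established toolkit.

```python
from itertools import combinations

# Hypothetical toy genotypes: alternate-allele counts (0, 1 or 2) per sample
# at the same five variant sites. Real values would come from a VCF file.
genotypes = {
    "sample_A": [0, 1, 2, 0, 0],
    "sample_B": [0, 1, 2, 0, 1],
    "sample_C": [2, 0, 0, 1, 2],
    "sample_D": [2, 0, 0, 2, 2],
}

def genotype_distance(a, b):
    """Mean absolute difference in alternate-allele count across sites."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def average_linkage(left, right, genotypes):
    """Average pairwise distance between the samples of two clusters."""
    dists = [
        genotype_distance(genotypes[s], genotypes[t])
        for s in left
        for t in right
    ]
    return sum(dists) / len(dists)

def cluster_samples(genotypes, n_clusters):
    """Agglomerative clustering: merge the closest clusters until n_clusters remain."""
    clusters = [[name] for name in sorted(genotypes)]
    while len(clusters) > n_clusters:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], genotypes),
        )
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(sorted(left + right))
    return sorted(clusters)

print(cluster_samples(genotypes, 2))
# [['sample_A', 'sample_B'], ['sample_C', 'sample_D']]
```

Note that the unit being clustered here is the sample (one row of genotypes per individual), not the individual read, which is why no subsampling of reads is needed.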
If you did amplicon sequencing, meaning you only sequenced the genes of interest with specific primers, maybe you can check out this page: https://drive5.com/usearch/manual/pipe_otus.html. This pipeline can also be run with VSEARCH. It was originally created to find OTUs, but maybe it can help you too.
If you have a specific goal, you may need to build a pipeline yourself using existing clustering tools.
Because of this I assume you did amplicon sequencing. If you cluster reads coming from the whole genome, you will get a very large number of clusters.
Making a subsample is also called rarefying. There are many discussions about this, and you can probably find enough tools to do it.
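As a rough illustration of what such a tool does, the sketch below draws a fixed number of reads uniformly at random from a FASTQ stream in a single pass (reservoir sampling), so the whole file never has to sit in memory. The function names `read_fastq` and `subsample_reads`, the toy reads, and the assumption of a simple four-lines-per-record FASTQ layout are mine, not taken from any particular tool.

```python
import io
import random

def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def subsample_reads(records, n, seed=0):
    """Reservoir sampling: keep n records chosen uniformly, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir

# Toy input: ten made-up reads standing in for a real FASTQ file.
toy_fastq = "".join(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n" for i in range(10))
all_reads = list(read_fastq(io.StringIO(toy_fastq)))
subset = subsample_reads(read_fastq(io.StringIO(toy_fastq)), n=3, seed=42)
print([header for header, _, _ in subset])
```

Drawing the same `n` from every sample puts all samples at equal depth. Which reads are kept depends only on the seed and the record positions, so the same seed gives the same subset on every run.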
To say which clustering method would be best, we need more information about the goal. But, for example, for OTU clustering with USEARCH or VSEARCH you can add a "sample identifier" to the reads, then cluster, and afterwards separate the clusters per sample again with the help of an OTU table.
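The bookkeeping behind that idea can be sketched in a few lines of Python. This is only an illustration: the label format `read1;sample=S1`, the helper names `sample_of` and `build_otu_table`, and the toy read-to-OTU assignments are assumptions on my side. In practice the assignments would come from the output of the clustering tool.

```python
from collections import Counter, defaultdict

def sample_of(label):
    """Extract the sample name from a label such as 'read7;sample=S1'."""
    for field in label.split(";"):
        if field.startswith("sample="):
            return field[len("sample="):]
    raise ValueError(f"no sample annotation in {label!r}")

def build_otu_table(assignments):
    """Count reads per OTU and per sample from (read label, OTU id) pairs."""
    table = defaultdict(Counter)
    for label, otu in assignments:
        table[otu][sample_of(label)] += 1
    return table

# Hypothetical clustering result: reads from two samples pooled before clustering.
assignments = [
    ("read1;sample=S1", "OTU_1"),
    ("read2;sample=S1", "OTU_1"),
    ("read3;sample=S2", "OTU_1"),
    ("read4;sample=S2", "OTU_2"),
    ("read5;sample=S2", "OTU_2"),
]

otu_table = build_otu_table(assignments)
samples = sorted({sample_of(label) for label, _ in assignments})

# Print a tab-separated table: one row per OTU, one column per sample.
print("OTU", *samples, sep="\t")
for otu in sorted(otu_table):
    print(otu, *(otu_table[otu][s] for s in samples), sep="\t")
```

Because the sample name travels with every read label, all samples can be clustered in one run and still be told apart afterwards.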
Also, PCA is a dimension reduction method; there are other methods specific to clustering (partitioning clustering, hierarchical clustering, etc.), along with supporting techniques to help choose the right method and the right number of clusters (Hopkins statistic, elbow method, etc.). As @Benn already indicated, focusing the analysis on a subset of genes that carry biological meaning related to your study would be the way to go. hth
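For what it's worth, here is a small sketch of the elbow method in plain Python. It runs a basic k-means for several values of k and reports the within-cluster sum of squares (WCSS); the "elbow" is the k after which the WCSS stops dropping sharply. The toy 2-D points, the function name `kmeans_wcss`, and the farthest-first initialisation are illustrative choices of mine. Real data would be per-sample feature vectors (for example, the first few principal components), and a library implementation would normally be preferred.

```python
import math

def kmeans_wcss(points, k, iterations=20):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Deterministic farthest-first initialisation (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 3))
```

On this toy data the WCSS collapses between k=1 and k=2 and changes little afterwards, which is the pattern the elbow method looks for.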