I'm just starting out in bioinformatics and I'm quite enjoying it, but my knowledge of biology is very limited.
I have 700 samples genotyped for 90 SNPs, and I would like to build a dendrogram so that I can divide my data into clusters.
However, most programs have a limit of 500 samples. Are you aware of any programs with a higher limit?
If you are able to format your data into PHYLIP format, I have found that the viewer called Archaeopteryx (the ATV successor, which is based on the forester library) is more than capable of dealing with hundreds or thousands of samples. I haven't tested it further, but the developers claim that it is the most powerful approach for phylogenetic representation, and although I'm not a phylogenetics expert, I have tested it alongside a few others and none has performed as well as this one.
Here are the references for further reading:
Han M.V. and Zmasek C.M. (2009). phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics, 10:356.
Zmasek C.M. and Eddy S.R. (2001). ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17, 383-384.
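For the PHYLIP formatting step mentioned above, here is a minimal sketch in plain Python. It assumes each sample's genotype is already a string with one base per SNP, and it writes the strict sequential layout (a header with the number of samples and sites, then a 10-character name field followed by the sequence). The function name `write_phylip` and the sample names are made up for illustration.

```python
def write_phylip(genotypes):
    """Return a strict sequential PHYLIP alignment for a dict of name -> SNP string."""
    if not genotypes:
        raise ValueError("no samples given")
    lengths = {len(seq) for seq in genotypes.values()}
    if len(lengths) != 1:
        raise ValueError("all samples must have the same number of SNPs")
    n_sites = lengths.pop()

    # Header line: number of samples, then number of sites.
    lines = [f"{len(genotypes)} {n_sites}"]
    for name, seq in genotypes.items():
        # Strict PHYLIP reserves exactly 10 characters for the name.
        # Names longer than 10 characters are truncated here, so make sure
        # they stay unique after truncation.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical toy data: 3 samples, 5 SNPs each.
genotypes = {"sample001": "ACGTA", "sample002": "ACGTT", "sample003": "TCGTA"}
phylip_text = write_phylip(genotypes)
print(phylip_text)
```

Nothing in this step limits the number of samples, so 700 samples by 90 SNPs should not be a problem for the file itself. Whether a given downstream program accepts it is a separate matter.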
It is not clear from the question whether the SNPs reside on a contiguous sequence, in which case one could try ClustalW and similar tools, or are spread throughout a genome. This question needs revision...
I've tried ClustalW and it doesn't work, because it has a limit of 500 sequences.
I only have a set of SNPs, and that's what we are analyzing, which is why only those SNPs are aligned. It's a shorter sequence.
First off, you'll have to specify what species you mean.
Assuming you mean strain of mouse, your first and best bet is to know this ahead of time. You should be aware that model organisms can be inbred (meaning two or more of the same strain are expected to have identical or near-identical genomes, depending on the degree of inbreeding) or of a mixed background. Many experiments are done with mixed-background animals, particularly those where a mutation on one strain is being crossed onto another strain with some useful feature (e.g. it will activate the mutation in a particular organ). If you are sure the strains are inbred, but you don't know what strains they are, you are still in a near-hopeless situation. If you know the samples are one of X strains, where X is some suitably small number such as two, you may have a shot. Various SNPs are informative for strain differences, and two places to get started are the Jackson Labs database for SNP variation (http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF) and the Sanger center mouse genome project (http://www.sanger.ac.uk/Projects/M_musculus/).
I entered this response to answer a question that was titled "Reference strains, how to identify strains?". That question has disappeared, and it got appended to this question.
Yes, it might have been a mistake on my part to merge the two; I'm not quite sure. They were clearly two very similar and closely related questions by the same person.
It is not quite clear from your question, but I assume that what you are talking about is genotyping of a particular species of bacteria. Since they have been genotyped for 90 SNPs, my guess would be that these SNPs were not picked at random. Most likely, it is a set of SNPs that is commonly used for distinguishing between different strains of the bacterium in question.
My guess is thus that what you should really look for is a genotyping database of strains of the particular species that you are working on. Most likely, many more strains have been genotyped than have been fully sequenced. If you find such a database, the analysis that you talk about would be a simple matter of comparing the SNPs from your samples to the reference samples in the database.
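To make the comparison step concrete, here is a rough sketch in plain Python. It assumes that both your samples and the database entries are available as strings with one base per SNP, in the same SNP order, with `N` marking a missing call. The reference names and profiles below are invented placeholders, not real strain data.

```python
def mismatches(a, b):
    """Count SNP positions where two profiles differ, skipping missing calls ('N')."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same SNPs")
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


def rank_references(sample, references):
    """Return (mismatch count, reference name) pairs, closest reference first."""
    return sorted((mismatches(sample, profile), name)
                  for name, profile in references.items())


# Hypothetical reference genotypes (placeholders for database entries).
references = {
    "ref_strain_A": "ACGTAC",
    "ref_strain_B": "ACGTTT",
    "ref_strain_C": "TTGTAC",
}
sample = "ACGTTT"

ranking = rank_references(sample, references)
for count, name in ranking:
    print(f"{name}\t{count} mismatches")
```

With real data you would probably also want to report ties and the number of SNPs that could actually be compared, since many missing calls would make a low mismatch count less meaningful.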
Unless you tell us which species it is you are working on, I don't think we will be able to help you much further.
Sorry, I didn't mention it. But yes, you're right. My reference is Mycobacterium H37Rv, and the SNPs were chosen for different reasons, such as drug resistance, etc.
There are some fully sequenced strains, such as Bovis and F11. Since genome sizes differ, a given position in H37Rv does not necessarily correspond to the same position in Bovis, for instance.
What is your dendrogram about? Which software have you tested?
ClustalW has a limit, as do PHYLIP and T-Coffee. I've read some papers and I'm thinking of trying MEGA, which is quite widely used. The dendrogram will show the distances between the samples, so that I could also see some clusters.
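As a rough illustration of the distances and clusters described here, the sketch below uses only the Python standard library. It assumes each sample is a string with one base per SNP, uses the number of differing SNPs (Hamming distance) as the distance, and groups samples that are linked by chains of pairs differing at no more than `max_distance` SNPs. That should give the same groups as cutting a single-linkage dendrogram at that height, but it does not draw the tree. The sample names, data, and threshold are made up.

```python
from itertools import combinations


def hamming(a, b):
    """Number of SNP positions at which two equal-length genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(genotypes, max_distance):
    """Group samples connected by chains of pairs within max_distance of each other."""
    names = list(genotypes)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if hamming(genotypes[a], genotypes[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical toy data: 5 samples, 8 SNPs each.
genotypes = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTGAACGT",
    "s4": "TTGAACGA",
    "s5": "GGGGGGGG",
}
clusters = single_linkage_clusters(genotypes, max_distance=1)
print(clusters)
```

For 700 samples this is about 245,000 pairwise comparisons of 90 SNPs each, which plain Python should handle without any 500-sample limit. For an actual drawn dendrogram, the same distances could be passed to a hierarchical clustering library.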
Would you like to split your question into two separate questions for better visibility and better answers?
OK, I will make two questions.
Sorry - the new question got flagged as a duplicate and I merged the two, only now seeing your comment here.
Lars, I suggested to Patricia that she split up the questions, because I thought the dendrogram of SNPs and the strain identification are different contexts and would get better attention if posted as separate questions.