Hi,
I hope this is the right forum to ask you for some advice on how to analyse SNP data.
I have analysed the SNPs in 30 different strains of the same species. For each strain, I recorded the SNP count in four different genomic regions (coding, non-coding, etc.). The data are stored in a file with three columns: $1 = SNP count; $2 = strain; $3 = region. "SNP" is thus a numeric variable, while "strain" and "region" are categorical factors.
How would you advise analysing these data statistically? I plotted the percentage of SNPs in each region per strain, but this obviously doesn't take into account the total number of SNPs in each strain. I could fit a glm(SNP~strain+region), but the model coefficients would depend on which level of each factor is chosen as the "reference".
I am grateful for any constructive advice :)
I don't see what question you want answered. Do you want to know whether some strains/regions have more SNPs than others?
yes, exactly.
I would suggest that you first control for coverage, to see whether it may be biasing your results. Then you could compare the proportions of coding and non-coding SNPs (or any other "category") between strains using Fisher's exact test.
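To make the second step concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table (two strains by coding/non-coding), written in plain Python so it runs without extra packages. The function name, the strain labels and the counts are all made up for illustration; they are not taken from your data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are two strains, columns are two SNP categories
    (e.g. coding / non-coding). All margins are treated as fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is at most as likely as the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts, for illustration only:
#            coding  non-coding
# strain_A     120      380
# strain_B      90      410
p = fisher_exact_two_sided(120, 380, 90, 410)
print(f"two-sided p-value: {p:.4g}")
```

With 30 strains there are many possible pairwise comparisons, so the resulting p-values would need a multiple-testing correction. If all four regions are to be compared at once, a test on the full strain-by-region contingency table may be a better fit than repeated 2x2 tests.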