7.7 years ago
mms140130
Hi,
I have genotype data for 905460 SNPs in 1096 patients. Using the "HardyWeinberg" R package, I removed SNPs with a minor allele frequency (MAF) below 0.05, which reduced the data to 746907 SNPs. I then applied the HWExact test with alpha = 0.001, which reduced the data to 384660 SNPs.
Is this OK, or was too much data lost?
I'm doing a GWAS analysis. The data was provided by my advisor and is about breast cancer.
Please help me and recommend what I should do.
Thanks,
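For readers following along, here is a minimal sketch of the two per-SNP filters described in the question, written in plain Python rather than R. It is not the code of the "HardyWeinberg" package. The exact test below follows the recurrence usually attributed to Wigginton et al. (2005), and the function names, the genotype-count inputs, and the default thresholds (taken from the question) are illustrative assumptions.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from genotype counts (AA, AB, BB) at one biallelic SNP."""
    n_alleles = 2 * (n_aa + n_ab + n_bb)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    freq_a = (2 * n_aa + n_ab) / n_alleles
    return min(freq_a, 1.0 - freq_a)


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value: sum of the probabilities of all
    heterozygote counts that are no more likely than the observed one."""
    hom_common = max(n_aa, n_bb)
    hom_rare = min(n_aa, n_bb)
    n = n_ab + hom_common + hom_rare
    if n == 0:
        return 1.0
    rare = 2 * hom_rare + n_ab  # number of copies of the rarer allele

    # Unnormalised probabilities, indexed by heterozygote count.
    probs = [0.0] * (rare + 1)
    # Start near the expected heterozygote count, with matching parity.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0
    total = 1.0

    # Walk down towards fewer heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets > 1:
        probs[hets - 2] = (probs[hets] * hets * (hets - 1)
                           / (4.0 * (homr + 1) * (homc + 1)))
        total += probs[hets - 2]
        hets, homr, homc = hets - 2, homr + 1, homc + 1

    # Walk up towards more heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets <= rare - 2:
        probs[hets + 2] = (probs[hets] * 4.0 * homr * homc
                           / ((hets + 2) * (hets + 1)))
        total += probs[hets + 2]
        hets, homr, homc = hets + 2, homr - 1, homc - 1

    target = probs[n_ab]
    p_value = sum(p for p in probs if p <= target) / total
    return min(1.0, p_value)


def passes_qc(n_aa, n_ab, n_bb, maf_min=0.05, hwe_alpha=0.001):
    """Keep a SNP only if it passes both the MAF and the HWE filter."""
    return (minor_allele_frequency(n_aa, n_ab, n_bb) >= maf_min
            and hwe_exact_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha)
```

For example, `passes_qc(25, 50, 25)` keeps a SNP whose genotype counts match Hardy-Weinberg proportions, while `passes_qc(50, 0, 50)` drops one with no heterozygotes at all.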
Generally, I wouldn't focus too much on the final number of SNPs; in the end, one true-positive SNP is worth more than 100 SNPs that still need to be confirmed. Keep in mind that this is cancer, so random mutations can happen at any stage. Moreover, it makes a difference whether you used SNP arrays to get these SNPs or SNP-calling methods. The type of outcome and the quality control to apply depend heavily on that, and so does how much data you should expect to lose. Also, how many samples did you have, control vs. tumor? Did you have relatives among your samples? If you are doing a GWAS analysis, you probably know the importance of taking these parameters into account.
As @Hasani pointed out, you shouldn't focus on the final number, because on its own it tells you very little. For example, what happens if you divide it by the number of bases in the organism's genome? Do you get a high mutation rate or one in the expected range?
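To put the raw counts in context, here is a small sketch of the arithmetic, using only the numbers from the question. The genome size of about 3.1 billion bases is my assumption for a human genome, and the helper names are made up for illustration.

```python
# SNP counts reported in the question, in filtering order.
N_RAW = 905460
N_AFTER_MAF = 746907
N_AFTER_HWE = 384660
HWE_ALPHA = 0.001

# Assumption: approximate size of the human genome in base pairs.
HUMAN_GENOME_BP = 3.1e9


def removed_fraction(before, after):
    """Fraction of SNPs dropped by one filtering step."""
    return (before - after) / before


def snps_per_kb(n_snps, genome_bp=HUMAN_GENOME_BP):
    """Average number of SNPs per kilobase of genome."""
    return n_snps / (genome_bp / 1000.0)


def expected_null_rejections(n_tested, alpha):
    """Rough upper bound on HWE rejections expected by chance alone,
    if every tested SNP were truly in Hardy-Weinberg equilibrium."""
    return n_tested * alpha


if __name__ == "__main__":
    print(f"removed by MAF filter: {removed_fraction(N_RAW, N_AFTER_MAF):.1%}")
    print(f"removed by HWE filter: {removed_fraction(N_AFTER_MAF, N_AFTER_HWE):.1%}")
    print(f"HWE rejections expected by chance: "
          f"{expected_null_rejections(N_AFTER_MAF, HWE_ALPHA):.0f}")
    print(f"HWE rejections observed: {N_AFTER_MAF - N_AFTER_HWE}")
    print(f"SNPs per kb after filtering: {snps_per_kb(N_AFTER_HWE):.3f}")
```

With these numbers the MAF step removes roughly 17.5% of the SNPs and the HWE step roughly 48.5% of what remains. Chance alone at alpha = 0.001 would account for only about 747 rejections, so the size of the HWE loss probably deserves a closer look before the final count is accepted.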
Other discussion points are:
I recommend that you choose a more descriptive title. "help and recommendation please" doesn't tell anyone what your thread is about, and it might just be ignored because it's not very specific.
Also, your post doesn't contain enough information. You didn't state how the data was obtained or what the aim of your analysis is.
Done. I have changed the title and added some information.
I don't think @wouter wanted to know who gave you the data, but rather where it comes from. For example: do you expect a lot of SNPs in a breast cancer sample? (I would, but I'm not a breast cancer expert, so this is just my guess.)
My question also concerned the technology used to obtain the data. In addition, if this is a dataset on cancer, you should also know whether these mutations are germline mutations (e.g. from blood) or somatic mutations (from the tumor itself) and, if the latter, what tumor purity you expect.
For GWAS, in my experience, errors in sequencing (including systematic ones) are far less important than errors in understanding your population structure and inheritance pattern upfront, and taking them into account. If your population consists of, say, "normal mothers" and their "abnormal children" from ten different ethnic groups, and the phenotype is recessive, then running a straightforward PLINK GWAS will find nothing, since the allele is present in most mothers as well. Even if you do take this into account but ignore the ethnicities, you will most likely miss the right mutation, or its p-value will not be low enough, because in different ethnicities different mutations, each stable within its population due to some compensatory positive effect, can cause the same phenotype. Overall, ending up with half as many SNPs because of overly stringent filtering can cause more harm. I would focus on filtering only after preliminary analyses have yielded a few SNPs that can be confirmed or at least explained, and only if my original ratio of SNPs per kilobase is far too high compared to what is expected (since this might raise questions during the review of your paper).
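One way population structure interacts with the HWE filter from the question: when subpopulations with different allele frequencies are pooled, the pooled sample shows fewer heterozygotes than Hardy-Weinberg proportions predict, even if every subpopulation is in equilibrium on its own (the Wahlund effect). Here is a minimal sketch. The allele frequencies and group sizes are hypothetical, and whether this applies to the dataset in question is unknown.

```python
def pooled_heterozygosity(allele_freqs, weights):
    """Compare observed and HWE-expected heterozygosity in a pooled sample.

    allele_freqs: frequency of one allele in each subpopulation, where each
                  subpopulation is assumed to be in HWE on its own.
    weights:      relative sample size of each subpopulation.
    Returns (observed_het, expected_het) for the pooled sample.
    """
    if len(allele_freqs) != len(weights) or not weights:
        raise ValueError("need one weight per subpopulation")
    total = float(sum(weights))
    # Heterozygote frequency actually present in the pooled sample.
    observed_het = sum(w * 2.0 * p * (1.0 - p)
                       for p, w in zip(allele_freqs, weights)) / total
    # Heterozygote frequency HWE would predict from the pooled allele frequency.
    pooled_p = sum(w * p for p, w in zip(allele_freqs, weights)) / total
    expected_het = 2.0 * pooled_p * (1.0 - pooled_p)
    return observed_het, expected_het


if __name__ == "__main__":
    # Hypothetical example: two equally sized groups, allele frequency 0.2 vs 0.8.
    observed, expected = pooled_heterozygosity([0.2, 0.8], [1, 1])
    print(f"observed heterozygosity: {observed:.2f}")
    print(f"expected under HWE:      {expected:.2f}")
```

In this made-up example the pooled sample has 32% heterozygotes where HWE would predict 50%, so an HWE test on the pooled data could reject a SNP that is perfectly fine within each group.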
Thank you all. The issue is that I'm new to genetics and I'm trying to understand how to analyze such data.
Great. Welcome to bioinformatics! And thank you for an interesting question that might help others too. You can accept one of the answers to show others that your question is solved. You can also bookmark any answer, or your question, to find it among your bookmarks for future reference.
Since all reactions here were posted as comments, they can't be accepted as the resolving answer to this question. I'm not sure which comment here would be a satisfying answer that could then be moved and accepted...