Hi everyone,
Could I please ask two questions, both related to SNPs?
Question 1:
I have .bim/.bed files for a set of SNPs. I want to remove SNPs from my data set that have a >2% Mendelian error rate.
I tried:
./plink --bfile IN_FILE --me 0.02 0.02 --noweb --make-bed --out X.
But the output tells me 0 SNPs were removed. I am working with 500K, 100K and 50K arrays, and I would have expected at least one SNP to be removed. I tried increasing and decreasing the --me 0.02 0.02 parameters, and nothing is ever removed.
Could someone tell me the correct command to remove SNPs from a data set that have >2% Mendelian error rate?
Question 2:
I have a set of SNPs (it's an Affymetrix 100K). How do I tell whether a SNP is on the plus or minus strand?
For example, I have a set of SNPs. The information I have for each SNP is:
Chr,Pos,Submitter_snp_name,Ss#,Rs#,Genome_build_id,ALLELE1_genome_orien,ALLELE2_genome_orien,ALLELE1_orig_assay_orien,ALLELE2_orig_assay_orien,QC_TYPE,SNP_flank_sequence,SOURCE,Ss2rs_orientation,Rs2genome_orienation,Orien_flipped_assay_to_genome
This is an example of a SNP (let's call it SNPx):
12,744051,SNP_A287197,ss7481221,rs31368,36.2,G,A,C,T,A,TCGGCCTGCAGTCCTCC[A/G]CTCTCAGGTTTGCAC,HuGeneFocused,+,-,y.
And I have a set of genes and their location in the genome, for example:
EntrezID Chr # GeneStart GeneEnd Strand
1 12 744000 9067900 minus
4 12 744001 130887675 plus
111 3 123282296 123449077 minus
142 1 226360691 226408100 minus
185 3 148697871 148743003 plus
This is entrez ID, chromosome number, their start and end position in the genome, and whether they are on the plus/minus strand.
I need to find out whether SNPx lies within any of these genes.
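A minimal sketch of a purely positional check, assuming the SNP position and the gene coordinates refer to the same genome build and that gene start/end are inclusive. The names Gene and genes_containing are made up for illustration, and strand is carried along but not used in the test:

```python
from typing import NamedTuple


class Gene(NamedTuple):
    entrez_id: int
    chrom: str
    start: int
    end: int
    strand: str


# Gene rows copied from the table above.
GENES = [
    Gene(1, "12", 744000, 9067900, "minus"),
    Gene(4, "12", 744001, 130887675, "plus"),
    Gene(111, "3", 123282296, 123449077, "minus"),
    Gene(142, "1", 226360691, 226408100, "minus"),
    Gene(185, "3", 148697871, 148743003, "plus"),
]


def genes_containing(chrom, pos, genes=GENES):
    """Return the genes whose inclusive [start, end] interval contains pos on chrom.

    Assumption: membership is decided by chromosome and position only;
    the gene's strand is ignored.
    """
    return [g for g in genes if g.chrom == chrom and g.start <= pos <= g.end]


# SNPx from the example row: chromosome 12, position 744051.
hits = genes_containing("12", 744051)
print([(g.entrez_id, g.strand) for g in hits])  # [(1, 'minus'), (4, 'plus')]
```

By position alone, SNPx falls inside both Entrez gene 1 and Entrez gene 4, which is where the strand question comes in.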
I am confused about how to tell whether the SNP is on the plus or minus strand.
(1) In this case, would SNPx belong to Entrez gene ID 1 or 4? They are both on the same chromosome in about the same position; one is just on the plus strand and the other on the minus strand.
(2) Do I need to account for ss-to-rs orientation and rs-to-genome orientation? If so, how should I do this?
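One thing that can be checked directly from the example row: the assay-orientation alleles (C, T) are the complements of the genome-orientation alleles (G, A), and Orien_flipped_assay_to_genome is y. A small sketch of that check follows, assuming (this is an interpretation of the column name, not something documented here) that y means the assay alleles must be complemented to get genome orientation. The function name is made up for illustration:

```python
# Base-pair complements for the four nucleotides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_genome_orientation(allele1, allele2, flipped):
    """Complement both assay alleles when the flipped flag is 'y'.

    Assumption: flipped == 'y' means the assay was designed on the strand
    opposite to the genome reference; any other value leaves alleles as-is.
    """
    if flipped.strip().lower() == "y":
        return COMPLEMENT[allele1], COMPLEMENT[allele2]
    return allele1, allele2


# SNPx: assay alleles C/T, flag 'y'; the row lists genome alleles G/A.
print(to_genome_orientation("C", "T", "y"))  # ('G', 'A')
```

For SNPx the result matches the ALLELE1_genome_orien and ALLELE2_genome_orien columns, so at least this row is consistent with that reading of the flag.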
Thanks
Aoife
How many samples are in your bed file?
Well, I'm using three data sets (50K, 100K and 500K). After other quality filtering measures, there are 35,000 SNPs in the 50K, 90,000 in the 100K and 300,000 in the 500K data set, for about 1,000 people.
Thanks.
Two quick questions:
(i) Are you certain that Mendelian errors weren't filtered out before this dataset got to you?
(ii) How many families/trios does PLINK report being present?
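For (ii), PLINK's log is the authoritative answer. A rough independent check is to read the .fam file that sits alongside the .bed/.bim files. This sketch assumes the standard six-column .fam layout (family ID, individual ID, father ID, mother ID, sex, phenotype) with 0 meaning "parent unknown". count_trios is a made-up helper and the sample rows are invented for illustration:

```python
def count_trios(fam_lines):
    """Count offspring whose father and mother are both listed in the file.

    Returns (number_of_trios, number_of_families_with_at_least_one_trio).
    """
    rows = [line.split() for line in fam_lines if line.strip()]
    present = {(r[0], r[1]) for r in rows}  # (family ID, individual ID)
    trios = 0
    families = set()
    for fid, iid, pat, mat, *_ in rows:
        both_named = pat != "0" and mat != "0"
        if both_named and (fid, pat) in present and (fid, mat) in present:
            trios += 1
            families.add(fid)
    return trios, len(families)


# Invented example: one complete trio in F1, one unrelated person in F2.
example = [
    "F1 dad 0 0 1 1",
    "F1 mum 0 0 2 1",
    "F1 kid dad mum 1 2",
    "F2 solo 0 0 2 1",
]
print(count_trios(example))  # (1, 1)
```

To run it on the real data, pass an open file handle, for example open("IN_FILE.fam"). If every father and mother ID is 0, there are no parent-offspring relationships for a Mendelian error check to work with, which would explain 0 SNPs being removed.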