I am new to bioinformatics, and I have some new SNP data from an Affymetrix Axiom array. I have the genotypes exported into a giant tab-delimited text file where each row is a SNP, starting with the rsID, and each column is a sample.
Due to a quirk of the Axiom Human Origins array, there are ~4000 SNPs that were genotyped twice for each sample. The Affymetrix genotyping console for whatever reason does not merge the genotypes for these probes, so each of these SNPs shows up twice in my data. Furthermore, the array designers fear these SNPs may actually be triallelic, which is one more reason I would rather not deal with them (ftp://ftp.cephb.fr/hgdp_supp10/8_12_2011_Technical_Array_Design_Document.pdf).
I have this big table of genotyping data. Can someone show me a template Python (or maybe Perl) script I can use to filter out the ~8000 lines that contain one of the offending rsIDs? I have a basic grasp of these languages, but I don't know how to do this on my own. Thanks!
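For reference, here is a minimal Python sketch of the kind of filter being asked for. It is only a template under several assumptions that are not stated in the post: the offending rsIDs are available in a separate plain-text file with one rsID per line, the rsID is the first tab-delimited field of each row, and any header line does not start with one of those rsIDs. The function names and the file names in the usage comment are hypothetical.

```python
def load_ids(ids_path):
    """Read one rsID per line into a set, skipping blank lines."""
    with open(ids_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_table(table_path, ids_path, out_path):
    """Copy table_path to out_path, dropping rows whose first field is a listed rsID.

    Returns a (kept, dropped) tuple of line counts.
    """
    bad_ids = load_ids(ids_path)
    kept = dropped = 0
    with open(table_path) as src, open(out_path, "w") as dst:
        # Stream line by line so the whole table never has to fit in memory.
        for line in src:
            # Assumption: the rsID is the first tab-delimited field.
            first_field = line.split("\t", 1)[0].strip()
            if first_field in bad_ids:
                dropped += 1
            else:
                dst.write(line)
                kept += 1
    return kept, dropped


# Hypothetical usage (file names are placeholders):
# kept, dropped = filter_table("genotypes.txt", "duplicate_rsids.txt", "genotypes.filtered.txt")
# print(kept, "lines kept;", dropped, "lines dropped")
```

Holding the rsIDs in a set keeps each lookup fast, so the run time is dominated by reading the table once. Checking that the dropped count comes out near the expected ~8000 is a quick sanity test that the ID list and the table use the same identifiers.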
Do you have access to a Linux-based OS?
I have a Bio-Linux (Ubuntu) VirtualBox that I run through Windows.
And can you show us what the first few lines look like?
Here are the first few lines. Later columns have been deleted to make it easier to read (there are 92 samples in the original).