Large tab-delimited table of SNP data, how to filter out some rows based on rsIDs?
2
0
9.5 years ago
devenvyas ▴ 760

I am new to bioinformatics, and I have some new SNP data from an Affymetrix Axiom array. I have the genotypes exported into a giant tab-delimited .txt table where each row is a SNP, starting with the probe ID, position, and rsID, and each column after that is a sample.

Due to a quirk of the Axiom Human Origins array, there are ~4000 SNPs that were genotyped twice for each sample. The Affymetrix genotyping console, for whatever reason, does not merge the genotypes for these probes, so these genotypes show up twice in my data. Furthermore, the array designers fear these SNPs may actually be triallelic, which makes me even less inclined to keep them (ftp://ftp.cephb.fr/hgdp_supp10/8_12_2011_Technical_Array_Design_Document.pdf).

I have this big table of genotyping data. Can someone show me a template Python (or maybe Perl) script I can use to filter out the ~8000 lines that contain one of the offending rsIDs? I have a basic grasp of these languages, but I don't know how to do this on my own. Thanks!

rsid python SNP • 4.3k views
ADD COMMENT
0

Do you have access to a Linux-based OS?

ADD REPLY
0

I have a Bio-Linux (Ubuntu) VirtualBox that I run through Windows.

ADD REPLY
0

And can you show us what the first few lines look like?

ADD REPLY
0

Here are the first few lines. Later columns have been deleted to make it easier to read (there are 92 samples originally).

Probe Set ID    Chromosome    Chromosomal Position    dbSNP RS ID    Sample1    Sample2    Sample3    Sample4    Sample5    Sample6    Sample7    Sample8    Sample9    Sample10    Sample11
AFFX-KIT-000001    9    101258881    rs1000440    TC    TC    TC    TT    TC    CC    TC    CC    TC    CC    CC
AFFX-KIT-000002    4    164934874    rs10007601    AG    AG    AG    AG    AG    AA    AA    AA    AG    AA    AA
AFFX-KIT-000003    5    163542505    rs10056215    TT    TT    TG    TT    TT    TT    TT    TG    TT    TT    TT
AFFX-KIT-000004    5    2993645    rs10075407    GG    AA    AG    AA    AA    AA    AG    AG    AG    GG    AG
AFFX-KIT-000008    2    149188375    rs10196277    TG    TT    GG    GG    TG    TG    TT    TT    TT    GG    TG
AFFX-KIT-000009    12    51789617    rs1021996    CC    CC    TC    TC    CC    CC    CC    CC    CC    TC    CC
AFFX-KIT-000012    7    152738841    rs10266230    TT    TC    TT    TC    CC    TC    TC    TT    TC    CC    TT
AFFX-KIT-000014    13    27573612    rs10507375    GG    GG    TG    GG    GG    GG    GG    GG    GG    GG    GG
AFFX-KIT-000015    17    44878268    rs10514911    CC    TC    CC    CC    CC    CC    CC    CC    CC    CC    CC
AFFX-KIT-000016    5    168199301    rs10516050    CC    TT    TC    CC    CC    CC    CC    CC    CC    CC    TC
AFFX-KIT-000017    12    57489709    rs1059513    TC    TC    TT    TT    TC    TT    TT    TT    TT    TT    TT
AFFX-KIT-000018    10    129853731    rs10741141    AG    GG    AA    AA    AA    AG    GG    GG    AG    AA    AG
AFFX-KIT-000019    12    61604963    rs10784186    AC    AA    AC    AA    AA    AA    AA    AC    AC    ---    CC
AFFX-KIT-000021    4    28895078    rs10805281    GG    AA    GG    AA    AA    AG    AG    AG    AA    AA    AA
AFFX-KIT-000022    10    130069931    rs10829369    TC    TC    TC    TC    TT    TC    TC    TT    TT    CC    CC
ADD REPLY
4
9.5 years ago
rbagnall ★ 1.8k

Try this:

1. Get a list of duplicated rsIDs

cut -f 4 affydata.file | sort | uniq -d > duplicate_rsID.txt

2. Remove lines with duplicate rsIDs

sort -k4,4 affydata.file | grep -vwFf duplicate_rsID.txt - > uniqueAffyLines.txt
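
If you'd rather do this in Python, as the question asked, here is a rough sketch of the same two steps in one script. The file names and the assumption that the rsID sits in the 4th tab-separated column come from the sample lines above; unlike the sort/grep pipeline, it keeps the original row order, including the header.

from collections import Counter

infile = "affydata.txt"              # assumed input name, as in the commands above
outfile = "uniqueAffyLines.txt"

# Pass 1: count how often each rsID (4th column) occurs.
counts = Counter()
with open(infile) as table:
    next(table)                      # skip the header row
    for line in table:
        counts[line.split("\t")[3]] += 1

# Pass 2: write the header plus every row whose rsID occurs exactly once.
with open(infile) as table, open(outfile, "w") as out:
    out.write(next(table))
    for line in table:
        if counts[line.split("\t")[3]] == 1:
            out.write(line)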
ADD COMMENT
1

Awesome, I think that works!

ADD REPLY
0

I have been trying this on another file, and I've noticed something weird.

I am trying to make a .map and .ped file, so I am starting with a tped-like formatted file.

For the column with the rsIDs (which is now the 2nd column), the cells in the header rows are normally empty, so I populated them with a, b, c, d, e, f to keep them from being marked as duplicates. I've noticed that those headers are getting sorted out of alphabetical order, and all the cells labeled '00' for missing data in the genotype columns are turning into '0'. Is there any way to stop this?

ADD REPLY
0

Could you give a few lines of the file, like you did before?

all the cells labeled '00' for missing data in the genotype columns are turning into '0'.

What are you using to edit the files: R, Excel?

ADD REPLY
0

It appears my reply to this from Saturday got deleted or did not save.

I was looking at the files in Terminal and Excel, and I noticed the 00s became 0s and the header rows were sorting out of order. My solution was to run the script to remove duplicates first, and then put the headers and 00s in afterwards. That seemed to work fine, so I ended up deleting the weird files.

ADD REPLY
0
9.5 years ago

If the lines are complete duplicates, you can filter them out with awk:

$ awk '{
    a[$0]++;                 # count occurrences of this exact line
    if (a[$0] == 1) {        # print a line only the first time it is seen
      print $0;
    }
  }' table.txt > nonredundant_table.txt

Otherwise, can you describe how you decide what is an aberrant row?

ADD COMMENT
0

I was avoiding bash since the file covers ~600,000 SNPs and 92 samples; I figured it would be too big.

The lines won't be complete duplicates. I have ~4000 SNPs where two independent probes genotype the same locus. This means there are ~8000 rows in the table that need to be ejected on the basis of the rsID: if a row's rsID is in the list, that row should be omitted from the table.
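
In other words, something like this minimal Python sketch (file names are placeholders; the rsID is assumed to still be the 4th column, and only the ~4000 offending rsIDs are held in memory, not the whole table):

with open("offending_rsids.txt") as f:          # hypothetical list, one rsID per line
    bad = set(line.strip() for line in f if line.strip())

with open("affydata.txt") as table, open("filtered.txt", "w") as out:
    out.write(next(table))                      # keep the header row
    for line in table:
        if line.split("\t")[3] not in bad:      # drop rows whose rsID is on the list
            out.write(line)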

ADD REPLY
0

The memory overhead is mostly limited to keys in the hash table. If you change $0 to instead store a specific field ($1, $2, etc.) or some combination of fields as a key in the hash table, you can greatly reduce the memory required to strip duplicates.

Failing that, if you can sort the file on the field or fields that are duplicated, then you can reduce the memory overhead to storing two lines, and simply comparing the current line with the previous line.

If the fields match, do nothing; if they do not match, print the current line and store it as the previous-line value. Then repeat on the next line, and so on.

The advantage of a hash table is that you can walk through the file once, in whatever order that is provided. So it is fast. But the memory overhead can be significant.

The advantage of the sorting approach is that you only need enough memory to store two lines (or the fields from two lines). But sorting the data costs time.
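
In Python terms, the two variants might look like this (just a sketch: file names are made up, and the rsID is assumed to be the 4th tab-separated field):

# Variant 1: hash keyed on one field; one pass, memory holds only the keys seen.
seen = set()
with open("table.txt") as table:
    for line in table:
        key = line.split("\t")[3]        # rsID column
        if key not in seen:              # keep the first occurrence of each key
            seen.add(key)
            print(line, end="")

# Variant 2: input pre-sorted on that field; memory holds only the previous key.
prev = None
with open("sorted_table.txt") as table:
    for line in table:
        key = line.split("\t")[3]
        if key != prev:                  # key changed: new record, so print it
            print(line, end="")
        prev = key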

ADD REPLY
0

Hi, I've found the same thing: independent probes for the same locus, often with totally different genotyping results. Did you see that too, and how did you deal with it? I was thinking about removing all such duplicates.

ADD REPLY
