Question

removing duplicate SNPs (same position) with lowest call rate

8.8 years ago

jani.p.heikkinen • 0

I am trying to solve a problem with my genotyped array data set. For one reason or another, the data set contains duplicate SNPs: two or three different names pointing to the same position. For example:

index  SNP    pos        A1  A2  F_MISS
2046   snp_1  113890304  C   T   0
2047   snp_2  113890304  C   T   0.000422
2048   snp_3  113890304  C   T   0

I want to build a list of SNP names to remove (so I can exclude them in PLINK).

So from the SNPs above, snp_2 (the lowest call rate) and either snp_1 or snp_3 should be on the removal list, leaving one SNP at that position.

How would I achieve this?

Ram · Answer 1 · 2016-01-25

8.8 years ago

TriS ★ 4.7k

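If you just want to remove duplicates in R (not tested):

# Key on position only: the SNP names differ, so name + position is never duplicated
position <- mySNPmatrix[, 3]
# Logical indexing is safe even when there are no duplicates (-which() would drop every row)
mySNPmatrix <- mySNPmatrix[!duplicated(position), ]

However, F_MISS differs between the duplicated rows, and duplicated() keeps the first row it sees, so sort by F_MISS first if you want to keep the copy with the best call rate.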

Ram · Answer 2 · 2016-01-25

8.8 years ago

christopher medway ▴ 460

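Try this bash one-liner (not tested). It skips the header line, sorts by position and then by F_MISS, and prints the name of every SNP after the first at each position, so output.txt is the removal list.

tail -n +2 input.txt | sort -k3,3n -k6,6g | awk 'seen[$3]++ {print $2}' > output.txt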
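
To check the idea against the three example rows from the question, here is a minimal sketch. It assumes GNU sort (for the -g general-numeric key, which also copes with F_MISS values in scientific notation), whitespace-separated columns in the order shown in the question, and a single chromosome. If the file covers several chromosomes, the key would need to include a chromosome column as well. The scratch directory and file names are only for illustration.

```shell
# Sample data from the question, written to a scratch directory (illustrative path).
workdir=$(mktemp -d)
cat > "$workdir/input.txt" <<'EOF'
index  SNP    pos        A1  A2  F_MISS
2046   snp_1  113890304  C   T   0
2047   snp_2  113890304  C   T   0.000422
2048   snp_3  113890304  C   T   0
EOF

# 1. tail drops the header line.
# 2. sort orders rows by position (field 3), then by F_MISS (field 6) ascending,
#    so the best-called SNP comes first within each position; -s keeps ties
#    in their input order.
# 3. awk prints the name (field 2) of every row after the first at a position.
tail -n +2 "$workdir/input.txt" \
  | LC_ALL=C sort -s -k3,3n -k6,6g \
  | awk 'seen[$3]++ { print $2 }' > "$workdir/remove.txt"

cat "$workdir/remove.txt"
```

Here snp_1 is kept because it is the first of the two SNPs with F_MISS 0, and snp_3 and snp_2 end up in the removal list. A file like this, with one SNP name per line, is the format PLINK's --exclude option expects.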
