Remove rows with duplicate SNP identifiers in first column
2.4 years ago
bsp017 ▴ 50

I have a large database of SNPs (>53 million). Each SNP has an identifier in column 1; the remaining 483 columns contain genotype data (0, 1, -1). Some SNPs are duplicated in the database. I have tried to remove the duplicates but can't seem to weed them out with common Unix commands such as:

awk '{a[NR]=$0; a[NR,"k"]=$1; k[$1]++} END {for (i=1; i<=NR; i++) if (k[a[i,"k"]] > 1) print a[i]}'

or

awk '!a[$1]++'

The SNP identifiers look like this: Contig0_50, so the awk commands seem to find non-exact duplicates, e.g. Contig0_500.

Can someone suggest how to remove duplicate identifiers from the database? I.e., if identifier 1 is exactly the same as identifier 2, remove the entire row containing identifier 2, so that the resulting database has only unique identifiers.

Thanks,

James

grep unix awk • 1.2k views

What file format is your database in?


It's a flat text file


This is odd. In particular, the second awk command is the textbook example of how to deduplicate a file on its first column, and I also can't reproduce this bug with the example SNP names you have given. So I don't think the similar identifiers are the problem.

Rather, I fear that the file might be too big for your memory: the first command essentially writes the whole file into the array a, and even the second keeps every identifier it has already seen. Given that you say it is a file of 53 million records and 484 columns (likely tens of gigabytes of text), awk or your memory might just be unable to handle this.

You could try sort -u db.txt > outdb.txt, but that will only remove fully duplicated lines, not rows that merely share an ID, and it might also run into memory issues.
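If it is only the ID that has to be unique and you are on GNU sort, restricting the comparison key to the first field might already do the trick (just a sketch, not tested on a file of this size; the output file name is only an example):

# with -u and a key, GNU sort keeps the first row for each distinct field-1 value
sort -u -k1,1 db.txt > outdb_uniqueids.txt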

Other than that, you would need to find a way to remove duplicates without keeping the whole file in memory. Something like this could work:

sort -k 1 db.txt > dbsorted.txt
awk '{a[NR]=$1; delete a[NR-2]}; {if (a[NR-1]!=$1){print $0}}' dbsorted.txt > dbdedup.txt

This will only compare the current identifier to the identifier on the previous line, keeping no more than two identifiers in memory at any given time, and it obviously requires the input to be properly sorted.
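If you prefer, the same idea can be written with a single variable holding the previous line's ID instead of the rolling array (again just a sketch, not tested on your data):

# print a row only when its ID differs from the ID of the previous (sorted) row
awk 'NR == 1 || $1 != prev { print } { prev = $1 }' dbsorted.txt > dbdedup.txt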

I haven't thoroughly tested it, though, so please at least also run the cross-check with if (a[NR-1]==$1) to print only the duplicates...


That fixes it:

wc -l dbsorted.txt
53650725 dbsorted.txt

wc -l dbdedup.txt
51456321 dbdedup.txt

awk '{a[NR]=$1; delete a[NR-2]}; {if (a[NR-1]==$1){print $0}}' dbsorted.txt > onlydedups.txt
wc -l onlydedups.txt 
2194404 onlydedups.txt

Thanks!


You're welcome!

PS: To print out duplicates, including the first occurrences:

sort -k 1 db.txt > dbsorted.txt
awk '{a[NR]=$1; b[NR]=$0; delete a[NR-2]; delete b[NR-2]}; {if (a[NR-1]==$1){print b[NR-1]"\n"$0}}' dbsorted.txt > allduplicatedentries.txt
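And as a quick sanity check (assuming whitespace-separated columns; the output file name is made up), you could count how many rows each duplicated ID contributes:

# list every ID that occurs more than once in the sorted file, with its row count
awk '{print $1}' dbsorted.txt | uniq -c | awk '$1 > 1' > duplicated_id_counts.txt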