I have a large database of SNPs (>53 million). Each SNP has an identifier in column 1. The remaining 483 columns contain genotype data (0, 1, -1). Some SNPs are duplicated in the database. I have tried to remove the duplicates but can't seem to weed them out with common Unix commands such as:
awk '{a[NR]=$0; a[NR,"k"]=$1; k[$1]++} END {for (i=1; i<=NR; i++) if (k[a[i,"k"]] > 1) print a[i]}'
awk '!a[$1]++'
The SNP identifiers look like this: Contig0_50, so the awk commands also find non-exact duplicates, e.g. Contig0_500.
Can someone suggest how to remove duplicate identifiers from the database? That is, if identifier 1 is exactly the same as identifier 2, remove the entire row that contains identifier 2. The resulting database should only have unique identifiers.
What file format is your database in?
It's a flat text file
This is odd. In particular, the second command would be the textbook example of how to deduplicate a file, and I also can't reproduce this bug with the example SNP names you have given. So I don't think that the similar identifiers are the problem.
Rather, I fear that the file might be too big for your memory: the first command writes the whole file into the array a, and even the second one keeps every distinct identifier there. Given that you say it is a file of 53 million records and 484 columns, awk or your memory might just be unable to handle this.
You could try to use
sort -u db.txt > outdb.txt
, but that will only remove fully duplicated lines, not lines that merely share an ID, and it might also run into memory issues.
Other than that, you would need to find a way to remove duplicates without keeping the whole file in memory. This could work:
This will only compare the current identifier to the identifier in the previous line, without keeping more than 2 identifiers in memory at any given time. It therefore requires the input to be sorted by identifier, so that duplicates end up on adjacent lines.
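The adjacent-line comparison described above can be sketched roughly as follows. This is only a minimal sketch, not a tested solution for the real data: it assumes a whitespace-delimited file without a header line, and the toy records, the temporary directory and the file names (sample.txt, dedup.txt) are made up for illustration. The sort step is there because the comparison only works when equal identifiers sit on adjacent lines.

```shell
# Hypothetical toy input: column 1 is the SNP identifier, the rest are genotypes.
workdir=$(mktemp -d)
printf '%s\n' \
  'Contig1_7 -1 0 0' \
  'Contig0_50 0 1 -1' \
  'Contig0_500 0 0 1' \
  'Contig0_50 1 1 0' > "$workdir/sample.txt"

# Sort on column 1 only, so that equal identifiers become neighbours.
# Then print a line only if its identifier differs from the previous line's.
# awk holds a single identifier (prev) in memory, not the whole file.
LC_ALL=C sort -k1,1 "$workdir/sample.txt" |
  awk 'NR == 1 || $1 != prev { print } { prev = $1 }' > "$workdir/dedup.txt"

cat "$workdir/dedup.txt"
```

On the toy input this keeps one row each for Contig0_50, Contig0_500 and Contig1_7, so Contig0_50 and Contig0_500 are treated as different identifiers. Which of several rows with the same identifier survives depends on the sort order, so if that matters the input order would need to be preserved some other way.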
I haven't thoroughly tested it, though, so please at least also run a cross-check with
if (a[NR-1]==$1)
to print only the duplicates...
That fixes it:
You're welcome!
PS: To print out duplicates, including the first occurrences:
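A minimal sketch of one way to do this in a single pass, again assuming whitespace-delimited input that is already sorted by identifier. The toy records, the file names (sorted.txt, dups.txt) and the awk variable names (prev_id, prev_line, shown) are made up for illustration:

```shell
# Hypothetical toy input, already sorted on column 1.
workdir=$(mktemp -d)
printf '%s\n' \
  'Contig0_50 0 1 -1' \
  'Contig0_50 1 1 0' \
  'Contig0_500 0 0 1' \
  'Contig1_7 -1 0 0' \
  'Contig1_7 0 0 0' \
  'Contig1_7 1 1 1' > "$workdir/sorted.txt"

awk '
  # Same identifier as the previous line: this line is a duplicate.
  NR > 1 && $1 == prev_id {
      if (!shown) { print prev_line; shown = 1 }  # emit the first occurrence once
      print
  }
  # A new identifier starts: reset the flag for the next group.
  NR == 1 || $1 != prev_id { shown = 0 }
  # Remember only the previous line and its identifier.
  { prev_id = $1; prev_line = $0 }
' "$workdir/sorted.txt" > "$workdir/dups.txt"

cat "$workdir/dups.txt"
```

On the toy input this prints both Contig0_50 rows and all three Contig1_7 rows, and skips Contig0_500, which occurs only once.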