I have a large database of SNPs (>53 million). Each SNP has an identifier in column 1, and the remaining 483 columns contain genotype data (0, 1, -1). Some SNPs are duplicated in the database. I have tried to remove the duplicates but can't seem to weed them out with common Unix commands such as:
awk '{a[NR]=$0; a[NR,"k"]=$1; k[$1]++} END {for (i=1; i<=NR; i++) if (k[a[i,"k"]] > 1) print a[i]}'
or
awk '!a[$1]++'
The SNP identifiers look like this: Contig0_50. The awk commands seem to match identifiers that are similar but not exact duplicates, e.g. Contig0_500.
Can someone suggest how to remove duplicate identifiers from the database? I.e., if identifier 1 is exactly the same as identifier 2, remove the entire row containing identifier 2, so that the resulting database has only unique identifiers.
Thanks,
James
What file format is your database in?
It's a flat text file
This is odd. In particular, the second awk command is the textbook example of how to deduplicate a file, and I also can't reproduce this bug with the example SNP names you have given. So I don't think the similar identifiers are the problem. Rather, I fear the file might simply be too big for your memory, because what you are essentially doing here is writing the whole file into the array a. Given that you say it is a file of 53 million records and 484 columns, awk or your memory might just be unable to handle this.
You could try
sort -u db.txt > outdb.txt
but that will only remove fully duplicated lines, not rows whose identifier alone is duplicated, and it might also run into memory issues.
Other than that, you would need a way to remove duplicates without keeping the whole file in memory. Something like this could work:
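A minimal sketch of what I mean (only lightly tested; it assumes the file has already been sorted on column 1 so that duplicate identifiers sit on adjacent lines, and sorted_db.txt / outdb.txt are placeholder file names):

awk '{
    if (a[NR-1] != $1) print   # identifier differs from the previous line: keep this row
    a[NR] = $1                 # remember the current identifier
    delete a[NR-1]             # forget the previous one, so at most two IDs are held at once
}' sorted_db.txt > outdb.txt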
This only compares the current identifier to the identifier in the previous line, keeping no more than two identifiers in memory at any given time, but it obviously requires properly sorted input.
I haven't thoroughly tested it, though, so please at least also run the cross-check with
if (a[NR-1]==$1)
to print only the duplicates...

That fixes it:
Thanks!
You're welcome!
PS: To print out duplicates, including the first occurrences:
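One way to do it, again assuming input sorted on the identifier column (a sketch, so test it on a small sample first):

sort -k1,1 db.txt | awk '
    $1 == prev {                                # same identifier as the previous line
        if (buf != "") { print buf; buf = "" }  # emit the buffered first occurrence once
        print                                   # emit the current duplicate
        next
    }
    { prev = $1; buf = $0 }                     # new identifier: buffer it in case duplicates follow
'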