I have a very large text file (All_SNPs.txt, 9.2 GB) containing data in two columns. I would like to use this file to convert the chr:pos variant IDs in my plink files to rs IDs. The file looks like this:
1:116342 rs1000277323
1:173516 rs1000447106
1:168592 rs1000479828
1:102498 rs1000493007
However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I attempted the following commands:
awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt
sort All_SNPs.txt | uniq -u > All_SNPs_nodup.txt
cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt
However, each time I ran into one of two problems: 1) the command made no difference and I got the same file back (this happened with the commands that use cat), or 2) the command halted with an error saying a limit had been exceeded (this happened with sort | uniq and with awk).
I would be very grateful for any ideas on how to make this work. Thank you very much.
PS: this file is far too large to open in R.
You could try converting the input file to a valid VCF, then use bcftools sort and bcftools norm -N -d none to remove the duplicates, and at the end convert back to the input format.
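A minimal sketch of that approach; the awk conversion and the placeholder REF/ALT alleles are assumptions here, since the two-column file carries no allele information:

# 1) Build a minimal VCF from the two-column file (dummy REF=N, missing ALT)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  awk -F'[ :]' '{ print $1"\t"$2"\t"$3"\tN\t.\t.\t.\t." }' All_SNPs.txt
} > All_SNPs.vcf

# 2) Sort, then keep only the first record for each duplicated position
bcftools sort All_SNPs.vcf -Oz -o All_SNPs.sorted.vcf.gz
bcftools norm -N -d none All_SNPs.sorted.vcf.gz -Oz -o All_SNPs.nodup.vcf.gz

# 3) Convert back to the original chr:pos / rsID format
bcftools query -f '%CHROM:%POS %ID\n' All_SNPs.nodup.vcf.gz > All_SNPs_nodup.txt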
Using datamash, which is available in the brew, conda, and apt repos.
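A minimal sketch, assuming GNU datamash's rmdup operation and whitespace-separated fields; rmdup keeps the keys it has already seen in memory, so memory use grows with the number of distinct chr:pos values:

# -W treats runs of whitespace as the field separator; rmdup 1 keeps only the
# first line seen for each distinct value in column 1
datamash -W rmdup 1 < All_SNPs.txt > All_SNPs_nodup.txt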
Using tsv-utils:
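A sketch using tsv-uniq from tsv-utils, which keeps the first line for each distinct key; the -d ' ' delimiter is an assumption here because the file is space-separated and tsv-utils defaults to TAB:

# -f 1 uses column 1 (chr:pos) as the key; -d ' ' sets the field delimiter
tsv-uniq -f 1 -d ' ' All_SNPs.txt > All_SNPs_nodup.txt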
You could also try a hash table, like a Python dict. No idea how much memory it would require, but you can try it if nothing else works.
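A minimal sketch of that idea (a set is used here instead of a dict, but it is the same hash-table approach; memory use grows with the number of distinct chr:pos keys):

# Keep the first line seen for each chr:pos key, streaming through the file once
seen = set()
with open("All_SNPs.txt") as infile, open("All_SNPs_nodup.txt", "w") as outfile:
    for line in infile:
        parts = line.split(maxsplit=1)
        if not parts:          # skip blank lines, if any
            continue
        key = parts[0]         # column 1: chr:pos
        if key not in seen:
            seen.add(key)
            outfile.write(line)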