Remove duplicates in an extremely large text file
5.8 years ago
OAJn8634 ▴ 60

I have a very large text file (9.2 GB), named All_SNPs.txt, that contains data in two columns. I would like to use this file to convert the chr:pos IDs in my plink files to rs IDs. The file looks like this:

1:116342    rs1000277323
1:173516    rs1000447106
1:168592    rs1000479828
1:102498    rs1000493007

However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I tried the following commands:

awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

sort  All_SNPs.txt | uniq -u > All_SNPs_nodup.txt

cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt

However, each time I ran into one of two problems: 1) the command makes no difference and I get the same file back (this is for the commands that use cat), or 2) I get an error: the procedures halted: have exceeded (this is for sort | uniq, and awk).

I will be very grateful for any ideas of how I can make this work. Thank you very much.

PS, this file is far too large to open it in R.
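
A possible explanation for the first symptom, sketched on toy data: uniq -u compares whole lines, so two rows that share a chr:pos but carry different rs IDs do not count as duplicates, and the file comes back unchanged. The file toy.txt and its three rows are made up for illustration. Cutting out column 1 before uniq -d lists the keys that occur more than once.

```shell
# Toy data (hypothetical): 1:100 appears twice with different rs IDs.
workdir=$(mktemp -d)
printf '1:100\trsA\n1:100\trsB\n2:200\trsC\n' > "$workdir/toy.txt"

# Whole-line comparison: every line is distinct, so nothing is removed.
kept=$(sort "$workdir/toy.txt" | uniq -u | wc -l | tr -d ' ')
echo "lines kept by sort | uniq -u: $kept"

# Key-only comparison: list each chr:pos that occurs more than once.
dup_keys=$(cut -f 1 "$workdir/toy.txt" | LC_ALL=C sort | uniq -d)
echo "duplicated keys: $dup_keys"
```

On the real file the same cut | sort | uniq -d pipeline would only need to sort column 1, which is smaller than the full rows.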

awk uniq plink SNP • 7.8k views

You could try converting the input file to a valid VCF and then use bcftools sort and bcftools norm -N -d none to remove the duplicates. At the end you can convert back to the input format.


Using datamash:

datamash -sg 1 unique 2 < test.txt

datamash is available from the brew, conda, and apt repositories.

Using tsv-utils:

tsv-uniq --ignore-case -H -f 1 test.txt

Also try a hash table, such as a Python dict. I have no idea how much memory it would require, but you can try it if nothing else works.

5.8 years ago
bruce.moran ▴ 970

Use cut and uniq to find the duplicates, then remove them with grep -v. Relatively quick on a 389 MB dummy file.

time sort -V All_SNPs.txt | cut -f 1 | uniq -c | perl -ane 'if ($F[0] ne "1") { print "$F[1]\t$F[0]\n" }' > All_SNPs.dup.chr-pos.txt

real    0m57.428s
user    3m19.091s
sys     0m3.152s

time cut -f 1 All_SNPs.dup.chr-pos.txt | grep -wvf - All_SNPs.txt  > All_SNPs.nodup.txt

real    0m2.516s
user    0m1.697s
sys     0m0.229s
5.8 years ago
Shred ★ 1.5k

Split the text file into smaller ones using the split command. You could split by size, for example:

split -b 200m filename

This will produce files named xaa, xab, xac, and so on. Now use awk, but with a simpler syntax:

awk -F"\t" '!seen[$1]++' xa*

After that, join the files into the destination file with a simple cat.


How big is the chance that two dups end up in different files?


You can use awk to split your file according to the chromosomes:

awk '{ split($1, a, ":"); print $1 "\t" $2 >> (a[1] ".txt") }' All_SNPs.txt

That ensures two duplicates can never end up in different files. If you would like to reduce the file size a bit, you can drop the chromosome prefix (chr:) from the first column:

awk '{ split($1, a, ":"); print a[2] "\t" $2 >> (a[1] ".txt") }' All_SNPs.txt

This will also allow you to keep a bit more items in the hash.
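
The whole split, dedupe, and join round trip could be sketched as follows. This is only a sketch under assumptions: the function name dedup_by_chromosome, the .part suffix, and the scratch directory are invented for illustration, and the first row seen for each chr:pos is the one that is kept.

```shell
# Sketch: split by chromosome, dedupe each piece with awk, then concatenate.
# Function name, .part suffix and directory layout are illustrative assumptions.
dedup_by_chromosome() {
    # $1 = input file, $2 = output file, $3 = empty scratch directory
    awk -v dir="$3" '{ split($1, a, ":"); print > (dir "/" a[1] ".part") }' "$1"
    : > "$2"
    for part in "$3"/*.part; do
        # One awk process per chromosome, so the hash only holds that chromosome's keys
        awk '!seen[$1]++' "$part" >> "$2"
        rm -f "$part"
    done
}

# Toy run on made-up rows
chrdir=$(mktemp -d)
printf '1:100\trsA\n2:200\trsB\n1:100\trsC\n2:300\trsD\n' > "$chrdir/toy.txt"
mkdir "$chrdir/parts"
dedup_by_chromosome "$chrdir/toy.txt" "$chrdir/toy_nodup.txt" "$chrdir/parts"
cat "$chrdir/toy_nodup.txt"
```

The output is grouped by chromosome in the shell's glob order (1, 10, 11, 2, ...), not in the original row order, which should not matter for a lookup table.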


Shit happens. But if a file is too large for RAM, there's no way in bash to map it into memory. Maybe a solution would be in Python, as explained here


Yeah the link is worth a try, or here another awk solution. Can't test it myself, though.


If you sort first, I guess the chance is very small, no?


Sort loads file into memory too.
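
For what it is worth, GNU sort does not have to hold the whole file in RAM: it sorts chunks that fit its buffer, writes them to temporary files, and merges them. A minimal sketch, assuming GNU sort; the function name dedup_by_sort, the 1G buffer, and the paths are illustrative choices, and which of the duplicated rows survives should not be relied on.

```shell
# Sketch: keep one row per chr:pos with a disk-backed sort.
# Assumes GNU sort; function name, buffer size and paths are illustrative.
dedup_by_sort() {
    # $1 = input file, $2 = output file, $3 = directory with free space for temp files
    # -S caps the in-memory buffer, -T says where sorted chunks are spilled,
    # -k1,1 -u keeps a single line for each distinct value of column 1.
    LC_ALL=C sort -S 1G -T "$3" -k1,1 -u "$1" > "$2"
}

# Toy run on made-up rows
sortdir=$(mktemp -d)
printf '1:116342\trs1\n1:173516\trs2\n1:116342\trs3\n' > "$sortdir/toy.txt"
dedup_by_sort "$sortdir/toy.txt" "$sortdir/toy_nodup.txt" "$sortdir"
cat "$sortdir/toy_nodup.txt"
```

For the 9.2 GB file the temp directory needs roughly as much free disk space as the input itself.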


Do you actually have this as a PLINK dataset? Why not try to use PLINK functionality to update the map file? For example, --list-duplicate-vars lists duplicates, which can then be excluded


This has worked. Thank you very much for the suggestion!

5.8 years ago
Benn 8.3k

Did you try:

awk '!seen[$1]++' All_SNPs.txt > All_SNPs_nodup.txt

If it doesn't work I think you need better hardware...


OP tried: awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt @ b.nota


Thank you for your suggestion. I have tried this command a few times but unfortunately I get an error: Cannot allocate memory

5.8 years ago

PLINK has a basic mechanism to deal with duplicates:

--list-duplicate-vars <require-same-ref> <ids-only> <suppress-first>

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

Then the duplicated vars can be excluded using --exclude plink.dupvar
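
Put together, the two steps might look like the sketch below. It assumes PLINK 1.9 and a binary fileset; the function name plink_drop_dups and the prefixes in the usage line are placeholders. The ids-only modifier is used so that plink.dupvar contains bare variant IDs that --exclude can read. One caveat to check against your data: --exclude matches by ID, so if the duplicated variants share the same ID (as chr:pos IDs would), every variant with that ID is removed, not just the extra copies.

```shell
# Sketch only: the prefixes are placeholders for a PLINK 1.9 binary fileset.
plink_drop_dups() {
    # $1 = input fileset prefix, $2 = output fileset prefix
    # Step 1: write the IDs of duplicated variants to "$1.dupvar".
    #         suppress-first leaves the first member of each group off the list.
    # Step 2: write a new fileset without the listed IDs.
    plink --bfile "$1" --list-duplicate-vars ids-only suppress-first --out "$1" &&
    plink --bfile "$1" --exclude "$1.dupvar" --make-bed --out "$2"
}

# Usage (requires plink on PATH):
#   plink_drop_dups mydata mydata_nodup
```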
