Hi,
I'm running a big SNP database (160GB vcf file) I pruned with plink;
plink --vcf SNPs_clean1.vcf --allow-extra-chr --indep-pairwise 50 10 0.1 --out SNP_50_10_01
It starts running producing the temporary files and the output files prune.in and prune.out). In the log file it shows having filtered the variants, 49595998 of 57089576 variants removed according to log file attached below. When I count the number of lines of the prune.in file, it does match the number of variants removed.
The problem is that when I open the file it's just a bunch of dots (one dot per line). The same happens for the prune.out file. I've looked for similar errors in this page but I've only seen blank prune.in and prune.out files so I hope this isn't a repeated question.
Thanks in advance for your help, Ana
Log file:
PLINK v1.90b5.3 64-bit (21 Feb 2018)
Options in effect:
--allow-extra-chr
--indep-pairwise 50 10 0.1
--out SNP_50_10_01
--vcf SNPs_camsud_clean1.vcf
Hostname: XXXXXX
Working directory: /home/aagapito/POTW/mapeo_vicugna/cardiff
Start time: Tue Nov 13 08:23:34 2018
Random number seed: 1542108214
257764 MB RAM detected; reserving 128882 MB for main workspace.
--vcf: SNP_50_10_01-temporary.bed + SNP_50_10_01-temporary.bim +
SNP_50_10_01-temporary.fam written.
57089576 variants loaded from .bim file.
56 people (0 males, 0 females, 56 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SNP_50_10_01.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 56 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.976739.
57089576 variants and 56 people pass filters and QC.
Note: No phenotypes present.
Pruned 532 variants from chromosome 27, leaving 67.
Pruned 438334 variants from chromosome 28, leaving 68336.
Pruned 3957 variants from chromosome 29, leaving 364.
Pruned 281964 variants from chromosome 30, leaving 41240.
Pruned 247860 variants from chromosome 31, leaving 38534.
Pruned 201756 variants from chromosome 32, leaving 29178.
(the list goes on...)
**Pruning complete. 49595998 of 57089576 variants removed.**
Marker lists written to SNP_50_10_01.prune.in and SNP_50_10_01.prune.out .
End time: Tue Nov 13 08:32:21 2018
I was thinking about this problem few days ago and realized it could probably be solved with a simple
awk
script. After a little bit of playing around and short testing, I found this oneliner to do the trick:However I tested it only briefly, so use it with caution. If you find some issue with this oneliner, please let me know, either here or by commenting under the gist on github.
The oneliner works like this:
BEGIN{OFS="\t"}
= output will be tab-delimited!/#/
= skip comment lines containing#
(i.e. print them unmodified){sub(/\./, $1"_"$2, $3)}
= substitute.
in third column ($3
, i.e. the ID column) with combination of$1
and$2
(i.e. CHROM and POS), separated by "_"1
= This condition (always true) prints all lines, modified or not.FYI: Both the original script and this oneliner assume that your variants have unique positions, i.e. you don't have the same CHROM & POS combo on more than one row (this could be violated e.g. after decomposing complex variants). If your VCF does have this issue, the scripts would need to be updated (one way would be to merge not just CHROM & POS columns, but to include also e.g. ALT column, so the ID would be CHROM_POS_ALT). Again, if you need that, let me know and I can show how to do that (if it's not clear from the code itself).