2.3 years ago
juntkym
Hi there, is there a good way to sort a VCF file by a value in the INFO column (say, the INFO/CADD score)? Thanks
You can use the following:
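bcftools query -f '%CHROM %POS %REF %ALT %INFO/CADD\n' yourFile.vcf | sort -k5,5gr > sorted_by_cadd.txt
Use the -f option to specify the fields you want to keep, then sort on the column containing the CADD score (field 5 in this example, hence -k5,5). The g modifier compares the values as general numbers and the r modifier reverses the order, so the highest scores come first. Note that the output is a plain space-separated table rather than a valid VCF, because the header and the remaining columns are dropped. You can alter the command to suit your needs.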
I wrote https://lindenb.github.io/jvarkit/SortVcfOnInfo.html
$ curl "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" |\
java -jar dist/sortvcfoninfo.jar -F BaseQRankSum | grep -vE "^#"
chr10 1142208 . T C 3404.30 . AC=8;AF=1.00;AN=8;
chr10 135336656 . G A 38.34 . AC=4;AF=1.00;AN=4;
chr10 52004315 . T C 40.11 . AC=4;AF=1.00;AN=4;
chr10 52497529 . G C 33.61 . AC=4;AF=1.00;AN=4;
chr10 126678092 . G A 89.08 . AC=1;AF=0.13;AN=8;BaseQRankSum=-3.120;
chr16 72057435 . C T 572.98 . AC=1;AF=0.13;AN=8;BaseQRankSum=-2.270;
chr10 48003992 . C T 1047.87 . AC=4;AF=0.50;AN=8;BaseQRankSum=-0.053;
chr10 135210791 . T C 65.41 . AC=4;AF=0.50;AN=8;BaseQRankSum=2.054;
chr10 135369532 . T C 122.62 . AC=2;AF=0.25;AN=8;BaseQRankSum=2.118;
Thank you very much for your quick reply! Sorry for not being clear: I wanted to keep all the other column values in the output and keep it in VCF format. Your answer inspired me to write the code below, but if there is a cleaner way, please let me know. Anyway, you helped me a lot. Thanks again!
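One way to keep every column and the header might look like the following sketch. It is an illustration under stated assumptions, not necessarily the code the poster used: the function name sort_vcf_by_info is hypothetical, the input is assumed to be an uncompressed, tab-delimited VCF with the INFO field in column 8, and records that lack the key are placed at the end.

```shell
# Hypothetical helper: sort VCF records by a numeric INFO key (descending),
# keeping the header and every column. Assumes an uncompressed VCF.
sort_vcf_by_info() {
    key="$1"   # INFO key to sort on, e.g. CADD
    vcf="$2"   # path to the VCF file

    # Header lines (starting with '#') pass through unchanged.
    awk '/^#/' "$vcf"

    # Prefix each record with a "missing" flag and the INFO value,
    # sort on those two prefix columns, then strip them again.
    awk -F'\t' -v key="$key" '!/^#/ {
        missing = 1; val = 0
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++) {
            if (index(kv[i], key "=") == 1) {
                val = substr(kv[i], length(key) + 2)
                missing = 0
            }
        }
        print missing "\t" val "\t" $0
    }' "$vcf" |
        sort -s -t "$(printf '\t')" -k1,1n -k2,2gr |
        cut -f3-
}

# Example call (file names are placeholders):
#   sort_vcf_by_info CADD yourFile.vcf > sorted_by_cadd.vcf
```

The result keeps the VCF layout but is no longer in coordinate order, so tools that expect a position-sorted file (indexing, for example) may reject it.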
Why do you want to sort your VCF? This seems like an unusual thing to do; you almost always want it sorted by genomic coordinate.
Yes, I agree it’s unusual. The sorted VCF will be both inspected manually by medical geneticists, to identify causal variants of rare diseases, and fed to my colleagues’ automatic variant interpretation program.
If someone else is looking at it, it is perhaps better to use bcftools query to pull out the relevant values and then sort those, rather than sorting the whole VCF. But I don't know what your colleagues' program does, so that may not work with it.