Simple count of dbSNP (rs...) from .vcf file
1
2
Entering edit mode
6.7 years ago
oars ▴ 200

I have a .vcf file where I already have the dbSNP identifiers listed in the ID column. I am simply trying to count the number of variants in the dataset with a dbSNP (rs...) present (nulls are denoted as ".").

I've tried the below linux command but it seems incorrect as I've manually counted a different number:

grep -v rs SRR1611183.gatk.vcf | wc -l
1812

I'm looking for something that really targets the ID column, either counting dbSNP's or the nulls.

vcf dbSNP • 2.8k views
ADD COMMENT
4
Entering edit mode
6.7 years ago

Try this

cut -f3 SRR1611183.gatk.vcf | grep -e "^rs" | wc -l

It first select just the third column of the vcf file (which should be the ID column), than it greps all lines starting with "rs".

fin swimmer

ADD COMMENT
0
Entering edit mode

thanks for the reply. This returns 0.

I've also tried to count the nulls with the following linux command but also returned 0:

cut -f3 SRR1611183.gatk.vcf | grep -v "." | wc -l
ADD REPLY
1
Entering edit mode

Please post the first lines of your vcf file, including the header and some variants.

ADD REPLY
0
Entering edit mode
##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller  --output SRR1611183.gatk.vcf --intervals chr17.cds.bed --input SRR1611183.dedup.bam --reference genome.fa  --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --disable-tool-default-annotations false --emit-ref-confidence NONE --gvcf-gq-bands 1 --gvcf-.........

The actual header starts on line 52...

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878

...Then the actual data:

chr17   6115    .   G   C   2506.77 .   AC=2;AF=1.00;AN=2;DP=65;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.723    GT:AD:DP:GQ:PL  1/1:0,65:65:99:2535,196,0
chr17   6157    .   A   G   2638.77 .   AC=2;AF=1.00;AN=2;DP=66;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.73;SOR=0.818    GT:AD:DP:GQ:PL  1/1:0,66:66:99:2667,199,0
chr17   505139  .   G   A   2265.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=0.497;ClippingRankSum=0.000;DP=151;ExcessHet=3.0103;FS=14.242;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=15.11;ReadPosRankSum=0.178;SOR=1.804  GT:AD:DP:GQ:PL  0/1:84,66:150:99:2294,0,2990
chr17   636266  .   A   AGGGCC  40.73   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.431;ClippingRankSum=0.000;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=10.18;ReadPosRankSum=0.674;SOR=0.693    GT:AD:DP:GQ:PL  0/1:2,2:4:78:78,0,102
chr17   636320  .   T   TCTC    107.73  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.952;ClippingRankSum=0.000;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=11.97;ReadPosRankSum=-0.700;SOR=1.609  GT:AD:DP:GQ:PL  0/1:5,4:9:99:145,0,208

This continues to line 1813.

Danke!

ADD REPLY
1
Entering edit mode

I don't know whether this is a problem with the forum software. But when I copy and paste your example in a new file the columns aren't delimited by tabs but by different number of spaces.

Let's go through the command above step-by-step. What's the output of cut -f3 SRR1611183.gatk.vcf ?

fin swimmer

ADD REPLY
0
Entering edit mode

A bunch of header information, then a series of "."

.
.
.
ADD REPLY
1
Entering edit mode

Ok, so I guess you have no rs Id's in your vcf file. The command you use (grep -v ".") tries to get all lines which does not contain a ".". As all your line have a "." it could not find any which meets this criteria.

Why do you think you must have rs IDs in your vcf file? Which exact command did you use to annotate your vcf?

fin swimmer

ADD REPLY
0
Entering edit mode

Thanks - I actually do have rs numbers in the ID field so I'm not sure whats going on. For example, I pasted the data into excel and found out of 1761 variants (POS) and 1467 (83%) had dbSNPs (rs...) identified. I just can't figure out a simple way to cut and count the rs#'s in the file as is. I guess I could remove all the information before line 52 (which is where actual header begins) and try again?

ADD REPLY
1
Entering edit mode

Is it possible that you upload the vcf file somewhere? A shortend version with some variants having a rs id and some not should be enough. And please let all the header Informations in.

ADD REPLY
1
Entering edit mode

I came up with a solution:

awk '$3=="." {print $0}' SRR1611183.gatk.dbsnp.vcf > ID_nulls.vcf
ADD REPLY
2
Entering edit mode

Congrats if you find a solution for you.

But I don't understand why my solution did'nt work for you if this awk way works.

Ok, the modified version with grep -v "." could not work correct, as "." have the special meaning "any character". So if you want to grep the dot you need to escape it like this grep -v "\." But as your not interested in all rows of the third column containing no dot (you would grep the header of the column as well) this is not the right solution. You want to check the first characters if the line, so you need grep -e "^rs".

fin swimmer

ADD REPLY
0
Entering edit mode

Please write it as an answer and close it so that others can easily find it and get help from. Thanks for providing better solution.

ADD REPLY
0
Entering edit mode
bcftools view -i '%ID=="."' test.vcf.gz | grep -v "#" | datamash count 3
ADD REPLY

Login before adding your answer.

Traffic: 1782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6