Hello,
In an SNP analysis, I am trying to extract those editing sites no found in the dbSNPs vcf file I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.
Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:
VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf
bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf
bedops --not-element-of -1 input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed
In all 3 cases, the SNP-output is identical to the input.vcf/bed.
These command-lines however work when I use an alu.bed or a repeat-masker-bed.
Is it just that my analysis contains no known SNPs? I have discarded for obvious reasons.
Can somebody point a the reason/solution to this problem?
Thanks, G.
in general I recommend a "defensive style" of data analysis - first try to verify that the commands as invoked do what you think they should be doing. For example add a new, artificial entry to your input file and see wether you can find it via your commands.
Do the chromosome names match? In particular, is one file using 'chr' prefix and the other not?
That's actually the reason of my despair. The dbSNPs file doesn't have the 'chr' prefix. Is there a way of adding the 'chr' prefix to the SNP.vcf?
Just use
awk
orperl
or another scripting tool to edit the first field of the non-comment lines in your VCF file.