Problems Extracting Non-Snps From A Vcf File
1
0
Entering edit mode
11.9 years ago
GPR ▴ 390

Hello,

In an SNP analysis, I am trying to extract those editing sites no found in the dbSNPs vcf file I have downloaded a couple of files (All SNPs and Common/Medical SNPs) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF.

Following this, I have compared my VarScan *.vcf outputs with the SNP.vcf ones using 3 different approaches:

VarScan compare input.vcf SNP.vcf unique1 input-SNPvcf 

bedtools intersect -v -a input.vcf -b SNP.vcf > input-SNP.vcf 

bedops --not-element-of -1  input-sorted.bed SNP-sorted.bed > inputs-sorted-SNP.bed

In all 3 cases, the SNP-output is identical to the input.vcf/bed.

These command-lines however work when I use an alu.bed or a repeat-masker-bed.

Is it just that my analysis contains no known SNPs? I have discarded for obvious reasons.

Can somebody point a the reason/solution to this problem?

Thanks, G.

varscan bedtools dbsnp • 3.2k views
ADD COMMENT
2
Entering edit mode

in general I recommend a "defensive style" of data analysis - first try to verify that the commands as invoked do what you think they should be doing. For example add a new, artificial entry to your input file and see wether you can find it via your commands.

ADD REPLY
2
Entering edit mode

Do the chromosome names match? In particular, is one file using 'chr' prefix and the other not?

ADD REPLY
1
Entering edit mode

That's actually the reason of my despair. The dbSNPs file doesn't have the 'chr' prefix. Is there a way of adding the 'chr' prefix to the SNP.vcf?

ADD REPLY
2
Entering edit mode

Just use awk or perl or another scripting tool to edit the first field of the non-comment lines in your VCF file.

ADD REPLY
2
Entering edit mode
11.8 years ago
dankoboldt ▴ 140

GPR, thanks for the question. Sean and Alex, thank you (again) for answering!

First of all, here's a quick command to strip a leading "chr" from a VCF file (this actually edits the file): perl -pi -e 's/chr//' my.vcf

Alternatively you could do so while creating a new file perl -pe 's/chr//' my.vcf >new.vcf

I'd actually recommend a different tool for comparing variant calls to dbSNP using VCF files. That tool (also developed here) is called joinx, and it's available on GitHub:

http://gmt.genome.wustl.edu/joinx/1.6/index.html

The joinx "vcf-annotate" command will use the information in dbSNP to update the ID and INFO fields of your variant VCF... then you have a new VCF in which known SNPs are marked with dbSNP information, which is broadly useful.

ADD COMMENT
0
Entering edit mode

Wow, you went on an major answering blitz - no VarScan question left behind! Awesome. And thanks a lot.

ADD REPLY
0
Entering edit mode

Just saw this. Thanks for the recommendation. I actually ended up using awk to change the Chr IDs. I used bedtools to intersect difference and it seems fine. I will however try your suggestion.

ADD REPLY

Login before adding your answer.

Traffic: 2552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6