SNP filtering
1
0
Entering edit mode
3.4 years ago
rheab1230 ▴ 140

Hello everyone,

I am trying to filter .vcf files to contain only single point SNP and remove indel.

I have remove indel from the .vcf file by using the following command:

grep -v $'\t-\t' in.vcf > no_indels.vcf

I am able to remove indel from .vcf file.

But I also want to remove any SNP corresponding to CAT --> C and C --> CTATGG

I just want my .vcf file to contain only A-T, or T-A or G-C.

I mean just one single point mutation only.

Can anyone please help me with this?

vcf SNP • 4.0k views
ADD COMMENT
1
Entering edit mode
3.4 years ago
Ram 44k

Please don't use grep to do this. Use a reliable VCF parser instead.

bcftools view -v snps should do the trick.

ADD COMMENT
0
Entering edit mode

Okay,

I have use bcftools as mentioned to filter out SNP from .vcf files.

I got the output file. but I also received some message after running the command.

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String
ADD REPLY
0
Entering edit mode

These warnings can be ignored most of the time, but it's usually better to fix them. For example, your header seems to be improper. If you're working on uncompressed .vcf files directly, try running

bgzip in.vcf
tabix -p vcf in.vcf.gz

after installing htslib. This will make sure your VCF header is proper.

ADD REPLY
0
Entering edit mode

Okay, I am able to do it. Thank You. Do you also know how to update id coloumn in .vcf file with rsid in dbSNP?

ADD REPLY
0
Entering edit mode

Your question on this topic was already answered by Wouter, I believe. Did that solution not work for you? Explore bcftools annotate if you'd like another option.

ADD REPLY
0
Entering edit mode

No, Its not working. the output file is not as required by predixcan. I am exploring bcftools annotate option But i want to know how should i download dbSNP142.vcf file. bcftools annotate function require dbSNP.vcf file but I have dbSNP.txt file.

ADD REPLY
0
Entering edit mode

I actually use bcftools to remove indels from .vcf files.as you mentioned. It worked. now my .vcf files contain only SNP. Then i thought of using annovar software to annotate my .vcf file with dbSNP142. but its getting annotated and its updated with rsid. but now indel is getting incorporated in my file

ADD REPLY
0
Entering edit mode

That does not make sense - if you're trying to annotate an SNPs only input file, the annotation process cannot add new entries (which is what would need to happen for you to see indels again).

ADD REPLY
0
Entering edit mode

You can annotate using a text file as well. NCBI's FTP site will have the dbSNP VCF file, but you should be able to use a tab delimited file with bcftools annotate

ADD REPLY
0
Entering edit mode

I am using bcftools. the code is:

#!/bin/bash
bcftools annotate -c ID \
-a 00-All.vcf.gz \
GEUVADIS.chr1.genotype.vcf.gz \
> file1.vcf

the error is:

Failed to open GEUVADIS.chr1.genotype.vcf.gz: not compressed with bgzip

So then i created the file using bgzip.

the code is :

bgzip GEUVADIS.chr1.genotype.vcf
 tabix GEUVADIS.chr1.genotype.vcf.gz

then i run the code again:

the code:

!/bin/bash
bcftools annotate -c ID \ -a 00-All.vcf.gz \ GEUVADIS.chr1.genotype.vcf.gz \ > file1.vcf

Now i am getting this error:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
Failed to open 00-All.vcf.gz: could not load index
ADD REPLY
0
Entering edit mode

Please try tabix -p vcf instead of just tabix.

Please see how I've combined and formatted your comments so it's easier to read - you can also use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted.
code_formatting

ADD REPLY
0
Entering edit mode

I now use tabix -p GEUVADIS.chr1.genotype.vcf.gz but still getting error: unrecognized preset 'GEUVADIS.chr1.genotype.vcf.gz'

ADD REPLY
0
Entering edit mode

Please pay close attention to the command I wrote initially. The correct command is tabix -p vcf GEUVADIS.chr1.genotype.vcf.gz The -p vcf asks tabix to use the vcf preset, the vcf part is not a stand in for the actual vcf file.

ADD REPLY
0
Entering edit mode

I didn't understand. I am sorry but i am completely new to bioinformatics. Should i use the following command: tabix -p GEUVADIS.chr1.genotype.vcf.gz GEUVADIS.genotype.vcf.gz(all the combined vcf files)?

ADD REPLY
0
Entering edit mode
tabix -p vcf GEUVADIS.chr1.genotype.vcf.gz

Type that exactly

ADD REPLY
0
Entering edit mode

I am able to do the above command and not receive any error.

Then when I use annotate function of bcftools its showing error. the command:

bcftools annotate -c ID -a 00-All.vcf.gz GEUVADIS.chr2.genotype.vcf.gz

the error:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.
ADD REPLY
0
Entering edit mode

Those are warnings and should not kill the process, but they indicate there's something wrong with the VCF file. Can you check the VCF headers and see what's going on?

I'm afraid this thread is becoming unmanageable and quite niche, and I won't be able to help you much longer with this.

ADD REPLY

Login before adding your answer.

Traffic: 1021 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6