Remove a list of positions from a VCF file
8.1 years ago
shinken123 ▴ 150

Hi

I have a list of chromosomes and positions that looks like this:

1   10045
1   93056
1   109272
1   127711
1   127822
.
.
.

I would now like to remove these positions from my VCF file. Do you know how to do this?

SNP vcf filter • 12k views
6.0 years ago

bcftools can do this:

$ bcftools view -T ^list_snp_exclude.txt input.vcf > output.vcf

The ^ in front of the file name tells bcftools to exclude the positions listed in that file instead of keeping them.
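bcftools reads the targets file as tab-delimited CHROM and POS columns, so a list whose columns are separated by spaces may need normalising first. A minimal sketch, assuming a space-separated list.txt (the file names and toy contents here are made up for illustration):

```shell
tmp=$(mktemp -d)

# Toy list with space-separated columns, as it might look after copy-pasting.
printf '1   10045\n1   93056\n' > "$tmp/list.txt"

# Rewrite every line as CHROM<TAB>POS.
awk -v OFS='\t' '{ print $1, $2 }' "$tmp/list.txt" > "$tmp/list_snp_exclude.txt"

# Count the lines that consist of exactly two tab-separated fields.
n_ok=$(awk -F'\t' 'NF == 2 { n++ } END { print n + 0 }' "$tmp/list_snp_exclude.txt")
echo "tab-delimited lines: $n_ok"
```

The resulting list_snp_exclude.txt can then be passed to the bcftools command above.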

fin swimmer


Just wondering: if I wish to add two more columns, "Alternate" and "Reference", what should I change in the above command? For me this didn't work.
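One possible way to match on alleles as well is sketched here with awk rather than bcftools. It assumes a hypothetical four-column, tab-separated list (CHROM, POS, REF, ALT, in that order) and compares the ALT field as a whole string, so a multi-allelic record such as "G,T" would only be removed if the list contains exactly "G,T". File names and toy data are made up for illustration.

```shell
tmp=$(mktemp -d)

# Hypothetical exclusion list: CHROM, POS, REF, ALT (tab-separated).
printf '1\t10045\tA\tG\n' > "$tmp/list_alleles.txt"

# Toy VCF body: two records at the same position with different ALT alleles.
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t10045\t.\tA\tG\n1\t10045\t.\tA\tT\n' > "$tmp/input.vcf"

awk -F'\t' '
    NR == FNR { skip[$1 FS $2 FS $3 FS $4] = 1; next }  # first file: load the list
    /^#/      { print; next }                            # keep header lines
    !(($1 FS $2 FS $4 FS $5) in skip)                    # VCF: REF is column 4, ALT is column 5
' "$tmp/list_alleles.txt" "$tmp/input.vcf" > "$tmp/output.vcf"

kept_alt=$(awk -F'\t' '!/^#/ { print $5 }' "$tmp/output.vcf")
echo "ALT allele of the record kept: $kept_alt"
```

Only the record whose position and alleles all match the list is dropped; the A>T record at the same position survives.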

8.1 years ago

a simple grep would do:

grep -vf list.txt file.vcf

Though this was posted a while ago, I have to point out that grep with just the -vf flags will remove the positions in list.txt from file.vcf, but it will also remove any longer position that contains the digits of a listed position. For example, you may want to remove position 10045, but if the VCF contains positions 100450, 1004511, 100453489, etc., these will be removed as well.

In this case the -w flag should be added as well. It restricts grep to whole-word matches: a pattern matches only where it is not directly preceded or followed by a letter, digit, or underscore.
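A small demonstration of the difference, using made-up toy files (one listed position, 10045, and a VCF body that also contains 100450):

```shell
tmp=$(mktemp -d)

# Toy data; file names and contents are invented for illustration.
printf '1\t10045\n' > "$tmp/list.txt"
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t10045\t.\tA\tG\n1\t100450\t.\tC\tT\n' > "$tmp/file.vcf"

# Without -w the pattern also matches inside 100450, so both records are dropped.
kept_plain=$(grep -vf "$tmp/list.txt" "$tmp/file.vcf" | awk '!/^#/ { n++ } END { print n + 0 }')

# With -w only the exact position 10045 is dropped.
kept_word=$(grep -vwf "$tmp/list.txt" "$tmp/file.vcf" | awk '!/^#/ { n++ } END { print n + 0 }')

echo "records kept without -w: $kept_plain"
echo "records kept with -w: $kept_word"
```

Note that grep still looks at the whole line, so a listed string that happens to appear as a whole word elsewhere in a record would also cause that record to be removed.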


Thank you very much. The only problem with grep for me was that it was very slow and memory-consuming, so I used this link

So I transformed my file into a bed file like this:

1   6405767 6405767
1   8108895 8108895
1   8623336 8623336
.
.
.

Maybe it is not the most elegant way to do it, but it works for me.
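The conversion itself can be sketched with awk by repeating the position as both start and end, which reproduces the three-column layout shown above. The file names are made up, and which tool reads the result is not stated here. Standard BED uses 0-based, half-open intervals, so a tool that follows the BED convention strictly may expect the start to be the position minus one; that is worth checking against the tool's documentation.

```shell
tmp=$(mktemp -d)

# Toy two-column list (CHROM, POS); names and contents are illustrative.
printf '1\t6405767\n1\t8108895\n' > "$tmp/list.txt"

# Emit CHROM, POS, POS separated by tabs.
awk -v OFS='\t' '{ print $1, $2, $2 }' "$tmp/list.txt" > "$tmp/list.bed"

first=$(head -n 1 "$tmp/list.bed")
echo "$first"
```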

6.0 years ago
YocelynGG ▴ 70

Hi!!

If someone has the same question, this command solved the problem for me:

grep -Fwvf list_snp_exclude file.vcf > new_filter.vcf

list_snp_exclude is a list in the format Chromosome_name"\t"Position:

Chrom_177   4393715
Chrom_177   4394618
Chrom_177   4395751
Chrom_215   4395751
Chrom_215   4396373
. . .

How is this different from Jorge's answer above?


My answer was very simple. This one adds more grep functionality: the -F option looks for fixed strings rather than regular expressions, and the -w option matches whole words only rather than arbitrary substrings. I don't know how -F works in conjunction with -w, but it looks like an overall faster option. If performance matters, a better-anchored regex (which needs the -P option) might be even faster:

sed 's/^/^/; s/$/\\t/' list.txt | grep -vPf - file.vcf
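If exact matching on the first two columns is the goal, another option is to swap grep for an awk lookup table keyed on CHROM and POS. This is only a sketch under the assumption that both files are tab-separated; it is a different technique from the grep commands in this thread, and the toy data below is invented for illustration.

```shell
tmp=$(mktemp -d)

# Toy exclusion list and VCF; names follow the thread, contents are made up.
printf 'Chrom_177\t4393715\nChrom_215\t4395751\n' > "$tmp/list_snp_exclude"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nChrom_177\t4393715\t.\tA\tG\nChrom_177\t43937150\t.\tC\tT\nChrom_215\t4395751\t.\tG\tA\n' > "$tmp/file.vcf"

awk -F'\t' '
    NR == FNR { skip[$1 FS $2] = 1; next }  # first file: remember CHROM<TAB>POS
    /^#/      { print; next }               # keep header lines untouched
    !(($1 FS $2) in skip)                   # print records whose key is not listed
' "$tmp/list_snp_exclude" "$tmp/file.vcf" > "$tmp/new_filter.vcf"

kept=$(awk '!/^#/ { n++ } END { print n + 0 }' "$tmp/new_filter.vcf")
echo "records kept: $kept"
```

Because the comparison is on whole fields, position 43937150 is kept even though it starts with the digits of a listed position, and text in other columns can never trigger a removal.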