Question

Number of variants increased after bedtools subtract

6.0 years ago

Martina ▴ 30

Hello all,

I have VCFs from exome sequencing and wanted to filter out variants in homopolymer regions, using bedtools subtract with a BED file that covers those regions. To my surprise, the number of variants (rows in the file) increased after bedtools subtract. The command I used was:

bedtools subtract -header -a test.vcf -b ../../../external/homopolymer_5bp_chr01tochrY.bed > test_subtracted.vcf

cat test.vcf |wc
 100000 1099205 43965799
cat test_subtracted.vcf |wc
 100680 1106685 44503053

The number of header rows is the same:

cat test_subtracted.vcf | grep "#" |wc
    144     789   11114
cat test.vcf | grep "#" |wc
    144     789   11114

Running bedtools intersect on the original file (test.vcf, 100 000 variants) and the file with the homopolymers "filtered out" gives 100684 rows. I am really confused: shouldn't the intersection contain at most 100 000 variants?

bedtools intersect -header -a test.vcf -b test_subtracted.vcf | wc
 100684 1106729 44505289

Have you had a similar experience, or do you have any advice on how to proceed? I have read the bedtools manual carefully, but at the moment I am very confused.

Thank you very much!!
Martina

next-gen • 2.8k views

updated 6.0 years ago by finswimmer 16k • written 6.0 years ago by Martina ▴ 30

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

6.0 years ago by Ram 45k

OK, thank you. I will pay attention to that next time.

6.0 years ago by Martina ▴ 30

You can try running the diff command on the original and subtracted VCF file to see what the changes were. It may help figure out what is going on.
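
A rough sketch of such a check with standard shell tools: count the data rows in both files and list any records that appear more than once in the subtracted file. The toy files here are hypothetical stand-ins for test.vcf and test_subtracted.vcf, built only so the snippet runs on its own.

```shell
# Build two toy VCFs in a temporary directory (hypothetical content).
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp/test.vcf"
printf 'chr4\t100\t.\tCCAGCAG\tC\t50\t.\t.\n' >> "$tmp/test.vcf"
printf 'chr4\t200\t.\tA\tG\t50\t.\t.\n' >> "$tmp/test.vcf"

# Simulate a subtract result in which the first record (line 3) was written twice.
{ cat "$tmp/test.vcf"; sed -n '3p' "$tmp/test.vcf"; } > "$tmp/test_subtracted.vcf"

# Count data rows only; '^#' anchors the match to header lines.
n_before=$(grep -vc '^#' "$tmp/test.vcf")
n_after=$(grep -vc '^#' "$tmp/test_subtracted.vcf")
echo "records before: $n_before, after: $n_after"

# Show records that occur more than once, prefixed with their count.
grep -v '^#' "$tmp/test_subtracted.vcf" | sort | uniq -cd

# Number of distinct records that are duplicated.
n_dup=$(grep -v '^#' "$tmp/test_subtracted.vcf" | sort | uniq -d | wc -l | tr -d ' ')
echo "distinct duplicated records: $n_dup"
```

On real data, replace the two toy paths with the original and subtracted VCFs. If the extra rows are all exact copies of existing records, the duplicated-record listing will account for the difference in row counts.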

6.0 years ago by colin.kern ★ 1.1k

score 0 · Answer 1 · 2019-07-31

The problem is that a variant spanning multiple regions in your BED file is output multiple times.

See this example:

input.vcf

##fileformat=VCFv4.2
##contig=<ID=chr4,length=190214555>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06986
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1

input.bed

chr4    3074870 3074878
chr4    3074880 3074885

output

$ bedtools subtract -header -a input.vcf -b input.bed 
##fileformat=VCFv4.2
##contig=<ID=chr4,length=190214555>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA06986
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1
chr4    3074876 .   CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA    CCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCA 23.2353 .   .   GT  0/1
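
To see which records in a file are affected, one option is to count how many BED intervals each record overlaps. The awk sketch below does this for toy data. It assumes that a record's footprint starts at POS (1-based) and extends for the length of REF, and that BED intervals are 0-based and half-open. The file names and the shortened alleles are hypothetical.

```shell
tmp=$(mktemp -d)
# Toy VCF body: a multi-base deletion and a SNV (hypothetical records).
printf 'chr4\t3074876\t.\tCCAGCAGCAGCA\tCCA\t23\t.\t.\n' > "$tmp/toy.vcf"
printf 'chr4\t3074900\t.\tA\tG\t40\t.\t.\n' >> "$tmp/toy.vcf"
# Same two intervals as in input.bed above.
printf 'chr4\t3074870\t3074878\nchr4\t3074880\t3074885\n' > "$tmp/toy.bed"

hits=$(awk -F'\t' '
  # First file (BED): remember every interval.
  NR == FNR { chrom[NR] = $1; s[NR] = $2; e[NR] = $3; n = NR; next }
  # Second file (VCF): skip header lines.
  /^#/ { next }
  {
    start = $2 - 1                # convert 1-based POS to a 0-based start
    end = start + length($4)      # half-open end, from the REF length
    k = 0
    for (i = 1; i <= n; i++)
      if (chrom[i] == $1 && start < e[i] && end > s[i]) k++
    print $1 ":" $2 "\t" k        # record position and overlap count
  }' "$tmp/toy.bed" "$tmp/toy.vcf")
echo "$hits"
```

Records with a count above 1 are the candidates for being written more than once by bedtools subtract. The nested loop is fine for small files but will be slow on a genome-wide BED file.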

If you want to remove every variant that overlaps any entry in the BED file, use bcftools:

$ bcftools view -T ^input.bed input.vcf > output.vcf

(The ^ before input.bed tells bcftools to exclude the regions listed in input.bed from the output.)
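
If you would rather stay within bedtools, `bedtools intersect -v` should give a comparable result: with `-v`, only the entries in `-a` that have no overlap with `-b` are reported, so each record is either kept once or dropped. This is a minimal sketch on toy files with hypothetical names and content. It only runs the command when bedtools is found on the PATH.

```shell
tmp=$(mktemp -d)
# Toy VCF: the first record overlaps both BED intervals, the second overlaps none.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$tmp/toy.vcf"
printf 'chr4\t3074876\t.\tCCAGCAGCAGCA\tCCA\t23\t.\t.\n' >> "$tmp/toy.vcf"
printf 'chr4\t3074900\t.\tA\tG\t40\t.\t.\n' >> "$tmp/toy.vcf"
printf 'chr4\t3074870\t3074878\nchr4\t3074880\t3074885\n' > "$tmp/toy.bed"

if command -v bedtools >/dev/null 2>&1; then
  # -v: report only records in -a with no overlap in -b; -header keeps the VCF header.
  bedtools intersect -v -header -a "$tmp/toy.vcf" -b "$tmp/toy.bed" > "$tmp/filtered.vcf"
  kept=$(grep -vc '^#' "$tmp/filtered.vcf")
  echo "records kept: $kept"
else
  echo "bedtools not found on PATH; skipping" >&2
fi
```

bedtools subtract also has an `-A` option that removes an entire feature on any overlap, which may be another way to avoid the duplicated rows. Whichever tool you use, it is worth checking a few multi-base variants by hand, because the tools may differ in how they treat a record that starts outside an interval but extends into it.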