Hello all,
I have vcfs from exome sequencing and I wanted to filter out from it variants from homopolymer regions with bed file covering the homopolymer regions by bedtools subtract, but to my surprise, amount of variants (rows in the file) increased after bedtools subtract. The command I used was:
bedtools subtract -header -a test.vcf -b ../../../external/homopolymer_5bp_chr01tochrY.bed > test_subtracted.vcf
cat test.vcf |wc
100000 1099205 43965799
cat test_subtracted.vcf |wc
100680 1106685 44503053
Amount of rows in header is the same:
cat test_subtracted.vcf | grep "#" |wc
144 789 11114
cat test.vcf | grep "#" |wc
144 789 11114
bedtools intersect
with original file (test.vcf, 100 000 variants) and file with "filtered-out" homopolymers shows 100684 rows). I am really confused - shouldn't intersect be maximally 100 000 variants?
bedtools intersect -header -a test.vcf -b test_subtracted.vcf | wc
100684 1106729 44505289
Do you have similar experience or at least some advice about how to continue? I read the bedtools manual properly but at this moment I am very confused.
Thank you very much!!
Martina
Please use the formatting bar (especially the
code
option) to present your post better. You can use backticks for inline code (`text` becomestext
), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.Ok, thank you, next time I will focuse on it
You can try running the diff command on the original and subtracted VCF file to see what the changes were. It may help figure out what is going on.