Question

How to calculate true positive indels?

0

Entering edit mode

9.6 years ago

tangmanhd • 0

Hi,

I have a VCF file (ref.vcf) that states where are exactly the insertions and deletions in my genome. And my indel detection method produces its own VCF file (let's call it test.vcf).

To calculate the True Positives, I need detect the intersection of test.vcf and ref.vcf (I use exact intersection for the sake of simplicity for now). The True Positives, are the features in test.vcf that are also in ref.vcf. It is easy to understand the definition. But how to code it by hand? Or is there some software or package (in R or matlab)can be used to give the the correct results? Thanks.

R genome • 3.7k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.6 years ago by tangmanhd • 0

0

Entering edit mode

You want to calculate the intersection of two lists. R has a function named intersect().

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by karl.stamm 4.1k

0

Entering edit mode

If you want to classify true/false detection rates, you need to use synthetic data.

ADD REPLY • link 9.6 years ago by Brian Bushnell 20k

Ram · Answer 1 · 2015-12-18

bedtools intersect -f 1.0 -r -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15'

Bedtools intersect

Bedtools allows to control the required overlap size (-f parameter) to adjust your criteria for being a true positive. For example, if you have very long inserts a 90% overlap might be sufficient.

Anyway, look at bcftools norm to make sure that the Indels from both VCF files are represented in the same way. Unfortunately, Indels can often be represented in more than one way, particularly in low-complexity regions. I recently ran into this problem...

EDIT: Change to include REF and ALT comparison between reference and alternative.

landesfeind@bionf-local:/tmp$ cat ref.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ cat test.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GCA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .
chr1    210    .    C    CA    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t&"} $4==$14 && $5 == $15'
chr1    110    .    T    TT    10    PASS    .    .    .    chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15' | wc -l
1

Ram · Answer 2 · 2015-12-18

2

Entering edit mode

9.6 years ago

venu 7.1k

If both ref.vcf and test.vcf are in same format, you can use vcftools. In particular you may need vcf-isec

vcf-isec -n +2 ref.vcf.gz test.vcf.gz > TrueIndels.vcf

and before using vcftools, bgzip your vcf files and index with tabix.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 9.6 years ago by venu 7.1k

score 0 · Answer 3 · 2015-12-17

0

Entering edit mode

9.6 years ago

Tej Sowpati ▴ 250

Hi,

Is it mandatory that you have to use R or MATLAB? If not, have a look at vcftools. You can compare and filter vcf files in several different ways, and one of the most frequent usages is to identify common indels.

Cheers

ADD COMMENT • link 9.6 years ago by Tej Sowpati ▴ 250

Ram · Answer 4 · 2015-12-19

The GRanges and VariantAnnotation Bioconductor packages make this pretty easy in R. Roughly, you will need to:

Use readVcf to read in your two VCF files.
Use findOverlaps(...., type="equal").

Alternatively, simply concatenate the chromosome, start, and end into a string for each variant for each VCF file. Then, use a simple intersect on the two character vectors.