How to calculate true positive indels?
4
0
Entering edit mode
9.0 years ago
tangmanhd • 0

Hi,

I have a VCF file (ref.vcf) that states where are exactly the insertions and deletions in my genome. And my indel detection method produces its own VCF file (let's call it test.vcf).

To calculate the True Positives, I need detect the intersection of test.vcf and ref.vcf (I use exact intersection for the sake of simplicity for now). The True Positives, are the features in test.vcf that are also in ref.vcf. It is easy to understand the definition. But how to code it by hand? Or is there some software or package (in R or matlab)can be used to give the the correct results? Thanks.

R genome • 3.2k views
ADD COMMENT
0
Entering edit mode

You want to calculate the intersection of two lists. R has a function named intersect().

ADD REPLY
0
Entering edit mode

If you want to classify true/false detection rates, you need to use synthetic data.

ADD REPLY
2
Entering edit mode
9.0 years ago
bedtools intersect -f 1.0 -r -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15'

Bedtools intersect

Bedtools allows to control the required overlap size (-f parameter) to adjust your criteria for being a true positive. For example, if you have very long inserts a 90% overlap might be sufficient.

Anyway, look at bcftools norm to make sure that the Indels from both VCF files are represented in the same way. Unfortunately, Indels can often be represented in more than one way, particularly in low-complexity regions. I recently ran into this problem...

EDIT: Change to include REF and ALT comparison between reference and alternative.

landesfeind@bionf-local:/tmp$ cat ref.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ cat test.vcf
##fileformat=VCFv4.2
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample1
chr1    10    .    G    GCA    10    PASS    .    .    .
chr1    110    .    T    TT    10    PASS    .    .    .
chr1    210    .    C    CA    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t&"} $4==$14 && $5 == $15'
chr1    110    .    T    TT    10    PASS    .    .    .    chr1    110    .    T    TT    10    PASS    .    .    .

landesfeind@bionf-local:/tmp$ bedtools intersect -loj -a test.vcf -b ref.vcf | awk 'BEGIN{FS="\t"} $4==$14 && $5 == $15' | wc -l
1
ADD COMMENT
2
Entering edit mode
9.0 years ago
venu 7.1k

If both ref.vcf and test.vcf are in same format, you can use vcftools. In particular you may need vcf-isec

vcf-isec -n +2 ref.vcf.gz test.vcf.gz > TrueIndels.vcf

and before using vcftools, bgzip your vcf files and index with tabix.

ADD COMMENT
0
Entering edit mode
9.0 years ago
Tej Sowpati ▴ 250

Hi,

Is it mandatory that you have to use R or MATLAB? If not, have a look at vcftools. You can compare and filter vcf files in several different ways, and one of the most frequent usages is to identify common indels.

Cheers

ADD COMMENT
0
Entering edit mode
9.0 years ago

The GRanges and VariantAnnotation Bioconductor packages make this pretty easy in R. Roughly, you will need to:

  1. Use readVcf to read in your two VCF files.
  2. Use findOverlaps(...., type="equal").

Alternatively, simply concatenate the chromosome, start, and end into a string for each variant for each VCF file. Then, use a simple intersect on the two character vectors.

ADD COMMENT

Login before adding your answer.

Traffic: 1858 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6