Best Way To Compare Output Snp/Indels From Different Software?
6
6
Entering edit mode
13.6 years ago
Hmm ▴ 500

I have a list of SNPs and indels from 4 different software.

IndelGenotyper (broad tool)—for calling somatic indels Bambino—for calling SNP/Indels Somaticsniper—SNP calling Varscan – indel/snp/LOH

Now my task is to compare SNP and indels from different software. I need some input as to how to do that. I only have output in VCF format for IndelGenotyper. The rest of the software have the following output format:

IndelGenotyper (VCF):

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    normal    tumor    
1    177    .    A    AC    .    .    N_AC=295,310;N_DP=867;N_MM=3.1084745,4.6678634;N_MQ=18.840677,20.062836;N_NQSBQ=26.988136,22.908127;N_NQSMM=0.0111864405,0.13539149;N_SC=12,283,116,441;T_AC=234,239;T_DP=691;T_MM=3.2051282,4.7743363;T_MQ=20.816238,21.225664;T_NQSBQ=27.663815,23.273888;T_NQSMM=0.011976048,0.13911061;T_SC=9,225,85,367    GT:GQ    0/1:0    0/1:0
1    230    .    AC    A    .    .    N_AC=218,227;N_DP=663;N_MM=4.03211,4.456422;N_MQ=16.825687,20.529816;N_NQSBQ=32.26916,29.760443;N_NQSMM=0.002770083,0.04280557;N_SC=18,200,73,363;T_AC=162,170;T_DP=465;T_MM=4.351852,4.745763;T_MQ=18.141975,21.064407;T_NQSBQ=32.513077,30.064604;T_NQSMM=0.0124533,0.045261122;T_SC=5,157,34,261    GT:GQ    0/1:0    0/1:0
1    247    .    TA    T    .    .    N_AC=156,162;N_DP=392;N_MM=4.467949,4.3130436;N_MQ=20.666666,19.878262;N_NQSBQ=30.026958,30.963959;N_NQSMM=0.019897304,0.07715736;N_SC=31,125,65,165;T_AC=104,113;T_DP=267;T_MM=4.2115383,4.766234;T_MQ=23.403847,20.603895;T_NQSBQ=31.615385,30.121395;T_NQSMM=0.011538462,0.06525038;T_SC=16,88,24,130    GT:GQ    0/1:0    0/1:0
1    254    .    TA    T    .    .    N_AC=123,144;N_DP=294;N_MM=4.593496,4.613333;N_MQ=21.04878,21.833334;N_NQSBQ=30.16913,29.987564;N_NQSMM=0.018883415,0.0987564;N_SC=27,96,63,87;T_AC=96,108;T_DP=195;T_MM=4.8854165,4.2988505;T_MQ=24.5625,22.16092;T_NQSBQ=30.090431,29.74282;T_NQSMM=0.027339643,0.086142324;T_SC=19,77,23,64    GT:GQ    0/1:0    0/1:0
1    3820    .    TC    T    .    .    N_AC=8,8;N_DP=13;N_MM=1.5,1.8;N_MQ=14.5,12.0;N_NQSBQ=33.7125,31.763159;N_NQSMM=0.0,0.0;N_SC=0,8,0,5;T_AC=8,8;T_DP=12;T_MM=1.25,1.25;T_MQ=10.625,13.25;T_NQSBQ=31.725,31.914286;T_NQSMM=0.0,0.028571429;T_SC=0,8,0,4    GT:GQ    0/1:0    0/1:0
1    235278    .    C    CT    .    .    N_AC=2,2;N_DP=18;N_MM=1.5,0.4375;N_MQ=20.5,12.0625;N_NQSBQ=18.45,33.529034;N_NQSMM=0.0,0.006451613;N_SC=1,1,15,1;T_AC=5,5;T_DP=15;T_MM=1.6,1.1;T_MQ=26.0,26.5;T_NQSBQ=34.2,27.813187;T_NQSMM=0.0,0.054945055;T_SC=5,0,6,4    GT:GQ    0/1:0    0/1:0

somaticsniper:

Chromosome    Position    Reference base    IUB genotype of tumor    Somatic Score    Tumor Consensus quality    Tumor SNV quality    Tumor RMS mapping quality    Depth in tumor (# of reads crossing the position)    Depth in normal (# of reads crossing the position)
1    109    A    W    0    228    228    25    2345    3196
1    177    A    M    0    228    228    23    1399    1770
1    180    T    Y    0    228    228    22    1410    1781
1    250    A    M    1    21    21    24    452    598
1    257    A    M    1    9    9    25    391    508
1    291    C    Y    0    59    176    25    224    312

Bambino:

NormalSample    TumorSample    Name    Chr    Pos    Type    Size    Coverage    Percent_alternative_allele    Chr_Allele    Alternative_Allele    P-value    Text    unique_alternative_ids    reference_normal_count    reference_tumor_count    alternative_normal_count    alternative_tumor_count    count_ref_normal_fwd    count_ref_normal_rev    count_ref_tumor_fwd    count_ref_tumor_rev    count_var_normal_fwd    count_var_normal_rev    count_var_tumor_fwd    count_var_tumor_rev    alternative_fwd_count    alternative_rev_count    alternative_bidirectional_confirmation    broad_coverage    broad_reference_normal_count    broad_reference_tumor_count    broad_alternative_normal_count    broad_alternative_tumor_count    unique_alt_read_starts    unique_alt_read_starts_fwd    unique_alt_read_starts_rev
P7norm.bam    P7tum.bam    chr1.109    chr1    109    SNP    1    2808    0.30477223    A    T    1.0    CCTAACCCTAACCCTAACCC[A/T]ACCCTAACCCTAACCCTAAC    796    1113    810    484    359    570    543    404    406    217    267    170    189    387    456    1    3760    1513    1122    661    464    89    80    70
P7norm.bam    P7tum.bam    chr1.147    chr1    147    deletion    1    1302    0.343318    C        1.0        445    410    347    284    163    219    191    182    165    20    264    15    148    35    412    1    1513    520    465    333    195    85    27    76
P7norm.bam    P7tum.bam    chr1.178    chr1    178    insertion    1    1805    0.46371192        C    1.0        837    566    426    477    360    137    429    99    327    19    458    12    348    31    806    1    2491    838    681    544    428    75    25    72
P7norm.bam    P7tum.bam    chr1.231    chr1    231    deletion    1    1578    0.42712295    C        1.0        674    515    337    387    287    58    457    23    314    25    362    6    281    31    643    1    1819    660    460    398    301    75    25    63
P7norm.bam    P7tum.bam    chr1.248    chr1    248    deletion    1    814    0.4152334    A        1.0        338    243    183    194    144    31    212    15    168    26    168    12    132    38    300    1    931    324    248    207    152    77    26    66
P7norm.bam    P7tum.bam    chr1.255    chr1    255    deletion    1    509    0.45972496    A        1.0        234    149    96    123    111    34    115    11    85    14    109    11    100    25    209    1    625    224    147    135    119    69    20    60
P7norm.bam    P7tum.bam    chr1.304    chr1    304    insertion    1    83    0.25301206        T    1.0        21    44    19    14    7    14    30    3    16    8    6    4    3    12    9    1    380    199    158    16    7    19    12    9
P

Varscan (Indel):

chrom    position    ref    var    normal_reads1    normal_reads2    normal_var_freq    normal_gt    tumor_reads1    tumor_reads2    tumor_var_freq    tumor_gt    somatic_status    variant_p_value    somatic_p_value    tumor_reads1_plus    tumor_reads1_minus    tumor_reads2_plus    tumor_reads2_minus
1    146    A    -C    533    270    33.62%    */-C    409    161    28.25%    */-C    Germline    5.947213588506911E-148    0.9853916902840889    165    244    10    151
1    177    A    +C    436    295    40.36%    */+C    344    234    40.48%    */+C    Germline    2.573218751523094E-189    0.5036502499047071    49    295    9    225
1    230    A    -C    613    218    26.23%    */-C    436    162    27.09%    */-C    Germline    8.727652916131424E-128    0.381197718905699    36    400    5    157
1    247    T    -A    329    156    32.16%    */-A    257    104    28.81%    */-A    Germline    2.9363752902695665E-89    0.869105076258226    31    226    16    88
1    254    T    -A    249    123    33.06%    */-A    165    96    36.78%    */-A    Germline    1.2718779840140742E-76    0.18856251770578447    32    133    19    77
1    329    A    -C    113    26    18.71%    */-C    68    19    21.84%    */-C    Germline    2.491432783762119E-15    0.34120745355261556    28    40    9    10

Varscan (SNP):

chrom    position    ref    var    normal_reads1    normal_reads2    normal_var_freq    normal_gt    tumor_reads1    tumor_reads2    tumor_var_freq    tumor_gt    somatic_status    variant_p_value    somatic_p_value    tumor_reads1_plus    tumor_reads1_minus    tumor_reads2_plus    tumor_reads2_minus
1    109    A    T    749    295    28.26%    W    559    199    26.25%    W    Germline    1.6489449904910897E-166    0.8400429841827789    352    207    121    78
1    180    T    C    463    184    28.44%    Y    379    152    28.63%    Y    Germline    5.019474360557042E-114    0.4973950760340981    49    330    25    127
1    291    C    T    50    97    65.99%    Y    28    67    70.53%    Y    Germline    4.350249434195174E-69    0.27609006399506    7    21    32    35
1    297    C    T    53    83    61.03%    Y    31    59    65.56%    Y    Germline    5.264922068371216E-58    0.29228247724476564    7    24    27    32
1    303    C    T    55    65    54.17%    Y    25    50    66.67%    Y    Germline    5.348385335162754E-46    0.056903610901911816    8    17    24    26
1    309    C    T    49    66    57.39%    Y    25    48    65.75%    Y    Germline    4.640899271738838E-46    0.16096023989817923    7    18    21    27

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

i guess comparing SNP should be straight forward as i look at chromosomal position and the SNP letter (from/to) and if that matches then the SNPs match. Am i correct? I have no idea how to do that for indels? Any comments are highly appreciated.

Thank You All.

merge snp indel output • 8.5k views
ADD COMMENT
2
Entering edit mode
13.5 years ago
lh3 33k

When you come to somatic calls, essentially there is no answer because little is known about the characteristics of somatic mutations for a particular cancer. All the answers above are mainly applicable to whole-genome variant discovery. There might be two very subtle criteria: a) your calls are not clustered; b) you get not more than a few thousand calls.

In general, you cannot just run several callers on the same alignment and expect them to give you a reasonable result. They frequently have >90% false positive rate (FPR). The problem is not caused by the caller, but by the alignment. My suggestion is to do two rounds of alignments using distinct algorithms, for example bwa and bwasw (or stampy+ssaha2). Frequently you will find the intersection is much smaller than one of them. This

ADD COMMENT
1
Entering edit mode
13.5 years ago
mylons ▴ 130

You should really just check the common SNP and indel databases, ie dbSNP. Look at the concordance for each SNP caller. That's an objective way of comparing them.

ADD COMMENT
0
Entering edit mode

And let us know how it goes...

ADD REPLY
1
Entering edit mode
13.5 years ago
Brett Thomas ▴ 300

One tool you can use is PLINK/SEQ. (Disclaimer: I work on the project.) It's under active development, but can be used to compare output if you have VCF or PED files. One method would be to use the v-stats command to calculate, for example, the Ti/Tv ratio. (see the tutorial)

ADD COMMENT
1
Entering edit mode
13.5 years ago
Rm 8.3k

Comparing multiple variant calls to a set of known variant loci and their alleles, such as dbSNP or the universal HapMap database is will be a good option. To obtain intersections of multiple variants use bedtools "intersectBed".

Indels can also be processed in similar fashion as variants; but length of indels may vary a given location as predicted by multiple tools. in that case you can also obtain % of overlap.

ADD COMMENT
0
Entering edit mode

@RM -- i will miss any novel SNPs this way. I was thinking of first intersect and then compare to dbsnp and get a list of known vs unknown.

ADD REPLY
0
Entering edit mode

SNV's unmapped by the dbSNP will be unkwon or novel SNV's...

ADD REPLY
1
Entering edit mode
13.3 years ago
Bo Peng ▴ 10

A new tool that seem to be perfect for this job is "variant tools" varianttools.sourceforge.net). Although some of the formats are not yet supported, you can import your variants using customized format specification. Once all the variants are imported, you can separate them by source using command "vtools select --samples" and create variant tables for each file format. You can then compare these tables using command "vtools compare". A lot more can be done using other commands.

ADD COMMENT
0
Entering edit mode

Any publication planned?

ADD REPLY
0
Entering edit mode

Under review. :-)

ADD REPLY
0
Entering edit mode

Still under review. :-)

We understand that there are many different formats from different calling algorithms so we have developed an "input format specification" format that can be used to describe and import these formats (http://varianttools.sourceforge.net/Main/Documentation). I have just finished formats for Illumina CASAVA 1.8 and I am working on importing our BGI data. Please feel free to try it out. We will be happy to post your .fmt files so that others can use them.

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
13.2 years ago
Bo Peng ▴ 10

We have recently encountered a similar problem and have put together a tutorial on how to import and compare variants called by different platforms: http://varianttools.sourceforge.net/Tutorial/CompareVariants .

ADD COMMENT

Login before adding your answer.

Traffic: 2161 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6