Hello,
DISCOVAR is a tool that can perform variant discovery as well as de novo assemblies.
I am interested in their variant discovery features, because it is very easy to use after successfully running it once. The good thing about their variant discovery is that it performs phasing of nearby SNPs in BLOCKS, and it outputs the BLOCKS with phase information in ".variant" format:
$ head -n 40 ../GDR18/GDR18_NODE1_40000-120000.final.variant
#BLOCK ID PATHS
#VAR ID LOCATION REF ALT PHASE INFO
BLOCK 1 51+|52+
PATH_QSUM 0|0|ref=0
VAR 1 NODE_1_length_193190_coverage_19.3759_GC_24.97:40231 C T -|+ PROB= 1/1
VAR 2 NODE_1_length_193190_coverage_19.3759_GC_24.97:40279 G C -|+ PROB= 1/1
VAR 3 NODE_1_length_193190_coverage_19.3759_GC_24.97:40280 C T -|+ PROB= 1/1
VAR 4 NODE_1_length_193190_coverage_19.3759_GC_24.97:40399 G A -|+ PROB= 1/1
VAR 5 NODE_1_length_193190_coverage_19.3759_GC_24.97:40593 C T -|+ PROB= 1/1
VAR 6 NODE_1_length_193190_coverage_19.3759_GC_24.97:40642 A G -|+ PROB= 1/1
VAR 7 NODE_1_length_193190_coverage_19.3759_GC_24.97:40749 T C -|+ PROB= 1/1
VAR 8 NODE_1_length_193190_coverage_19.3759_GC_24.97:40790 C T -|+ PROB= 1/1
VAR 9 NODE_1_length_193190_coverage_19.3759_GC_24.97:40960 A G +|- PROB= 1/1
VAR 10 NODE_1_length_193190_coverage_19.3759_GC_24.97:40966 A G +|- PROB= 1/1
VAR 11 NODE_1_length_193190_coverage_19.3759_GC_24.97:41083 G A -|+ PROB= 1/1
VAR 12 NODE_1_length_193190_coverage_19.3759_GC_24.97:41103 T A -|+ PROB= 1/1
VAR 13 NODE_1_length_193190_coverage_19.3759_GC_24.97:41138 T A -|+ PROB= 1/1
VAR 14 NODE_1_length_193190_coverage_19.3759_GC_24.97:41243 T C +|- PROB= 1/1
VAR 15 NODE_1_length_193190_coverage_19.3759_GC_24.97:41354 C T -|+ PROB= 1/1
VAR 16 NODE_1_length_193190_coverage_19.3759_GC_24.97:41388 A G -|+ PROB= 1/1
BLOCK 2 112+|113+,109+
PATH_QSUM 0|0|ref=0
VAR 17 NODE_1_length_193190_coverage_19.3759_GC_24.97:41687 T G -|+ PROB= 1/1
VAR 18 NODE_1_length_193190_coverage_19.3759_GC_24.97:41737 T C -|+ PROB= 1/1
VAR 19 NODE_1_length_193190_coverage_19.3759_GC_24.97:41756 CTCT C -|+ PROB= 1/1
VAR 20 NODE_1_length_193190_coverage_19.3759_GC_24.97:41789 G A -|+ PROB= 1/1
VAR 21 NODE_1_length_193190_coverage_19.3759_GC_24.97:41803 G T -|+ PROB= 1/1
VAR 22 NODE_1_length_193190_coverage_19.3759_GC_24.97:41881 T G -|+ PROB= 1/1
VAR 23 NODE_1_length_193190_coverage_19.3759_GC_24.97:41948 T A -|+ PROB= 1/1
BLOCK 3 115+|114+
PATH_QSUM 0|0|ref=0
VAR 24 NODE_1_length_193190_coverage_19.3759_GC_24.97:42482 TAAATGTTT+ T -|+ PROB= 1/1 REF=TAAATGTTTCTATGGATTTAC;ALT=T
VAR 25 NODE_1_length_193190_coverage_19.3759_GC_24.97:42508 A C -|+ PROB= 1/1
VAR 26 NODE_1_length_193190_coverage_19.3759_GC_24.97:42510 T C -|+ PROB= 1/1
VAR 27 NODE_1_length_193190_coverage_19.3759_GC_24.97:42511 T C -|+ PROB= 1/1
VAR 28 NODE_1_length_193190_coverage_19.3759_GC_24.97:42515 A T -|+ PROB= 1/1
VAR 29 NODE_1_length_193190_coverage_19.3759_GC_24.97:42522 A G -|+ PROB= 1/1
VAR 30 NODE_1_length_193190_coverage_19.3759_GC_24.97:42542 G T -|+ PROB= 1/1
I have simplified this format for myself, since I merely care about the phase of SNPs:
$ head -n 25 GDR18_NODE1_40000-120000.variant
#BLOCK ID
#VAR
BLOCK 1
PATH_QSUM 0|0|ref=0
VAR_1 NODE_1_length_193190_coverage_19.3759_GC_24.97:40231 -|+
VAR_2 NODE_1_length_193190_coverage_19.3759_GC_24.97:40279 -|+
VAR_3 NODE_1_length_193190_coverage_19.3759_GC_24.97:40280 -|+
VAR_4 NODE_1_length_193190_coverage_19.3759_GC_24.97:40399 -|+
VAR_5 NODE_1_length_193190_coverage_19.3759_GC_24.97:40593 -|+
VAR_6 NODE_1_length_193190_coverage_19.3759_GC_24.97:40642 -|+
VAR_7 NODE_1_length_193190_coverage_19.3759_GC_24.97:40749 -|+
VAR_8 NODE_1_length_193190_coverage_19.3759_GC_24.97:40790 -|+
VAR_9 NODE_1_length_193190_coverage_19.3759_GC_24.97:40960 +|-
VAR_10 NODE_1_length_193190_coverage_19.3759_GC_24.97:40966 +|-
VAR_11 NODE_1_length_193190_coverage_19.3759_GC_24.97:41083 -|+
VAR_12 NODE_1_length_193190_coverage_19.3759_GC_24.97:41103 -|+
VAR_13 NODE_1_length_193190_coverage_19.3759_GC_24.97:41138 -|+
VAR_14 NODE_1_length_193190_coverage_19.3759_GC_24.97:41243 +|-
VAR_15 NODE_1_length_193190_coverage_19.3759_GC_24.97:41354 -|+
VAR_16 NODE_1_length_193190_coverage_19.3759_GC_24.97:41388 -|+
BLOCK 2
PATH_QSUM 0|0|ref=0
VAR_17 NODE_1_length_193190_coverage_19.3759_GC_24.97:41687 -|+
VAR_18 NODE_1_length_193190_coverage_19.3759_GC_24.97:41737 -|+
...
Now the problem that I have is that I need to compare "variant" files between different samples. I need to look at how phases change, and information being in separate files is not very practical to do this, so I need to merge these variant files.
However, as you may guess, although the samples have 70% of SNPs in the same locations, different coverage of samples and different SNPs will change the BLOCK structure between samples. I am wondering if there is a way to align multiple variant files into one?
For instance, I picture the output to be along these lines:
GDR18_BLOCK_2 GDR23_BLOCK_2
VAR_19 NODE_1_length_193190_coverage_19.3759_GC_24.97:41756 NODE_1_length_193190_coverage_19.3759_GC_24.97:41756
VAR_20 NODE_1_length_193190_coverage_19.3759_GC_24.97:41789 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:41789 -|+
VAR_21 NODE_1_length_193190_coverage_19.3759_GC_24.97:41803 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:41803 -|+
VAR_22 NODE_1_length_193190_coverage_19.3759_GC_24.97:41881 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:41881 -|+
VAR_23 NODE_1_length_193190_coverage_19.3759_GC_24.97:41948 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:41948 -|+
GDR18_BLOCK_3 GDR23_BLOCK_3
VAR_24 NODE_1_length_193190_coverage_19.3759_GC_24.97:42482 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42482 -|+|+
VAR_25 NODE_1_length_193190_coverage_19.3759_GC_24.97:42508 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42508 -|+|+
VAR_26 NODE_1_length_193190_coverage_19.3759_GC_24.97:42510 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42510 -|+|+
VAR_27 NODE_1_length_193190_coverage_19.3759_GC_24.97:42511 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42511 -|+|+
VAR_28 NODE_1_length_193190_coverage_19.3759_GC_24.97:42515 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42515 -|+|+
VAR_29 NODE_1_length_193190_coverage_19.3759_GC_24.97:42522 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42522 -|+|+
VAR_30 NODE_1_length_193190_coverage_19.3759_GC_24.97:42542 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42542 -|+|+
VAR_31 NODE_1_length_193190_coverage_19.3759_GC_24.97:42548 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42548 -|+|+
VAR_32 NODE_1_length_193190_coverage_19.3759_GC_24.97:42594 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42594 -|+|+
VAR_33 NODE_1_length_193190_coverage_19.3759_GC_24.97:42816 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42816 -|+|+
VAR_34 NODE_1_length_193190_coverage_19.3759_GC_24.97:42864 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:42864 -|+|+
NODE_1_length_193190_coverage_19.3759_GC_24.97:42869 -|+|-
NODE_1_length_193190_coverage_19.3759_GC_24.97:42870 -|+|-
GDR18_BLOCK_4 GDR23_BLOCK_4
VAR_35 NODE_1_length_193190_coverage_19.3759_GC_24.97:44623 -|+ NODE_1_length_193190_coverage_19.3759_GC_24.97:44623 -|+
VAR_36 NODE_1_length_193190_coverage_19.3759_GC_24.97:44873 -|+
VAR_37 NODE_1_length_193190_coverage_19.3759_GC_24.97:44948 -|+
Although DISCOVAR generates VCF files, they are not phased, which makes it impossible to use VCFTools for this purpose.
Thank you,
Adrian
Can you use join on the "VAR" field?
No, not really, because VAR are sequential in every isolate/sample (not the same between samples, useless field). I thought that maybe I can use join on the position of the SNP, after the : in NODE_1_length_193190_coverage_19.3759_GC_24.97:41756 but the problem is that I will not know what the block is in other samples.