Hi all,
I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.
Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:
>chr1.1
82459472
>chr1.2
86910131
>chr1.3
79881340
>chr1.4
88815615
>chr2.1
76462061
>chr2.2
74215936
>chr2.3
76375162
>chr2.4
76750018
>chr3.1
93149498
>chr3.2
93375939
>chr3.3
96157931
>chr3.4
100414524
>chr4.1
90245664
>chr4.2
93947428
>chr4.3
90228617
>chr4.4
90896203
>chr5.1
81211777
>chr5.2
84165483
>chr5.3
80712490
>chr5.4
78626892
>chr6.1
80303593
>chr6.2
89579199
>chr6.3
84649260
>chr6.4
64534737
>chr7.1
88407277
>chr7.2
93528358
>chr7.3
91580142
>chr7.4
94657719
>chr8.1
87242343
>chr8.2
84274390
>chr8.3
82440740
>chr8.4
81801543
I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:
1.- Generate a graph for each chromosome set and then combine them.
Based on the github page I though about something like:
./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg
But reading the wiki, a modification of this seems better:
vg msga -f MHC.fa \ # use MHC.fa as our input
-b ref \ # the >ref sequence is our basis
-B 128 \ # align 128-mer bands of the sequences
-K 11 \ # use 16-mers as the basis for our indexing
-X 2 \ # double twice in GCSA2, generating a 44-mer deBruijn graph
-E 3 \ # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp
-G -S 0.95 \ # greedily accept alignments when they match at 95% of our maximum alignment score
-H 5 \ # ignore MEMs which have more than 5 hits
-D \ # basic debugging output, describing progression through the process
>MHC.vg # write the output to MHC.vg
vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg
How do I set parameters like -B
, -K
, -E
...?
Now I would have to fuse/concatenate the 8 chromosome_graph.vg to get a genome_graph.vg using vg concat
2.- Augment the graph with info from DNA-seq samples
However reading the github page I am not sure how to do it.
3.- Map RNA-seq samples to augmented graph
I am thinking in a modification of this:
# map reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam
# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam
# or alternatively, surject them to BAM in the call to map
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam
4.- Perform allele specific expression
My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:
gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table
Thanks for getting to here, I have 3 general questions:
1.- Do you find this approach suitable? would it be better to focus only in coding regions?
2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?
3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?
4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?
Thank you very much!!!!
Alfonso