Best strategy starting from allelic chromosomes instead of reference plus VCF
0
0
Entering edit mode
4.2 years ago
alfonsosaera ▴ 50

Hi all,

I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.

Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:

>chr1.1
82459472
>chr1.2
86910131
>chr1.3
79881340
>chr1.4
88815615
>chr2.1
76462061
>chr2.2
74215936
>chr2.3
76375162
>chr2.4
76750018
>chr3.1
93149498
>chr3.2
93375939
>chr3.3
96157931
>chr3.4
100414524
>chr4.1
90245664
>chr4.2
93947428
>chr4.3
90228617
>chr4.4
90896203
>chr5.1
81211777
>chr5.2
84165483
>chr5.3
80712490
>chr5.4
78626892
>chr6.1
80303593
>chr6.2
89579199
>chr6.3
84649260
>chr6.4
64534737
>chr7.1
88407277
>chr7.2
93528358
>chr7.3
91580142
>chr7.4
94657719
>chr8.1
87242343
>chr8.2
84274390
>chr8.3
82440740
>chr8.4
81801543

I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:

1.- Generate a graph for each chromosome set and then combine them.

Based on the github page I though about something like:

./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg

But reading the wiki, a modification of this seems better:

vg msga -f MHC.fa \ # use MHC.fa as our input
    -b ref \        # the >ref sequence is our basis
    -B 128 \        # align 128-mer bands of the sequences
    -K 11 \         # use 16-mers as the basis for our indexing
    -X 2 \          # double twice in GCSA2, generating a 44-mer deBruijn graph
    -E 3 \          # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp
    -G -S 0.95 \    # greedily accept alignments when they match at 95% of our maximum alignment score
    -H 5 \          # ignore MEMs which have more than 5 hits
    -D \            # basic debugging output, describing progression through the process
    >MHC.vg         # write the output to MHC.vg

vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg

How do I set parameters like -B, -K, -E ...?

Now I would have to fuse/concatenate the 8 chromosome_graph.vg to get a genome_graph.vg using vg concat

2.- Augment the graph with info from DNA-seq samples

However reading the github page I am not sure how to do it.

3.- Map RNA-seq samples to augmented graph

I am thinking in a modification of this:

# map reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam

# or alternatively, surject them to BAM in the call to map
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam

4.- Perform allele specific expression

My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:

gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table

Thanks for getting to here, I have 3 general questions:

1.- Do you find this approach suitable? would it be better to focus only in coding regions?

2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?

3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?

4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?

Thank you very much!!!!

Alfonso

vg • 898 views
ADD COMMENT

Login before adding your answer.

Traffic: 2152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6