Following on from some advice I received in a previous post, I've attempted to use GraphAligner
and vg surject
to map long reads to the T2T reference genome, and it seems to be resulting in nonsense outputs - a much smaller BAM than a comparable linearly aligned BAM from minimap2
but with 16x more alignments (631.8 million versus 39.6 million, from a source fasta containing 29.8 million reads) and yet when I attempt to call SVs on it using sniffles2
and cuteSV
I get zero results. This makes no sense, so I'm wondering what exactly is going on here?
I'm fairly certain it's not sniffles2
or cuteSV
, they did successfully call variants when used on the output from minimap2
based on the same input data. I'm not sure what I can adjust on the GraphAligner
or vg surject
front, I'm using default settings (and in the case of the former, its presets for vg
-style graph inputs). The one thing I know is going on is that vg surject
is hitting a hard limit on alignment size on some subgraphs and stopping early for some reads - how many, I don't know, because it refuses to give further warnings past the first.
My pipeline (with the data sources and software versions listed) is here, and a sample error log from the GraphAligner
and vg surject
section of the process is here. I'd appreciate advice from anyone who has more experience with these tools - have a sneaking suspicion that I'm running into the same issue of "these tools can't quite do what you're asking of them" from before.
One thing to check out is whether the GAM file from
GraphAligner
looks at all suspicious. To look at a few alignments you can usevg view -aj alns.gam | head
Will do - I looked at a sampling of the BAM alignments wondering the same thing, and they looked similar enough to the ones from the linear BAM just on sight (then again, I wasn't entirely sure what would single an alignment out as faulty, so it's possible I missed something). Is there anything specific I should be looking for in the GAM?
Edit: nothing looks weird (well, I don't really know what "weird" is, but it's at least internally consistent from what I can tell) - I'm thinking there's something weird about the surject step and how it's moving from GAM to BAM. Checked the BAM headers out of curiosity and I'm pretty sure that's not where the problem is. Aside from some messiness from merging, they contain the same flags for chromosome name and length as the
minimap2
-generated BAM, as well as ID tags fromvg
.Edit 2: also to clarify, the alignments in the BAM from this pipeline look similar to what I'm seeing in the linear BAM I made for comparison, but for some reason downstream tools aren't able to call anything on it.
Did you ever solve this?
GraphAligner doesn't retain the base quality scores in the GAM output, see: https://github.com/maickrau/GraphAligner/issues/84
If the variant callers you're using require those scores, then maybe that's why they fail?