I would like to ask for your opinions. I have a mouse WGS data set in which a transgene was inserted at an unidentified location, causing a specific, unexpected phenotype.
We would like to identify the insertion position(s).
I was thinking about trying de novo assembly (SOAPdenovo), but I'm not sure this is the correct approach. With de novo assembly I was hoping to identify the contigs containing the insertion site (it is ~6.2mb in size) and work out where the transgene was lodged in the genome (with mouse as the reference organism).
Do you think this can be a good solution?
Can anyone recommend a better approach or tool for this kind of analysis?
Hi Cameron,
I have tried gridss before (it was still version 1.5.1 back then) and had a mixed but partly good experience with it. We got some nice results that pointed to one specific insertion site (or possibly two different ones; we couldn't quite figure out the results). Do you think I should try again with the new version (v. 2.2.0)? We did exactly what you listed above (merging the genomes, masking the regions in the mouse chromosomes, alignment, SV calling -> VCF file). I was thinking the de novo assembly would give me more straightforward results, or maybe even using your own tool, socrates, to look for exactly that.
Sorry for the delayed response.
I do. V2.0 added single breakend reporting, which can be quite helpful in this sort of analysis. Whilst my collaborators supply an expected construct when engaging me, I've yet to have a project where the construct I've been given has been correct. One transgene included a PhiX component that they forgot to tell me about, another collaborator sent me the full sequence of the human gene they'd inserted, so I had to trace through all the exon-to-exon SVs to validate that it was the correct transcript, and so on.
Although single breakend calls have an intrinsically higher FDR than breakpoint calls, they're extremely useful in determining a) whether you're missing bits of your construct, and b) whether you have an insertion site in repetitive sequence.
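To make the distinction between the two call types concrete, here is a minimal Python sketch that classifies VCF ALT alleles as breakpoints (bracketed mate notation such as `t[p[` or `]p]t`) or single breakends (a leading or trailing `.`), following the breakend notation in the VCF specification. The contig name `transgene` and the function names are placeholders I made up for illustration; they are not part of GRIDSS, and your construct contig will be named whatever you called it in the merged reference.

```python
import re

# Breakpoint ALT alleles in VCF breakend notation: t[p[, t]p], ]p]t, [p[t
BREAKPOINT_ALT = re.compile(
    r"^[^\[\]]*(?P<bracket>[\[\]])(?P<chrom>.+):(?P<pos>\d+)(?P=bracket)[^\[\]]*$"
)


def classify_alt(alt):
    """Return (kind, mate_chrom, mate_pos) for one ALT allele.

    kind is 'breakpoint', 'single_breakend', or 'other'.
    """
    match = BREAKPOINT_ALT.match(alt)
    if match:
        return "breakpoint", match.group("chrom"), int(match.group("pos"))
    # Single breakends are written as bases plus a '.' on one side, e.g. "G." or ".A"
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return "single_breakend", None, None
    return "other", None, None


def calls_touching(vcf_lines, contig):
    """Yield (chrom, pos, kind, filter) for calls with either end on `contig`."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt, filt = fields[0], int(fields[1]), fields[4], fields[6]
        kind, mate_chrom, _ = classify_alt(alt)
        if chrom == contig or mate_chrom == contig:
            yield chrom, pos, kind, filt
```

For real data you would pass it the lines of the caller's VCF, for example `list(calls_touching(open("calls.vcf"), "transgene"))`, where `calls.vcf` is a hypothetical file name. Single breakends that start on the construct contig show you where the construct continues into sequence the aligner could not place.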
You'll still need to do the post-assembly steps of identifying the contigs containing the construct and aligning the contigs back to the reference. If you have multiple insertion sites, this will result in branches in the assembly graph, which will split your contigs at the insertion sites, putting you right back where you started.
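As a rough idea of what the contig-identification step involves, here is a minimal Python sketch that flags assembled contigs sharing exact k-mers with the construct on either strand. This is a toy stand-in I am using for illustration, not what any assembler or GRIDSS does; in practice you would align the construct against the contigs with a proper aligner. The function names and the `k` and `min_shared` defaults are my own assumptions.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T only, case preserved)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmer_set(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contigs_with_construct(contigs, construct, k=25, min_shared=5):
    """Return {contig_name: shared_kmer_count} for contigs that share at
    least `min_shared` k-mers with the construct on either strand.

    contigs: dict mapping contig name to sequence.
    """
    construct_kmers = kmer_set(construct, k) | kmer_set(reverse_complement(construct), k)
    hits = {}
    for name, seq in contigs.items():
        shared = len(kmer_set(seq, k) & construct_kmers)
        if shared >= min_shared:
            hits[name] = shared
    return hits
```

Contigs that come back from this screen but contain only part of the construct are the ones worth aligning to the mouse reference, since the non-construct part is where the flanking genomic sequence would be.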
Hi Daniel, sorry for the late reply and thanks for answering me. I have done the analysis with the new version and got a VCF file. But the results don't differ much from the older run:
I removed all the rows with results from endogenous chromosomal regions and left only the two possible insertion positions on chromosome 7. This, though, points to complex insertion behavior. The LOW_QUAL rows might be the result of low coverage in this region. The results seem to hint at a complex insertion of the transgene combined with duplication of several genomic parts. Unfortunately, we can't really identify the correct structure after the insertion.
That does seem unusual. If you don't have any other compensating breakpoints, those calls indicate that the transgene is inserted on a double minute containing it and 7:28985942-36210528. The next steps I'd take would be to: