denovo-assembly with novel genes inserted
1
0
Entering edit mode
7.8 years ago
aindap ▴ 120

I have a viral strain with non-viral sequences inserted in. I aligned to my reads to a reference strain. The reads that didn't align, I did a denovo assembly with SSAKE to identify my contigs of my non-viral sequences.

What I want is a final assembly of my virus strain where I know the non-viral sequences inserted in. I'm new to assembly and am uncertain how to do this. I would appreciate any help!

Assembly denovo assembly Ilumina • 1.9k views
ADD COMMENT
1
Entering edit mode
7.8 years ago

I would assemble the whole dataset, shred your reference, and find the parts of your assembly that don't match the reference. For example, using BBMap, and assuming you have 2x150bp reads:

Preprocessing:

bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10

bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31

bbmerge.sh in=filtered.fq out=ecco.fq ecco mix strict adapters=default

tadpole.sh in=ecco.fq out=ecct.fq ecc

Is all that really necessary? Well, no, not if you can get a perfect assembly without it; but you usually can't. There's more preprocessing you can do as well, but for a virus, this is probably sufficient. Incidentally, all of the reference fastas mentioned are in /bbmap/resources/

Assembly:

tadpole.sh in=ecct.fq out=contigs.fa k=75

Use Tadpole or SSAKE or whatever gives the best assembly, but Tadpole usually works well for viruses. If your reads are longer than 150bp, use a longer kmer; half the read length is usually fine.

Post-assembly:

shred.sh in=reference.fa length=300 overlap=280 minlength=200 out=shreds.fa

bbmap.sh in=shreds.fa ref=contigs.fa out=mapped.sam bincov=bincov.txt covbinsize=10 ambig=all delcov=f

callvariants.sh in=mapped.sam out=vars.vcf ref=contigs.fa

bincov.txt now contains the locations of your assembly covered by the reference, so the non-covered locations are the inserts. Additionally, vars.vcf will contain the exact insert sequences and positions; they will be considered deletion events (since they are present in the "reference" [contigs.fa] but not the "reads" [shreds.fa]). Of course, some of them will be reverse-complemented since the assembled contigs have a random strand. Mapping the contigs to the original reference might give you the insertions as insert events with the original reference coordinates, so it's worth trying, but it's much easier to call deletions than insertions so that approach is more reliable.

ADD COMMENT
1
Entering edit mode

Thank you so much!!! tadpole worked much faster and I got longer contigs

The bbmap step helped me find the insert point of my novel genes in the virus

I'm new to assembly, but this pipeline was just what I was looking for.

ADD REPLY

Login before adding your answer.

Traffic: 2524 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6