Assembling a new reference for a modified E. coli strain
8.1 years ago

Hi everyone!

I'm quite new to genomics, so sorry if this question seems silly (or just cheeky). I have looked for similar questions but haven't found any (though I'm terrible at searching).

So, I have Illumina reads for an E. coli strain that has GFP inserted into it. I know roughly where the GFP should be, and I have the reference genome for the unmodified E. coli strain. I want to assemble a new reference for my modified strain.

From what I've seen so far, the idea would be to do a de novo assembly, then some scaffolding and mapping to the reference I have, using something like Mauve. But as my sequences map almost perfectly to the reference, and the only important breakpoint is the GFP insertion, can't I use this information to improve my assembly? If so, how?

Thanks! Pablo

Assembly genome next-gen • 1.8k views

What kind of sequence data do you have and what theoretical coverage do you expect?
Assembling just your data is likely to give you a good assembly as long as the coverage of the data is good and the libraries were well made.
Is the "reference" genome from the unmodified strain the strain you started this experiment with? The reason I ask is that strains used over time in labs often "drift" from the references available in NCBI and may not remain identical. If you suspect that to be the case, you may need to sequence your "unmodified" strain as well.
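
For the theoretical coverage asked about here, a common back-of-the-envelope estimate is total sequenced bases divided by genome length. The sketch below assumes that simple formula; the read count, read length and the roughly 4.6 Mb genome size are hypothetical placeholders, not values from this thread.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Theoretical mean depth: total sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Hypothetical run: 1.84 million reads of 150 bp against a ~4.6 Mb genome.
coverage = expected_coverage(1_840_000, 150, 4_600_000)
print(f"{coverage:.1f}x")  # prints 60.0x
```

The depth observed after mapping will usually be somewhat lower than this figure, since not every read maps.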


Yes, the unmodified genome is from the same strain we started the experiment with. Actually, if I map my reads to the reference genome for the unmodified strain, I get a great alignment (almost no SNPs), and the only part that's strange is where the GFP should be, because the aligner (I've used BWA) can't map the reads in that fragment.

The data is DNA, if that's what you're referring to. The coverage I get after mapping to the reference is 60 on average.


It should be easy to locate the insertion, then. You will want to confirm the exact location with PCR (and some Sanger sequencing).


I've located the insertion, but I'm not sure how to proceed now. My idea is to divide my original reference into two parts: before and after the insertion. Then I would map my reads to the three sequences, (1) E. coli before the insertion, (2) E. coli after the insertion and (3) GFP, and assemble these three together somehow. Is that a good way to go?
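
Once the insertion coordinate is known exactly, the three-piece idea reduces to splicing the GFP sequence into the reference string. A minimal sketch, assuming a 0-based coordinate and an optional number of reference bases replaced by the construct; `splice_insert` and the toy sequences are hypothetical names for illustration only.

```python
def splice_insert(reference: str, insert: str, position: int, deleted: int = 0) -> str:
    """Return the reference with `insert` placed after its first `position` bases.

    `deleted` reference bases immediately after `position` are dropped, which
    covers the case where the construct replaced a few bases of the target.
    """
    if not 0 <= position <= len(reference):
        raise ValueError("position is outside the reference")
    if deleted < 0 or position + deleted > len(reference):
        raise ValueError("deleted span is outside the reference")
    before = reference[:position]           # (1) reference before the insertion
    after = reference[position + deleted:]  # (2) reference after the insertion
    return before + insert + after          # (3) the insert goes in between


toy_reference = "AAAACCCCGGGG"
toy_gfp = "TTT"
print(splice_insert(toy_reference, toy_gfp, 4))             # AAAATTTCCCCGGGG
print(splice_insert(toy_reference, toy_gfp, 4, deleted=4))  # AAAATTTGGGG
```

Mapping the reads back to the spliced sequence is then a reasonable check: reads should cover both junctions without gaps or clipping.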


The insertion is a single copy of GFP?
If so, you must have been able to precisely locate the point in the original sequence where it got inserted. Is that the case when you say "you have located the insertion"?
Unless you expect the genome to be rearranged elsewhere, shouldn't the "mutant" (for lack of a better descriptor) be identical to the original, with GFP inserted into the spot you identified?
In case you have not done this, you could align your reads to GFP, find the part of each read that does not match GFP, and then use that part to locate the precise point of insertion on the original genome.
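
One way to get at "the part of each read that does not match GFP" is to read the soft-clipped bases off each alignment's CIGAR string. The sketch below assumes the reads were aligned to GFP with an aligner that soft-clips unaligned ends (BWA-MEM does this) and that the CIGAR and SEQ fields have already been pulled out of the SAM records; the function name and the toy read are hypothetical.

```python
import re
from typing import Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_ends(cigar: str, seq: str) -> Tuple[str, str]:
    """Return (left_clip, right_clip) bases of a read; '' where there is no soft clip."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not stored in SEQ, so ignore them at either end.
    core = [item for item in ops if item[1] != "H"]
    left = right = ""
    if core and core[0][1] == "S":
        left = seq[:core[0][0]]
    if len(core) > 1 and core[-1][1] == "S":
        right = seq[len(seq) - core[-1][0]:]
    return left, right


# Toy read: 8 bases hang off the start of GFP, 6 bases align to it.
print(soft_clipped_ends("8S6M", "ACGTACGTAAGGCC"))  # ('ACGTACGT', '')
print(soft_clipped_ends("6M8S", "ACGTACGTAAGGCC"))  # ('', 'GTAAGGCC')
```

The clipped bases from reads at the two ends of GFP should be genomic sequence, which can then be searched for in the original reference.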


> The insertion is a single copy of GFP?

Yes.

> If so, you must have been able to precisely locate the point in the original sequence where it got inserted. Is that the case when you say "you have located the insertion"?

I don't know the exact base, but I have a region of around 10 bp in the reference genome to which my reads don't map. And it's in the gene where I know the GFP was inserted.

> Unless you expect the genome to be rearranged elsewhere, shouldn't the "mutant" (for lack of a better descriptor) be identical to the original, with GFP inserted into the spot you identified?

Yes, it should be identical.

> In case you have not done this, you could align your reads to GFP, find the part of each read that does not match GFP, and then use that part to locate the precise point of insertion on the original genome.

Yes, doing that leaves me with a 10 bp region where I know the GFP is inserted. But I think the GFP was inserted in a way that changed the target protein, so maybe the region around the insertion doesn't map exactly to the reference? So I guess I have to do a reference-guided assembly using both my reference genome and the GFP sequence.


> But I think the GFP was inserted in a way that changed the target protein, so maybe the region around the insertion doesn't map exactly to the reference? So I guess I have to do a reference-guided assembly using both my reference genome and the GFP sequence.

But if the insertion occurred without major changes to the sequence (either of GFP or of your strain), you should be able to map the junction using reads that contain those insertion points. You must have reads that are part GFP and part your strain from the two ends of the insertion?

Would it not be more precise to sequence (and confirm) the insertion junction using PCR and Sanger sequencing?


> But if the insertion occurred without major changes to the sequence (either of GFP or of your strain), you should be able to map the junction using reads that contain those insertion points. You must have reads that are part GFP and part your strain from the two ends of the insertion?

Yes, I hope I can do that. It's just that I'm so new to this that I wasn't sure exactly how to do it (computationally, I mean).

> Would it not be more precise to sequence (and confirm) the insertion junction using PCR and Sanger sequencing?

I guess, but we have enough sequencing data to infer this from the reads. It's just that I don't know how to do it. Anyway, thanks a lot for your answers, you're being really helpful :)


I am not sure what aligner you are using, but try this. Align to GFP alone. Use something like IGV to look at the alignments (you may have to right-click and choose "show all bases"). The reads that align to the very beginning and the very end of GFP should contain part of the bacterial reference sequence. If you have enough coverage (it sounds like you do), you will be able to find the exact insertion point by looking at the pileup of reads and the sequence consensus.
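
The same pileup reasoning can be sketched in code rather than by eye. The example below assumes a list of left soft clips from reads aligned at the start of GFP (so every clip ends at the junction), takes a per-column majority vote, and looks for the consensus in the reference by exact forward-strand matching only. The function names, the `min_depth` threshold and the toy data are all hypothetical.

```python
from collections import Counter
from typing import List


def left_flank_consensus(clips: List[str], min_depth: int = 2) -> str:
    """Majority-vote consensus of left soft clips, which all end at the junction."""
    longest = max((len(clip) for clip in clips), default=0)
    bases = []
    # Walk away from the junction: offset 1 is the base right next to GFP.
    for offset in range(1, longest + 1):
        column = Counter(clip[-offset] for clip in clips if len(clip) >= offset)
        if sum(column.values()) < min_depth:
            break  # too few reads reach this far to trust the call
        bases.append(column.most_common(1)[0][0])
    return "".join(reversed(bases))


def insertion_point(reference: str, flank: str) -> int:
    """0-based index of the first reference base after the flank, or -1 if not found."""
    if not flank:
        return -1
    start = reference.find(flank)
    return -1 if start < 0 else start + len(flank)


toy_reference = "TTGACCATGCAAGT"
toy_clips = ["ACCATG", "CCATG", "GACCATG", "CATC"]  # last clip carries one error
flank = left_flank_consensus(toy_clips)
print(flank)                                  # ACCATG
print(insertion_point(toy_reference, flank))  # 9
```

A real script would also need to handle the right-hand junction, the reverse strand, and flanks that occur more than once in the genome.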


Thanks! I did this and, after some tweaking (the GFP sequence I was using from NCBI had to be reversed for things to make sense), I've succeeded in assembling the whole genome. Thanks for your help!
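
If "reversed" here means reverse-complemented (an assumption: that is the usual fix when an insert sits on the opposite strand from the database entry), the operation is short to write down. The sequence below is a toy string, not the actual GFP record.

```python
# Translation table for the four bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


toy_insert = "ATGGTGAGCAAGGGC"
print(reverse_complement(toy_insert))  # GCCCTTGCTCACCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check.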

8.1 years ago
5heikki 11k

It's called reference-guided assembly. Many assemblers can do it.


Thanks, I'll look into that ;)

However, I was under the impression that reference guided assembly used only one reference. And I wanted to know if there is any way I can include the sequence of GFP as help for the assembly too.
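
One possible way to bring GFP in as a second sequence, offered as an assumption rather than something confirmed in this thread, is to put both sequences into a single multi-record FASTA file, since read mappers such as BWA can index a FASTA with more than one record. Whether a given reference-guided assembler makes good use of the extra record would need checking. The record names and toy sequences below are hypothetical.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (name, sequence) pairs as multi-record FASTA text."""
    if width <= 0:
        raise ValueError("width must be positive")
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Wrap each sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


combined = format_fasta([("ecoli_unmodified", "ACGTACGT"), ("gfp", "TTTT")], width=4)
print(combined, end="")
```

Reads spanning a junction would then tend to align partly to one record and partly to the other, which is another way to spot the insertion site.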
