How to collapse de novo contigs into a single sequence?
3.0 years ago

Hello,

I generated a series of contigs using SPAdes. How do I get a single sequence out of the thousands of entries I got? For instance, I have:

>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
...

How do I know if node 2 is downstream (at the 3' end) of node 1? Is there a simple way to discard low coverage (low length e.g. <100) reads? (or does it not matter if the low coverages are there?)

And how do I merge the nodes into a single fasta so that I can BLAST it? I got the de Bruijn graph, generated in Bandage (in red the contigs with depth above 1000):

[Image: de Bruijn graph viewed in Bandage]

Would it help to build a single FASTA sequence?

Thank you

genome de contigs assembly spades bruijn • 2.2k views
3.0 years ago
5heikki 11k

How do I know if node 2 is downstream (at the 3' end) of node 1?

If you have a very similar reference genome, you can map the contigs to it. If not, then without further sequencing with long reads and/or mate pairs, you can't really know.

Is there a simple way to discard low coverage (low length e.g. <100) reads?

Yes. For example, you can first linearize your FASTA (one sequence per line) and then filter on the length and coverage values in the headers.
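A minimal Python sketch of that kind of header-based filtering, assuming the headers follow the `NODE_<n>_length_<len>_cov_<cov>` pattern shown in the question. The function names and the thresholds (`min_length`, `min_cov`) are made up for illustration; pick cutoffs that make sense for your own data.

```python
import re

# Assumed header layout, taken from the question: NODE_1_length_79_cov_5.000000
HEADER_RE = re.compile(r"^NODE_\d+_length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=100, min_cov=10.0):
    """Keep contigs whose header reports enough length and coverage.

    The default thresholds are illustrative, not recommendations.
    """
    kept = []
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue  # header does not follow the assumed pattern; skipped here
        length, cov = int(match.group(1)), float(match.group(2))
        if length >= min_length and cov >= min_cov:
            kept.append((header, seq))
    return kept


# The two records from the question, used as a tiny test input.
example = """>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
"""

# Relaxed length cutoff so the short example contigs are not all dropped.
kept = filter_contigs(read_fasta(example.splitlines()), min_length=50, min_cov=10.0)
for header, seq in kept:
    print(f">{header}\n{seq}")  # one sequence per line, i.e. linearized
```

With these cutoffs only NODE_2 is printed, since NODE_1 has a coverage of 5. With the default `min_length=100` both example contigs would be dropped.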

And how do I merge the nodes into a single fasta so that I can BLAST it?

Why couldn't you blast without merging them?

I don't know what the reference sequence is; the trick is, in fact, to find out where all these contigs belong. A single sequence is simpler to handle than thousands of them.

Why is the order of contigs important to you? The vast majority of genome assemblies in NCBI GenBank are "contig level". I don't know what you're trying to achieve with BLAST, but assuming that you are using it from the command line, it makes little difference whether you have a contig-level assembly, one complete chromosome, or whatever.

The point is to provide a single genome. As it is, I only have a pile of fragments and I don't know where they belong...

Unless you sequence again with longer reads or mate pairs that is how things will remain and there's nothing wrong with it. People work with contig-level assemblies. For the vast majority of things we do with DNA sequence, it really doesn't matter at all whether we have a contig-level assembly or something else. In my local GenBank Bacteria genome assembly database the distribution of assembly levels is like this:

816111 Contig
134727 Scaffold
25735  Complete Genome
4258   Chromosome

So in only about 3% of the genomes people can tell where exactly everything is relative to everything else. So what?
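For anyone who wants to check that figure, here is the arithmetic as a short Python sketch. It assumes that "Complete Genome" and "Chromosome" are the levels where the order of everything is known; that grouping is one reading of the counts above, not something the database states.

```python
# Assembly-level counts as listed above.
counts = {
    "Contig": 816111,
    "Scaffold": 134727,
    "Complete Genome": 25735,
    "Chromosome": 4258,
}

total = sum(counts.values())

# Assumption: only these two levels count as fully ordered assemblies.
resolved = counts["Complete Genome"] + counts["Chromosome"]

share = 100 * resolved / total
print(f"{resolved} of {total} assemblies ({share:.1f}%)")
```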

Fair enough. I guess I'll have to BLAST the contigs and see if I can find a reference genome. Thanks.
