Question

How to collapse de novo contigs into a single sequence?

1

3.0 years ago

marongiu.luigi ▴ 730

Hello,

I generated a series of contigs using SPAdes. But how do I get a single sequence out of the thousands of entries I got? For instance, I have:

>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
...

How do I know if node 2 is downstream (at the 3' end) of node 1? Is there a simple way to discard low coverage (low length e.g. <100) reads? (or does it not matter if the low coverages are there?)

And how do I merge the nodes into a single fasta so that I can BLAST it? I have the de Bruijn graph, drawn in Bandage (contigs with depth above 1000 in red):

[image: Bandage view of the de Bruijn graph]

Would it help to build a single FASTA sequence?

Thank you

genome de contigs assembly spades bruijn • 2.2k views

ADD COMMENT • link 3.0 years ago by marongiu.luigi ▴ 730

score 1 · Answer 1 · 2021-12-14

1

3.0 years ago

5heikki 11k

How do I know if node 2 is downstream (at the 3' end) of node 1?

If you have a very similar reference genome, then you can map the contigs to that. If not, then without further sequencing with long reads and/or mate pairs you can't really know.

Is there a simple way to discard low coverage (low length e.g. <100) reads?

Yes. You can, e.g., first linearize your FASTA (one sequence per line) and then filter based on the headers, since SPAdes writes each contig's length and coverage into its header.
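A minimal sketch of that idea in Python, assuming the headers follow the `NODE_<n>_length_<len>_cov_<cov>` pattern shown in the question. The function names and the default thresholds (`min_len=100`, `min_cov=10.0`) are only illustrative; pick cutoffs that suit your data.

```python
import re
from typing import Iterable, Iterator, Tuple

# Header pattern taken from the question: NODE_<n>_length_<len>_cov_<cov>
HEADER_RE = re.compile(r"^NODE_\d+_length_(\d+)_cov_([\d.]+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_len: int = 100, min_cov: float = 10.0):
    """Keep contigs whose header reports enough length and coverage."""
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue  # header does not follow the assumed pattern
        length, cov = int(match.group(1)), float(match.group(2))
        if length >= min_len and cov >= min_cov:
            yield header, seq


# The two records from the question, used here as toy input.
EXAMPLE = """>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
"""

# A length cutoff of 70 is used only so the toy records are not all dropped.
for header, seq in filter_contigs(read_fasta(EXAMPLE.splitlines()), min_len=70):
    print(f">{header}\n{seq}")  # one sequence per line, i.e. linearized
```

For a real assembly you would pass an open file handle instead of `EXAMPLE.splitlines()` and write the surviving records to a new FASTA file.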

And how do I merge the nodes into a single fasta so that I can BLAST it?

Why couldn't you blast without merging them?

ADD COMMENT • link 3.0 years ago by 5heikki 11k

0

I don't know what the reference sequence is; the trick is, in fact, to find out where all these contigs belong. A single sequence is simpler to handle than thousands of them.

ADD REPLY • link 3.0 years ago by marongiu.luigi ▴ 730

0

Why is the order of contigs important to you? The vast majority of genome assemblies in the NCBI GenBank are "contig level". I don't know what you're trying to achieve with BLAST, but assuming that you are using it from the command line, it makes little difference whether you have a contig-level assembly or one complete chromosome or whatever.
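To illustrate: command-line BLAST+ takes a multi-FASTA file as the query and reports hits per contig, so nothing needs merging first. Below is a rough sketch that only assembles the `blastn` call. The file names `contigs.fasta` and `contigs_vs_nt.tsv`, the choice of the `nt` database and the use of `-remote` are all assumptions for illustration.

```python
from typing import List


def build_blastn_command(
    query_fasta: str,
    database: str = "nt",                 # assumed database name
    out_path: str = "contigs_vs_nt.tsv",  # hypothetical output file
    remote: bool = True,
) -> List[str]:
    """Assemble a blastn call that takes a multi-FASTA query as-is."""
    cmd = [
        "blastn",
        "-query", query_fasta,  # every contig in the file is searched
        "-db", database,
        "-outfmt", "6",         # tabular output; first column is the contig ID
        "-out", out_path,
    ]
    if remote:
        cmd.append("-remote")   # search on NCBI servers instead of a local db
    return cmd


command = build_blastn_command("contigs.fasta")
print(" ".join(command))
```

To actually run the search, the list can be handed to `subprocess.run(command, check=True)` on a machine with BLAST+ installed. With tabular output, each row starts with the query contig's header, so hits can be grouped per contig afterwards.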

ADD REPLY • link 3.0 years ago by 5heikki 11k

0

The point is to provide a single genome. As it is, I only have a pile of fragments and I don't know where they belong...

ADD REPLY • link 3.0 years ago by marongiu.luigi ▴ 730

1

Unless you sequence again with longer reads or mate pairs, that is how things will remain, and there's nothing wrong with it. People work with contig-level assemblies. For the vast majority of things we do with DNA sequence, it really doesn't matter at all whether we have a contig-level assembly or something else. In my local GenBank Bacteria genome assembly database, the distribution of assembly levels is like this:

816111 Contig
134727 Scaffold
25735  Complete Genome
4258   Chromosome

So in only about 3% of the genomes (Complete Genome plus Chromosome) can people tell exactly where everything is relative to everything else. So what?
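The "about 3%" figure can be checked from the counts above with a few lines of Python. Treating Complete Genome plus Chromosome as the fully resolved assemblies is the assumption behind the number.

```python
# Assembly-level counts as listed above.
counts = {
    "Contig": 816111,
    "Scaffold": 134727,
    "Complete Genome": 25735,
    "Chromosome": 4258,
}

total = sum(counts.values())

# Share of each assembly level.
for level, n in counts.items():
    print(f"{level:<16}{n:>8}  {100 * n / total:5.1f}%")

# Assumption: only these two levels count as fully resolved.
resolved = counts["Complete Genome"] + counts["Chromosome"]
resolved_pct = 100 * resolved / total
print(f"resolved: {resolved} of {total} ({resolved_pct:.1f}%)")
```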

ADD REPLY • link 3.0 years ago by 5heikki 11k

0

Fair enough. I guess I'll have to BLAST the contigs and see if I can find a reference genome. Thanks.

ADD REPLY • link 3.0 years ago by marongiu.luigi ▴ 730