Hello,
I generated a series of contigs using SPAdes. But how do I get a single sequence out of the thousands of entries I got? For instance, I have:
>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
...
How do I know if node 2 is downstream (at the 3' end) of node 1? Is there a simple way to discard low-coverage (or short, e.g. <100 bp) contigs? (Or does it not matter if the low-coverage ones are there?)
And how do I merge the nodes into a single FASTA so that I can BLAST it? I got the de Bruijn graph, generated in Bandage (in red, the contigs with depth above 1000):
Would it help to build a single FASTA sequence?
Thank you
I don't know what the reference sequence is; the trick is, in fact, to find out where all these contigs belong. A single sequence is simpler to handle than thousands of them.
Why is the order of contigs important to you? The vast majority of genome assemblies in the NCBI GenBank are "contig level". I don't know what you're trying to achieve with BLAST, but assuming that you are using it from the command line, it makes little difference whether you have a contig-level assembly, one complete chromosome, or anything else.
The point is to provide a single genome. As it is, I only have a pile of fragments and I don't know where they belong...
Unless you sequence again with longer reads or mate pairs, that is how things will remain, and there's nothing wrong with it. People work with contig-level assemblies. For the vast majority of things we do with DNA sequence, it really doesn't matter at all whether we have a contig-level assembly or something else. In my local GenBank Bacteria genome assembly database the distribution of assembly levels is like this:
So in only about 3% of the genomes people can tell where exactly everything is relative to everything else. So what?
Fair enough. I guess I'll have to BLAST the contigs and see if I can find a reference genome. Thanks
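On the side question of discarding short or low-coverage contigs before BLASTing: here is a minimal sketch in plain Python, assuming the headers follow the NODE_<n>_length_<len>_cov_<cov> pattern shown in the question. The function names (read_fasta, filter_contigs) and the thresholds (min_length, min_cov) are made up for illustration and are not recommended values; pick cutoffs that suit your data.

```python
import re

# Header pattern as seen in the question, e.g. NODE_2_length_78_cov_4169.000000
HEADER_RE = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=100, min_cov=5.0):
    """Keep contigs at least min_length long with coverage >= min_cov.

    Both thresholds are hypothetical defaults, not recommendations.
    Records whose header does not match the NODE pattern are skipped.
    """
    kept = []
    for header, seq in read_fasta(lines):
        match = HEADER_RE.match(header)
        if match is None:
            continue
        coverage = float(match.group(3))
        # Use the actual sequence length rather than trusting the header.
        if len(seq) >= min_length and coverage >= min_cov:
            kept.append((header, seq))
    return kept


# The two records from the question, used as a tiny example.
example = """>NODE_1_length_79_cov_5.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAGA
>NODE_2_length_78_cov_4169.000000
TGGATTACAAAGTTACCTGTCAAACGGTGCAATGAAGCCAAGTTAGAACTCGTCAGAATG
AATATTATCAAGCAGCAA
""".splitlines()

# With a coverage cutoff of 1000 (the depth highlighted in the Bandage plot)
# and a deliberately low length cutoff, only NODE_2 survives.
high_depth = filter_contigs(example, min_length=70, min_cov=1000.0)
```

For a real assembly you would pass an open file handle instead of the example list, e.g. `filter_contigs(open("contigs.fasta"))`, and write the kept records back out as FASTA. Note that this only filters contigs; it does not order or join them.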