Hello Everyone,
Very recently I completed a de novo assembly with rnaSPAdes (k-mer size 55) on some data I had, approximately 1 billion reads. However, I am getting a lot of contigs (over 2 million); here is what they look like in terms of length and coverage.
[Coverage distribution plot]
Is there a way to filter these out, or a way to get a reduced number of contigs (a different k-mer length?)?
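One common post-assembly approach, for what it's worth: SPAdes-family assemblers record length and k-mer coverage in every FASTA header (e.g. >NODE_1_length_2000_cov_50.5_g0_i0), so short, low-coverage contigs can be dropped with a small awk script. A minimal sketch, assuming that header format; the 200 bp and 1x cutoffs are placeholders to tune against your own distributions, not recommendations:

    # keep contigs with length >= 200 and coverage >= 1 (illustrative thresholds)
    awk '/^>/ { keep = 0
                n = split($0, f, "_")
                for (i = 1; i <= n; i++) {
                    if (f[i] == "length") len = f[i+1]
                    if (f[i] == "cov")    cov = f[i+1]
                }
                if (len + 0 >= 200 && cov + 0 >= 1) keep = 1 }
         keep' transcripts.fasta > transcripts.filtered.fasta

Note that aggressive coverage filtering can discard genuinely low-expression transcripts, so it is worth checking completeness (e.g. with BUSCO) before and after.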
What is the expected genome size? What kind of data is this (Illumina, cycles, PE)? A billion reads may be overkill for a relatively small genome.
Hello, it is RNA-seq data, Illumina PE, 101 nt. I am trying to assemble different conditions from the same experiment together, since I didn't have the computational resources to do so in the past. As for the size, I was expecting maybe around 200,000 contigs.
As Genomax said, the expected genome size is very important in this case. The clade is also useful, as is the expected ploidy, the full set of preprocessing you did prior to assembly (such as contaminant removal), sample gathering/prep, etc. Basically, the more information, the better; for all I can tell right now, you might be producing an excellent assembly of a plant leaf meta-transcriptome.
Hello Brian, thanks for the reply. So as far as I know:
It's always funny to me when some random "primitive" species has a genome size many times larger than human :) I've heard that there are amoebae with much larger genomes as well (>10 Gbp). Previously, the largest I'd heard of was the Loblolly pine with 22 Gbp, but this takes the cake. Go vertebrates!
So, on-topic: for diploid assemblies, large numbers of contigs are not necessarily unexpected. This would be easier if you had DNA data too. Different organisms have different heterozygosity rates, which, for different assemblers, yield varying numbers of contigs. Fungi, for example, can have 1/30 het rates, which wreak havoc with assemblers. Do you have an idea what the heterozygosity rate of your salamander is?
Also, there are some decontamination procedures that might be useful. And considering they're slimy... did you take any special precautions to remove skin-dwelling organisms? And have you done any sort of digital decontamination?
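As one illustrative option (an assumption about tooling, not necessarily what was meant here): digital decontamination can be done by k-mer matching reads against known contaminant sequences with BBDuk from BBTools. The reference file name and k value below are placeholders:

    # discard read pairs sharing a 31-mer with the contaminant reference
    bbduk.sh in1=reads_1.fq.gz in2=reads_2.fq.gz \
        out1=clean_1.fq.gz out2=clean_2.fq.gz \
        ref=contaminants.fa k=31

Reads matching contaminants.fa are removed; everything else goes to the clean output files, which can then be fed back into the assembler.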
I guess it is always interesting how some organisms hold such big genomes (like Polychaos dubium, which reportedly holds 670 Gbp); what they use it for is anyone's guess.
I do not expect too much heterozygosity, since the organisms used in the lab for this species have been inbreeding since roughly 1890.
As far as I know, no decontamination procedures were followed, since the samples were obtained from embryos. I didn't perform any digital decontamination either.
You may want to go back to Trinity (if you have not done so). rnaSPAdes appears to be pretty new, and a more established program may give you better results.
That said, hardware requirements for Trinity are stiff, and with such a large dataset you are bound to need hundreds of GB of RAM. Consider using Galaxy at Indiana if you don't have the resources locally available.
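For reference, a bare-bones Trinity invocation for paired-end FASTQ data looks roughly like this; the memory and CPU figures are placeholders to match your hardware, and a dataset this size may need considerably more:

    Trinity --seqType fq \
        --left reads_1.fq.gz --right reads_2.fq.gz \
        --max_memory 250G --CPU 16 \
        --output trinity_out_dir

Trinity's own rule of thumb is on the order of 1 GB of RAM per million paired reads, which is why Galaxy or another large-memory machine is the realistic option for a billion reads.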
Just a quick follow-up: I used BUSCO to look for orthologs in the SPAdes and Trinity assemblies. I found that around 1,000 genes are missing in the SPAdes assembly compared to Trinity, but right now I am running a new assembly with a different k-mer size to see if this changes.
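For anyone repeating this kind of comparison: with recent BUSCO versions a transcriptome-mode run is a one-liner (older releases used run_BUSCO.py instead). The lineage dataset below is an assumption; pick the one matching your clade:

    # assess completeness against a lineage's single-copy orthologs
    busco -i transcripts.fasta -l vertebrata_odb10 \
        -m transcriptome -o busco_rnaspades

Running it once per assembly and comparing the complete/fragmented/missing percentages gives a like-for-like completeness comparison.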
I noticed this was around 2.6 years ago. Has rnaSPAdes improved since then?
I went through the changelogs and they seem to have improved the assembler, especially by implementing multi-k-mer assembly. However, I do not know if this has a real impact on assembly quality; it might be worth taking a look.
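For illustration, the SPAdes-family -k option accepts a comma-separated list, so a multi-k rnaSPAdes run can be requested explicitly (recent versions also pick k values automatically when -k is omitted). The k values here are placeholders, not recommendations:

    # multi-k run; 33,55,77 are illustrative values for 101 nt reads
    rnaspades.py -1 reads_1.fq.gz -2 reads_2.fq.gz \
        -k 33,55,77 -o rnaspades_multi_k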