Hello community!
I just started with bioinformatics, so I hope you can help me understand some basic things. I have 100 bp paired-end RNA-Seq reads from a lepidopteran insect for which no genome is available. In total there were two treatments and one control, so for the assembly I merged all of them into two files (R1 and R2). The goal is to make an assembly and then continue with mapping, counting and statistics.
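(For completeness, the merging was just a plain concatenation, roughly like the sketch below; the filenames are hypothetical, and the important part is that R1 and R2 are concatenated in the same sample order so the read pairing stays intact.)

```bash
# Pool all conditions into one read set per mate; use the same sample order
# in both commands, otherwise the R1/R2 pairing breaks (filenames are hypothetical)
cat control_R1.fastq treat1_R1.fastq treat2_R1.fastq > all_R1.fastq
cat control_R2.fastq treat1_R2.fastq treat2_R2.fastq > all_R2.fastq
```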
For the assembly I used the following tools: Trinity, Velvet/Oases, Bridger and the CLC assembler. From the last one I simply received a ready-made assembly, against which I compared the other assemblers.
- Trinity was executed with default parameters (k=25),
- Velvet/Oases: velveth/velvetg with k = 21,51,2 (i.e. every second k from 21 to 51); for each assembly I ran Oases and compared the N50 and the total number of contigs. Based on that I kept the assemblies for k = 25 to 39 and merged them following the Oases merge procedure (velveth/velvetg with merge k-mer 35 on the Oases transcripts, then Oases with the merge option).
- Bridger: default parameters, run separately with k-mers 25 and 27 (rough command sketches for all three are below).
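For reference, the invocations looked roughly like this; directory names and resource settings are placeholders, and the exact flags should be checked against each tool's manual:

```bash
# Trinity with defaults (fixed k=25); CPU/memory values are placeholders
Trinity --seqType fq --left all_R1.fastq --right all_R2.fastq \
        --CPU 8 --max_memory 50G --output trinity_out

# Velvet/Oases: velveth builds one directory per k in the 21,51,2 range;
# velvetg needs read tracking (-read_trkg yes) for Oases to work
velveth velvet_out 21,51,2 -shortPaired -fastq -separate all_R1.fastq all_R2.fastq
for d in velvet_out_*; do
    velvetg "$d" -read_trkg yes
    oases "$d"
done

# Oases merge recipe for the kept k=25..39 assemblies, merge k-mer 35
velveth merged 35 -long velvet_out_{25,27,29,31,33,35,37,39}/transcripts.fa
velvetg merged -read_trkg yes -conserveLong yes
oases merged -merge yes

# Bridger, one run per k-mer size
Bridger.pl --seqType fq --left all_R1.fastq --right all_R2.fastq --kmer_length 25
Bridger.pl --seqType fq --left all_R1.fastq --right all_R2.fastq --kmer_length 27
```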
Afterwards I compared the assemblies with QUAST and got the following table:
| Metric              | Oases    | Trinity  | Bridger 25 | Bridger 27 | CLC      |
|---------------------|----------|----------|------------|------------|----------|
| # contigs           | 10706    | 15579    | 14881      | 14632      | 19985    |
| # contigs (>= 0 bp) | 36576    | 40429    | 37011      | 36162      | 35589    |
| Largest contig      | 27732    | 18062    | 27729      | 27887      | 40347    |
| Total length        | 12451630 | 19701362 | 20030913   | 19559912   | 36090599 |
| N50                 | 1324     | 1528     | 1653       | 1651       | 2584     |
| # N's per 100 kbp   | 0        | 0        | 0.05       | 0.21       | 212.61   |
| # N's               | 0        | 0        | 11         | 41         | 76733    |
So I could not reach the N50 of CLC using default parameters, but at least I can see how they reach it: the number of N's in the CLC assembly is extremely high. So I was wondering, is it really worth increasing the total length and N50 by stuffing that many N's into the contigs?
And the next question: for some reason I thought it would be a good idea to increase the amount of input sequence by concatenating R1 with itself and R2 with itself, basically doubling the amount of data the assembler has to deal with (sketch below). So far I have tested only Bridger (k = 25), but the results are surprising: the N50 increased to 1720, and the total length and the largest contig became almost the same as in the CLC results. What I think happened is that, while the graph was being built, low-quality k-mers had twice the coverage and therefore a better chance of surviving, thus increasing the contig length.
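To be explicit, the "doubling" was nothing more than concatenating each file with itself (hypothetical sketch); every k-mer count simply doubles, so if the assembler uses a fixed coverage cutoff, an error k-mer seen, say, 2x before (below a cutoff of 3) is seen 4x after and passes the filter:

```bash
# Concatenate each mate file with itself: every read, and hence every k-mer,
# now appears twice as often, which can let low-coverage error k-mers pass
# a fixed coverage cutoff (filenames are hypothetical)
cat all_R1.fastq all_R1.fastq > doubled_R1.fastq
cat all_R2.fastq all_R2.fastq > doubled_R2.fastq
```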
Thanks in advance for any kind of response!
Increasing N50 is not always good, and this is particularly true for transcriptome assemblies. Similarly, it is not always helpful to feed more data to the assemblers; if you want to use several FASTQ files, consider performing digital normalization first.
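One common implementation is khmer's normalize-by-median.py; a minimal sketch for paired reads, using khmer 2.x-style flags (the k-mer size, coverage cutoff and memory cap are illustrative values, not recommendations):

```bash
# Interleave the pairs, then discard pairs whose median k-mer coverage
# already exceeds the cutoff (-C); -M caps the counting-table memory
interleave-reads.py all_R1.fastq all_R2.fastq -o interleaved.fastq
normalize-by-median.py -p -k 20 -C 20 -M 8e9 -o normalized.fastq interleaved.fastq
```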
A good way of evaluating transcriptome assemblies is Transrate: it maps the reads back to each assembly and calculates statistics to rank the assemblies.
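Running it is roughly a one-liner per assembly (filenames here are placeholders; --assembly also accepts a comma-separated list, so several assemblies can be ranked in one run):

```bash
# Map the reads back to the assembly and compute the Transrate score
transrate --assembly bridger_k25.fasta \
          --left all_R1.fastq --right all_R2.fastq \
          --threads 8
```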
edit: are you using QUAST or rnaQUAST?
Thanks for the answer, and for the tip with Transrate, I will definitely try it. I used QUAST; as far as I understand, it just goes through the whole assembly and calculates statistics without mapping the reads back to it. Some kind of raw statistical measurement.
Yeah, QUAST just gives basic contig statistics. Make sure you use --scaffolds if the assembly contains contigs joined with N's.
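A minimal example, assuming the assembly FASTA files from above (filenames are placeholders):

```bash
# -s/--scaffolds makes QUAST additionally break sequences at runs of N's
# and report the "broken" contig statistics next to the scaffold statistics
quast.py --scaffolds -o quast_out \
    oases.fasta trinity.fasta bridger_k25.fasta bridger_k27.fasta clc.fasta
```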