Hi,
I'm working with paired-end Illumina RNA-seq data from two varieties of a non-model plant (no reference sequence is available). The aim of the work is to find SSRs and SNPs.
The dataset is small: I have only 6 million reads (120 bp) per strain, with one individual sequenced per strain. I started by assembling the data with Trinity, but the resulting assemblies weren't very good. (I ran a separate assembly for each strain; I didn't mix the data.)
Then my advisor suggested that, since there aren't many reads, I could try an overlap-graph assembler such as MIRA. However, the resulting assembly turned out worse: too many isotigs, of smaller size, with fewer BUSCO hits, and 57.62% of the reads (3,575,408) excluded as "debris". Of this debris, 82% (2,866,908 reads) was excluded because of digital normalization.
The manifest file was:
project = G5
job = est,denovo,accurate
parameters = -NW:cmrnl=no
readgroup = G5_paired
data= paired_reads_1.fastq paired_reads_2.fastq
technology = solexa
template_size = 200 -1 exclusion_criterion autorefine
segment_placement = ---> <--- exclusion_criterion
segment_naming = solexa
Later, I read here http://seqanswers.com/forums/showthread.php?t=8210 that one solution could be to run MIRA iteratively: take the isotigs it generated plus the excluded reads (the "debris"), run MIRA again using the isotigs as a reference, and then repeat with that output. But I am not sure whether this would be correct. Also, since that thread is rather old (2010), the "EST" job settings for MIRA may have been improved in the meantime.
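For the iterative idea, the excluded reads would first have to be pulled out of the original FASTQ files. Here is a minimal sketch of how I imagine that step, assuming the debris read names are available in a text file with the read name as the first field on each line, and that those names match the FASTQ headers exactly (including any /1 or /2 suffix). The function names and file names are hypothetical, and this is plain Python, not part of MIRA:

```python
def load_read_names(path):
    """Collect read names from a text file (first whitespace-separated field per line)."""
    names = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip empty lines and comment lines.
            if fields and not fields[0].startswith("#"):
                names.add(fields[0])
    return names


def filter_fastq_by_name(fastq_in, fastq_out, wanted):
    """Copy 4-line FASTQ records whose read name is in `wanted`; return how many were kept."""
    kept = 0
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines between records
            seq = src.readline()
            plus = src.readline()
            qual = src.readline()
            # Read name = header without the leading '@', up to the first whitespace.
            name = header[1:].split()[0]
            if name in wanted:
                dst.write(header + seq + plus + qual)
                kept += 1
    return kept
```

It would be called once per read file, for example `filter_fastq_by_name("paired_reads_1.fastq", "debris_1.fastq", load_read_names("debris_names.txt"))`, where `debris_names.txt` and `debris_1.fastq` are placeholder names. Filtering each mate file separately could leave the two output files out of sync if only one mate of a pair was excluded, so the pairs would need to be checked afterwards.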
So, my questions are: What do you suggest I do? Is the MIRA approach reasonable, or should I drop it? If it is worth keeping, is the iterative alternative correct? Also, could I somehow use the Trinity-assembled contigs with MIRA?
Thanks in advance!