I am pretty new to genome assembly, and to MIRA3 in particular, and I have a couple of questions about it.
What exactly is the template size in MIRA3? I couldn't find a proper definition of it in the manual. My fragment size before ligating adapters is ~250bp, and after library construction it was found to be ~350bp. My read length is 260bp. In this scenario, what exactly is the template size expected in the configuration file?
I am working with bacterial MiSeq genome data, and when I used MIRA3 to assemble it, I found ~1400 contigs in the final results file. Is that too many? Could this have happened because I specified the wrong template size in the configuration file for the MIRA3 assembly run?
The template size usually refers to the distance between the 5' ends of the two reads in a pair, in other words the length of the DNA between the adaptors. So in your case it would be ~250bp.
I've never used MIRA for Illumina data, so I can't comment on whether 1400 contigs is good or not; obviously it also depends on how repetitive the genome you're trying to sequence is. Another thing to consider is preprocessing your data. From your description, your read length is as long as (or longer than) the template length, which means that your reads may be running into the adaptor sequence. This will cause serious problems for the genome assembler, as many of the reads will end in the same DNA sequence, which doesn't even originate from your genome! I would suggest a tool like SeqPrep, which trims adaptors and filters reads based on quality. After preprocessing you could give assembly another go and see if you get better results. Finally, it never hurts to get a second opinion: there are heaps of de novo assemblers out there for Illumina data, and you could try one and see if you get better results. I would suggest SPAdes, Ray, or Velvet.
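To make the read-through point concrete, here is a minimal Python sketch. It assumes the adaptor shows up as an exact copy of a known prefix, which real trimmers such as SeqPrep do not require, and the default `ADAPTOR_PREFIX` (a prefix commonly used for Illumina TruSeq adaptors) is an assumption that should be checked against the library kit actually used. The function names are made up for illustration.

```python
# Assumed adaptor prefix (commonly used for Illumina TruSeq adaptors);
# verify against your own library prep before relying on it.
ADAPTOR_PREFIX = "AGATCGGAAGAGC"


def adaptor_bases_read(read_length, fragment_length):
    """Number of bases at the 3' end of a read that extend past the fragment."""
    return max(0, read_length - fragment_length)


def trim_at_adaptor(read, adaptor=ADAPTOR_PREFIX):
    """Cut the read at the first exact occurrence of the adaptor prefix.

    Returns the read unchanged if the adaptor is not found.
    """
    position = read.find(adaptor)
    return read if position == -1 else read[:position]


# With the numbers from the question: 260bp reads on a ~250bp fragment.
print(adaptor_bases_read(260, 250))  # 10
```

With 260bp reads on a ~250bp fragment, roughly the last 10 bases of each read would be adaptor rather than genome, which is the situation described above.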
Thanks, cts, for the detailed explanation. After going through some forums, I found that MIRA actually needs two sizes: the actual insert size (which is ~250bp) and the fragment size (which is insert size (250) + total read length (520) = 770). Do you think this is correct? Also, regarding adapter trimming, I found the following in their manual:
"Outside MIRA: for heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave Illumina data alone! (really, I mean it)".
So I assume I cannot just trim the adapters from the reads, then. Please let me know what you think.
So I think that there is some confusion with the terminology that different people use for 'insert size'. Many people (including myself) refer to the insert size as the distance between the 5' ends of the reads which will be the length of the fragment of DNA being sequenced. Other people use the insert size to describe the distance between the 3' ends of the reads, in other words the part of the DNA fragment that is between the two reads (not actually sequenced). So to illustrate this:
>---|                  read1
===================    DNA fragment
            |-----<    read2
>>---------------<<    insert size using definition 1
     >>----<<          insert size using definition 2
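The two definitions in the diagram differ only by the combined read length, which a few lines of Python can show. This is just arithmetic under the assumption that both reads have the same length; `insert_sizes` is a made-up helper name, not part of MIRA or any other tool.

```python
def insert_sizes(fragment_length, read_length):
    """Return (definition 1, definition 2) for a read pair.

    Definition 1: distance between the 5' ends of the reads (whole fragment).
    Definition 2: unsequenced distance between the 3' ends of the reads.
    A negative definition-2 value means the two reads overlap.
    """
    outer = fragment_length
    inner = fragment_length - 2 * read_length
    return outer, inner


print(insert_sizes(500, 100))  # (500, 300)
print(insert_sizes(250, 260))  # (250, -270)
```

A negative second value means there is no unsequenced gap at all: the reads overlap, and in the second case each read is longer than the fragment itself.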
From your original question you say that your fragment size is 250bp and that your read length is 260bp, so to illustrate:
>--------|    read1 (260bp)
==========    DNA fragment (250bp)
|--------<    read2 (260bp)
(Please let me know if this interpretation of your data is wrong.)
So what you've done is sequence the same bit of DNA twice, once with each read. In this case the insert size using MIRA's terminology would be 0 and the fragment size would be 250. However, considering that read1 and read2 will be mostly identical, you could just assemble either one of them as single-end data and get similar results. Alternatively, you could overlap the pairs using SeqPrep to get higher-quality reads and then assemble those as single-end data.
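To show what overlapping a pair means, here is a rough Python sketch. It is not how SeqPrep works internally: it only accepts exact overlaps, ignores base qualities, and assumes adaptors have already been removed and the fragment is longer than a single read. `merge_pair` and `min_overlap` are names invented for this example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair on the longest exact suffix/prefix overlap.

    read2 is reverse-complemented first so both reads are on the same strand.
    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases is found.
    """
    r2 = reverse_complement(read2)
    for overlap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-overlap:] == r2[:overlap]:
            return read1 + r2[overlap:]
    return None


# Toy example: an 18bp read from each end of a 25bp fragment.
fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"
read1 = fragment[:18]
read2 = reverse_complement(fragment[-18:])
print(merge_pair(read1, read2) == fragment)  # True
```

A real merger also has to tolerate mismatches and pick the better-quality base at each overlapping position, which is where the gain in read quality comes from.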
Hi,
Thanks again for the clarification. Your interpretation is spot on, at least in my case. Regarding the first figure, your definition seems correct to me. Actually, this is not my experiment; I am helping another postdoc in the lab analyze the data. Anyway, I have just started running the analysis with single-end reads, and I will let you know whether this actually improves the assembly. If it doesn't help, I will try the SeqPrep method.
Thanks
Upendra
Hi,
Even using only single-end reads, MIRA couldn't produce a good assembly. There are around 1300 contigs with an N50 of only 4662. This is much better than the paired-end assembly, but I would like to do better. What do you think I need to change to get a better assembly?
You could try a different assembler, as I mentioned in my original answer; other than that, I'm not sure. Your data is suboptimal because the DNA fragment size is so short, and it may be that what you're sequencing has a lot of repeats, which breaks the assembly into many contigs.
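When comparing runs from different assemblers, it helps to compute the N50 the same way for each. Below is a minimal Python sketch of the usual definition (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the total assembly length). The input lengths are toy numbers, not the poster's data.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to half the assembly.
        if running * 2 >= total:
            return length
    return 0


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

N50 alone can be misleading, so it is worth looking at it alongside the contig count and total assembly length when judging whether one run is really better than another.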