11 months ago
analyst
I have 396 contigs/scaffolds assembled from Illumina short paired-end reads. The bacterial assembly was generated with SPAdes (de novo assembler). I need to produce a single-sequence genome from these 396 contigs/scaffolds. I don't have long reads.
Kindly suggest any pipeline or tool that can be used to fill the gaps between contigs of Illumina reads.
Thanks
Use long reads or Hi-C.
Thanks colindaven!
I have only short reads. The aim of my study is to identify transcription activator-like effectors (TALEs) in Xanthomonas bacteria. Overall, about 18 TALEs have been reported in Xanthomonas, but I found more than 60 TALEs in my contigs, which does not seem accurate. So I filtered the contigs by length, i.e., contigs with length <= 500 bp were removed, which left 7 TALEs. To get accurate results, I want to assemble the contigs into one sequence.
Please also suggest what criteria I should follow for minimum contig length in this scenario.
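Not an answer to which cutoff is right, but a minimal sketch for seeing how much of the assembly each candidate cutoff keeps before committing to one. It uses plain Python with no dependencies; the function names and the example thresholds are made up for illustration, and the input is assumed to be an ordinary FASTA file such as the SPAdes contigs file.

```python
import io
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_by_min_length(
    handle: Iterable[str], thresholds: Iterable[int]
) -> Dict[int, Tuple[int, int]]:
    """Map each threshold to (contigs kept, total bases kept).

    A contig is kept when its length is >= the threshold.
    """
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    summary = {}
    for t in thresholds:
        kept = [n for n in lengths if n >= t]
        summary[t] = (len(kept), sum(kept))
    return summary


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("contigs.fasta") for real data.
    demo = io.StringIO(">a\nACGT\nACGT\n>b\nAC\n>c\nACGTAC\n")
    for cutoff, (n_contigs, n_bases) in summarize_by_min_length(demo, [1, 5, 7]).items():
        print(f"min_len={cutoff}: {n_contigs} contigs, {n_bases} bases")
```

If the TALE count swings widely between nearby cutoffs, that suggests many of the hits sit on very short contigs, which is worth checking directly rather than filtering away.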
With short-read data alone, you are unlikely to improve the contiguity of your assembly; that requires suitable additional data (i.e., Hi-C or long reads, as suggested).
You could possibly use the synteny with a more contiguous genome assembly of a similar species to guide assembly if available. See the chromosemble tool of the Satsuma2 suite.
Thank you so much dthorbur! I will use it and will let you know.
Thanks dthorbur! I used Chromosemble, which filled the gaps between most of the contigs and reduced their number from 396 to 214.
I would be very careful moving forward with that. There are a lot of potential issues, and I would only ever use it if the reference genome you used was something like a different strain. You inherit misassemblies from the other genome and also incorporate incorrect structural rearrangements among genomes. As suggested this methodology is speculative at best.
Getting the number of contigs down at the cost of accuracy is a bad route to go down.
Yes, the reference genome is a different strain. Do you mean that scaffolding is less prone to errors when the reference genome is a different strain?
I observed that gaps are filled with Ns.
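For anyone wanting to quantify this, here is a small sketch that lists the runs of N in a scaffold sequence. The function name is hypothetical and the sequence is assumed to already be loaded as a string. Note that, depending on the scaffolder and its settings, the length of an N run may be a placeholder rather than a measured gap size.

```python
import re
from typing import List, Tuple


def find_gaps(sequence: str, min_len: int = 1) -> List[Tuple[int, int, int]]:
    """Return (start, end, length) for each run of N/n in the sequence.

    Coordinates are 0-based and half-open, so sequence[start:end] is the gap.
    Runs shorter than min_len are ignored.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_len:
            gaps.append((match.start(), match.end(), length))
    return gaps


if __name__ == "__main__":
    scaffold = "ACGTNNNNACGTnnAC"
    for start, end, length in find_gaps(scaffold):
        print(f"gap at [{start}, {end}) length {length}")
```

Any TALE hit that spans or directly borders one of these gaps was joined by homology to the reference, not by your own reads.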
If you do not have long reads at the moment, the best way to improve your assembly is to get those. Getting more and longer sequences is also the only way to fill gaps (except with N's), any other approach based on homology is speculative at best. There is no pipeline or tool that can extract correct sequences where there is no information (assuming all your reads have already been used to generate the current assembly).
If there is a reference genome of a very close relative or the same species, you can use homology-based scaffolding of the assembly. Have a look, e.g., at the RagTag suite.
If the closest reference is too distant, it might be best to leave the assembly as it is.
Thanks Michael I will use it and will share the outcome :)
I tried RagTag for scaffolding the contigs. It generated 94 scaffolds out of 396 contigs. I used the scaffold option. There are other options in RagTag too, like correct, patch and merge. Do you suggest using patch after scaffolding?
This looks actually quite like an "improvement", if it is one. Whether you use additional methods or not, the name of the software has been chosen for a reason, because that's literally what you are getting, you have no further information about the gaps and therefore you cannot draw conclusions about e.g. the synteny of genes in those regions. On the other hand, you identified a lot of TALE elements you were looking for. Do you have evidence that the unexpectedly high number of these elements is indeed due to duplicated contigs, or is it maybe real?
I am very skeptical of trying to "massage" the data in such a way that it fits the expectations; on the other hand, it is understandable that you wish to improve the quality of your draft assembly. However, you are not going to be able to close many more of the remaining gaps without further sequencing, and reaching a single chromosome/scaffold genome is not possible this way. So whoever gave you the task of producing such an ideal genome needs to pay for long-read sequencing and possibly Hi-C.
If you want to proceed with the patched genome, I recommend that you at the very least map back your sequencing reads to the genome to identify regions of low coverage and to run some assembly QC by Quast and BUSCO.
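As a rough sketch of the low-coverage check: once the reads are mapped back, a per-base depth table can be scanned for stretches below a chosen depth. The code below assumes a three-column, tab-separated table (contig, 1-based position, depth) covering every position, like the output of `samtools depth -a`. The function name and the `min_depth=10` cutoff are arbitrary placeholders, not recommendations.

```python
from typing import Iterable, List, Tuple


def low_coverage_regions(
    lines: Iterable[str], min_depth: int = 10
) -> List[Tuple[str, int, int]]:
    """Merge consecutive positions with depth < min_depth into regions.

    Returns (contig, start, end) with 1-based inclusive coordinates.
    Input lines are assumed sorted by contig and position.
    """
    regions = []
    current = None  # [contig, start, end] of the region being extended
    for line in lines:
        line = line.strip()
        if not line:
            continue
        contig, pos_text, depth_text = line.split("\t")
        pos, depth = int(pos_text), int(depth_text)
        is_low = depth < min_depth
        extends = (
            is_low
            and current is not None
            and current[0] == contig
            and pos == current[2] + 1
        )
        if extends:
            current[2] = pos
        else:
            if current is not None:
                regions.append((current[0], current[1], current[2]))
            current = [contig, pos, pos] if is_low else None
    if current is not None:
        regions.append((current[0], current[1], current[2]))
    return regions


if __name__ == "__main__":
    depth_table = [
        "c1\t1\t30", "c1\t2\t3", "c1\t3\t0", "c1\t4\t25",
        "c2\t1\t2", "c2\t2\t40",
    ]
    for contig, start, end in low_coverage_regions(depth_table):
        print(f"{contig}:{start}-{end}")
```

Regions that fall inside or next to the joins introduced by scaffolding are the ones to treat with the most suspicion.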
Inspector (only for long reads) and Merqury for short reads are also pretty good methods for doing a QC of your assembly.
Thanks colindaven!
Honestly, I have not assessed duplication after assembly. However, I used Trimmomatic to filter the data before assembly.
Trimmomatic parameters:
Reads were reduced from 9102732 to 7758478 sequences after filtration. Read length is 150 bp before and after filtration since I set MINLEN:150. Do you think reducing the read-length threshold would have any effect? Here is the FastQC report after filtration.