Construction of single sequence assembly out of contigs
12 months ago
analyst ▴ 50

I have 396 contigs/scaffolds assembled from Illumina paired-end short reads of a bacterial genome. The assembly was generated de novo with SPAdes. I need to build a single genome sequence out of these 396 contigs/scaffolds, and I don't have long reads.

Kindly suggest any pipeline or tool that can be used to fill the gaps between contigs of Illumina reads.

Thanks

Contigs Bacteria Genome WGS • 1.7k views

Use long reads or Hi-C.


Thanks colindaven!

I have only short reads. The aim of my study is to identify transcription activator-like effectors (TALEs) in Xanthomonas bacteria. About 18 TALEs have been reported in Xanthomonas overall, but I found more than 60 TALEs in my contigs, which does not seem accurate. I therefore filtered the contigs by length, removing those of 500 bp or shorter, which left 7 TALEs. To get accurate results, I want to assemble the contigs into one sequence.

Please also suggest what minimum contig length criterion I should use in this scenario.
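
To see how strongly the contig count depends on the length cutoff, here is a minimal sketch in plain Python. The function names `read_fasta` and `drop_short_contigs` and the toy sequences are made up for illustration; a real run would read the SPAdes contigs file instead of the in-memory example.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def drop_short_contigs(records, cutoff):
    """Remove contigs of length <= cutoff (the rule described above)."""
    return [(name, seq) for name, seq in records if len(seq) > cutoff]

# Toy input standing in for a contigs FASTA (made-up sequences).
demo = io.StringIO(">c1\nACGTACGTAC\n>c2\nACG\n>c3\nACGTAC\nGT\n")
contigs = list(read_fasta(demo))

# Report contigs kept and total bases kept for a few hypothetical cutoffs.
for cutoff in (0, 3, 8):
    kept = drop_short_contigs(contigs, cutoff)
    print(cutoff, len(kept), sum(len(seq) for _, seq in kept))
```

Running the same loop with cutoffs such as 200, 500 and 1000 on the real file, and counting TALE hits at each step, would show whether the TALE number is driven mainly by the shortest contigs.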


If you're only using short-read data, you are unlikely to improve the contiguity of your assembly without additional data (i.e., Hi-C or long reads, as suggested).

You could possibly use synteny with a more contiguous genome assembly of a similar species to guide the assembly, if one is available. See the Chromosemble tool of the Satsuma2 suite.


Thank you so much dthorbur! I will use it and will let you know.


Thanks dthorbur. I used Chromosemble, which filled the gaps between most of the contigs, reducing their number from 396 to 214.


I would be very careful moving forward with that. There are a lot of potential issues, and I would only ever use it if the reference genome was something as close as a different strain. You inherit misassemblies from the other genome and can also incorporate structural rearrangements that differ between the genomes. As suggested, this methodology is speculative at best.

Getting the number of contigs down at the cost of accuracy is a bad route to go down.


Yes, the reference genome is a different strain. Do you mean that if the reference genome is a different strain, scaffolding will be less prone to errors?


I observed that gaps are filled with Ns.
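
For anyone wanting to quantify this, a minimal sketch that locates runs of N in a scaffold sequence. The function name `find_gaps` and the toy scaffold are made up for illustration; each run of Ns marks a join whose sequence is unknown.

```python
import re

def find_gaps(seq, min_run=1):
    """Return (start, end) coordinates (0-based, end-exclusive) of runs of N or n."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_run
    ]

# Toy scaffold: two contig pieces joined by 100 Ns, plus a short 5-base gap.
scaffold = "ACGT" + "N" * 100 + "GGCC" + "n" * 5 + "TT"
gaps = find_gaps(scaffold)

for start, end in gaps:
    print(start, end, end - start)
```

Counting the gaps per scaffold gives the number of joins that rest on the reference rather than on your own reads.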


If you do not have long reads at the moment, the best way to improve your assembly is to get those. Getting more and longer sequences is also the only way to fill gaps (except with N's); any other approach based on homology is speculative at best. There is no pipeline or tool that can extract correct sequences where there is no information (assuming all your reads have already been used to generate the current assembly).

If there is a reference genome of a very close relative or the same species, you can use homology-based scaffolding of the assembly. Have a look, e.g., at the RagTag suite.

If the closest reference is too distant, it might be best to leave the assembly as it is.


Thanks Michael I will use it and will share the outcome :)


I tried RagTag for scaffolding the contigs. It generated 94 scaffolds out of the 396 contigs. I used the scaffold option. RagTag has other options too, such as correct, patch and merge. Do you suggest using patch after scaffolding?


This actually looks quite like an "improvement", if it is one. Whether you use additional methods or not, the name of the software was chosen for a reason: that is literally what you are getting. You have no further information about the gaps, and therefore you cannot draw conclusions about, e.g., the synteny of genes in those regions. On the other hand, you identified a lot of the TALE elements you were looking for. Do you have evidence that the unexpectedly high number of these elements is indeed due to duplicated contigs, or is it maybe real?

I am very skeptical of trying to "massage" the data so that it fits expectations. On the other hand, it is understandable that you wish to improve the quality of your draft assembly. However, you are not going to be able to close many more of the remaining gaps without further sequencing, and reaching a single chromosome/scaffold genome is not possible this way. So whoever gave you the task of producing such an ideal genome needs to pay for long-read sequencing and possibly Hi-C.

If you want to proceed with the patched genome, I recommend that you at the very least map your sequencing reads back to the genome to identify regions of low coverage, and run some assembly QC with QUAST and BUSCO.
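
As a rough illustration of the low-coverage check, the sketch below merges consecutive low-depth positions into intervals. It assumes you already have a three-column, tab-separated per-base depth table (contig, 1-based position, depth), which is the layout `samtools depth -a` writes; the mapping step itself is not shown. The function name `low_coverage_regions`, the threshold and the toy table are made up for illustration.

```python
import io

def low_coverage_regions(lines, max_depth):
    """Merge consecutive positions with depth <= max_depth into
    (contig, start, end) intervals, 1-based and inclusive."""
    regions = []
    current = None  # [contig, start, end] of the interval being extended
    for line in lines:
        if not line.strip():
            continue
        contig, pos, depth = line.rstrip("\n").split("\t")[:3]
        pos, depth = int(pos), int(depth)
        if depth <= max_depth:
            if current and current[0] == contig and pos == current[2] + 1:
                current[2] = pos  # extend the open interval
            else:
                if current:
                    regions.append(tuple(current))
                current = [contig, pos, pos]  # start a new interval
        elif current:
            regions.append(tuple(current))
            current = None
    if current:
        regions.append(tuple(current))
    return regions

# Toy depth table standing in for real per-base depth output.
depth_table = io.StringIO(
    "s1\t1\t30\ns1\t2\t2\ns1\t3\t0\ns1\t4\t25\ns1\t5\t1\n"
    "s2\t1\t0\ns2\t2\t0\ns2\t3\t40\n"
)
regions = low_coverage_regions(depth_table, max_depth=2)
for contig, start, end in regions:
    print(contig, start, end)
```

Low-depth intervals that sit next to scaffold joins, or inside predicted TALE genes, are the places where the scaffolded sequence deserves the least trust.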


Inspector (only for long reads) and Merqury for short reads are also pretty good methods for doing a QC of your assembly.


Thanks colindaven!


Honestly, I have not assessed duplication after assembly. However, I used Trimmomatic to filter the data before assembly.

Trimmomatic parameters:

trimmomatic PE -threads 80 input_1.fq.gz input_2.fq.gz trim_1_P.fq.gz trim_1_U.fq.gz trim_2_P.fq.gz trim_2_U.fq.gz ILLUMINACLIP:adapter.fasta:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:150

Reads were reduced from 9102732 to 7758478 sequences after filtration. Read length is 150 bp before and after filtration since I set MINLEN:150. Do you think reducing the read length threshold would have any effect? Here is the FastQC report after filtration.

[Image: FastQC report after filtration]
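
One way to judge the MINLEN question: assuming the raw reads are at most 150 bp, MINLEN:150 discards every read that loses even a single base to adapter or quality trimming, so only untrimmed reads survive. The sketch below counts how many reads would pass different thresholds. It is meant to be run on reads trimmed without the MINLEN step; the function names and the toy FASTQ are made up for illustration, and it assumes the plain four-lines-per-record FASTQ layout.

```python
import io

def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record holds the sequence
            yield len(line.strip())

def surviving(lengths, minlen):
    """Count reads whose length is at least minlen."""
    return sum(1 for n in lengths if n >= minlen)

# Toy FASTQ standing in for trimmed reads (made-up records, lengths 10, 8, 10, 5).
toy_fastq = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r4\nACGTA\n+\nIIIII\n"
)
lengths = list(read_lengths(toy_fastq))

# Compare a few hypothetical MINLEN values.
for minlen in (10, 8, 5):
    print(minlen, surviving(lengths, minlen))
```

For a gzipped file, opening it with `gzip.open(path, "rt")` and passing the handle to `read_lengths` should work the same way.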
