Hello everyone.
I am going to construct a rice pangenome using an iterative assembly approach with paired-end sequencing data. I will map the paired-end reads to the reference genome and extract the unmapped reads. Then I will assemble the extracted unmapped reads de novo into novel contigs. The newly assembled sequences will be added to the current reference genome to generate the pangenome.
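To make the "extract the unmapped reads" step concrete, here is a minimal sketch of how the SAM FLAG field identifies read pairs in which both mates failed to map. The bit values come from the SAM specification; the function names `pair_status` and `unmapped_pair_names` are hypothetical and only for illustration. In practice this filtering is usually done on the BAM file directly (for example, `samtools view -f 12` requires both the "read unmapped" and "mate unmapped" bits), so treat this as an explanation of the logic rather than a replacement for that tool.

```python
# SAM FLAG bits, as defined in the SAM specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def pair_status(flag: int) -> str:
    """Classify a paired-end SAM record by the mapping state of the read and its mate."""
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "both_unmapped"
    if read_unmapped:
        return "read_unmapped_mate_mapped"
    if mate_unmapped:
        return "read_mapped_mate_unmapped"
    return "both_mapped"


def unmapped_pair_names(sam_lines):
    """Return the names of read pairs in which both mates are unmapped.

    `sam_lines` is any iterable of SAM-formatted text lines (header lines allowed).
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):  # ignore non-primary records
            continue
        if pair_status(flag) == "both_unmapped":
            names.add(name)
    return names
```

Whether to also keep pairs where only one mate is unmapped is a design choice; this sketch keeps only pairs with both mates unmapped, which is an assumption and not something stated in the plan above.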
But I am not clear on how to assess the quality of the assembly of the unmapped reads.
I have read some papers that use:
- QUAST (Quality Assessment Tool for Genome Assemblies), or
- remapping the raw paired-end reads back to the assembled sequence
and then calculating mapping statistics with the flagstat command (samtools flagstat).
(My understanding of this step is that they map the original reads back to the assembled contigs to check whether all positions in the genome are "covered" by at least one read. "Covered" here means that at least one read from the original data can be mapped (aligned) to that position on the contigs. If every position is covered by at least one read, it means that no part of the genome was lost in the assembly process.
If any position has no mapped reads, it could indicate loss or omission in the assembly process. This might be due to regions of the genome that are challenging to assemble, or to other issues.)
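The "covered by at least one read" idea can be turned into a number per contig. The sketch below assumes a three-column, tab-separated table of contig, position, and depth with one row per position, including zero-depth positions (the layout `samtools depth -a` writes). The function name `coverage_breadth` and the `min_depth` parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict


def coverage_breadth(depth_lines, min_depth=1):
    """Return {contig: fraction of positions with depth >= min_depth}.

    `depth_lines` is an iterable of tab-separated lines: contig, position, depth.
    Every position of every contig is assumed to be present, including depth 0.
    """
    total = defaultdict(int)
    covered = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        contig, _pos, depth = line.split("\t")[:3]
        total[contig] += 1
        if int(depth) >= min_depth:
            covered[contig] += 1
    return {contig: covered[contig] / total[contig] for contig in total}
```

A breadth of 1.0 for a contig means every position on it has at least `min_depth` reads aligned; contigs with lower values are the ones worth inspecting more closely.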
Thank you so much.
Please validate (click on the green check mark on the left side of the answer) or comment on your previous questions: "How to construct a blast database for non-green plant species and fungi" and "download BLAST NCBI nt database".
Thank you for reminding me. I did it.
So you would not do any additional testing to ensure that the "unmapped" reads are actually rice sequences before you try to assemble them? If they are not from rice why would you want to add this extraneous sequence back in your assembly?
Thank you for your suggestion. When I finish assembling the unmapped reads, I am going to check for contamination and redundant contigs in my assembled sequence and remove them before I add the novel contigs to the pangenome. (I plan to use FCS-GX to detect the contamination.)
Assuming my assembled sequence is free of contamination, is there any suggestion for how to check the quality of the assembled sequence of unmapped reads? I appreciate your advice. Thank you.