Entering edit mode
3.4 years ago
K
▴
10
Hi,
I have assembled a genome (PacBio HiFi) into contigs using IPA and Hifiasm. Those contigs were aligned to the Illumina mate pair reads of different sizes 2,5,7,10 kb using BWA-MEM so as to extend it for scaffolding. This generated .sam file. I tried SAMTOOLS after BWA-MEM alignment and received .bam, coverage, stat and depth file.
I am able to plot the depth of the individual contig/s and tried IGV tools. However, I am not sure about
- how to use this information in extending the contigs and scaffolding?
- which reads of Illumina are not mapped to the contigs?
- how to sort those illumina reads based on the depth/coverage on each contigs?
- Are there any software/tool for this purpose?
Any suggestions or comments?
Thank you
What have you tried to get info on scaffolding approaches? Doing some googling will quickly point you to some interesting tools (btw, most of those will do their own "alignments" so perhaps there will be no need to do this a priori).
Since you already used PacBio HiFi reads for your assembly it might very well be the illumina MP data will not add much additional info to it. Illumina MP data is notoriously 'dirty' and the avg PacBio read will likely already be longer than your MP data. So be prepared to only have marginal gain (if any at all) to your assembly stats.
Agreed. If there is sufficient HiFi coverage, extending the dataset with short-read (mate pair) data is likely to give you very marginal contiguity gains if any, unless that short read data is Hi-C chromatin linkage data.
In which case scaffolding algorithms such as 3D-DNA & SALSA2 exist.
Thanks. I will skip ILLUMINA MP than and proceed with my assembly. Thanks for the recommending those tools. I will have a loo k at them.
I read about SALSA and it requires Hi-C data. Is there a way to do it without it? I do not have Hi-C data.
Thank you
No, that's what I was saying, that 3D-DNA & SALSA will only work if you have Hi-C data. They are Hi-C scaffolding algorithms.
Thank you.
Thank you for your reply. I appreciate that.
I am going to use SSPACE-STANDARD (https://github.com/nsoranzo/sspace_basic) and Gapcloser package for (https://github.com/BGI-Qingdao/TGS-GapCloser).
Do think, I can use PacBio assembly for SVs and SNPs analysis as it is (without adding illumina MP)? Is there any additional steps to perform before doing so. I did polishing twice on the PacBio assembly.
Polishing PacBio HiFi data is not necessary, the data is already very accurate. Polishing is generally only necessary for high error reads such as PacBio CLR or ONT data. In some instances, polishing a HiFi assembly with short reads can actually decrease the quality.
In general, if you have more than 20X+ of HiFi data, you will be able to generate decent assemblies, as well as call SNPs & SVs. Of course more coverage is better (up to a point) but 20X is a solid starting point.
Thank you for your reply. I appreciate that. I am getting good assembly. Now will call SNPs and SVs.
How can I get the coverage stat in this case? DO you recommend any tools/software?
There are many ways to get the coverage stats:
If you know the genome size than the coverage is a simple calculation of total_input_bp / genome_size = coverage
If you don't know the genome size you can count kmers in your HiFi reads with a tool like yak (https://github.com/lh3/yak) to generate a histogram that should give you a general idea of depth / coverage.
You can look in the log file that was generated from hifiasm. The histogram inside will give you a general idea of the input coverage for the assembly, and will be similar to what was generated for yak
Or you can map the HiFi reads back to an assembly and measure depth with a tool like "mosdepth" or "samtools depth" to get a very accurate per bp depth count.
Thank you for reply.
Here genome size is the reference genome size or my assembled genome size after IPA assembly? Reference is ~303 Mb and my assembly is ~313 Mb (contigs without scaffolding)
I have assembly from IPA. HIFIASM was not that appropriate.
What do you mean by this? "you can map the HiFi reads back to an assembly and measure depth with a tool like "mosdepth" or "samtools depth" to get a very accurate per bp depth count"
IPA contigs mapping on raw .fastq assembly file using samtools?
Is is good to do like this:
./minimap2 -ax map-hifi IPA.fasta pacbio-ccs.fq.gz > aln.sam
samtools view -S -b aln.sam > aln.sam.bam
samtools flagstat aln.sam.bam
samtools sort aln.sam.bam > aln.sam.sorted.bam
samtools index aln.sam.sorted.bam
samtools depth aln.sam.sorted.bam > aln.sam.sorted.bam.depth
samtools coverage aln.sam.sorted.bam > aln.sam.sorted.bam.coverage
Thank you