Question

How to extend assembled PACBIO contigs for scaffolding

0

Entering edit mode

3.5 years ago

K ▴ 10

Hi,

I have assembled a genome (PacBio HiFi) into contigs using IPA and Hifiasm. Those contigs were aligned to the Illumina mate pair reads of different sizes 2,5,7,10 kb using BWA-MEM so as to extend it for scaffolding. This generated .sam file. I tried SAMTOOLS after BWA-MEM alignment and received .bam, coverage, stat and depth file.

I am able to plot the depth of the individual contig/s and tried IGV tools. However, I am not sure about

how to use this information in extending the contigs and scaffolding?
which reads of Illumina are not mapped to the contigs?
how to sort those illumina reads based on the depth/coverage on each contigs?
Are there any software/tool for this purpose?

Any suggestions or comments?

Thank you

pacbio scaffold • 3.6k views

ADD COMMENT • link 3.5 years ago by K ▴ 10

1

Entering edit mode

What have you tried to get info on scaffolding approaches? Doing some googling will quickly point you to some interesting tools (btw, most of those will do their own "alignments" so perhaps there will be no need to do this a priori).

Since you already used PacBio HiFi reads for your assembly it might very well be the illumina MP data will not add much additional info to it. Illumina MP data is notoriously 'dirty' and the avg PacBio read will likely already be longer than your MP data. So be prepared to only have marginal gain (if any at all) to your assembly stats.

ADD REPLY • link 3.5 years ago by lieven.sterck 15k

1

Entering edit mode

Agreed. If there is sufficient HiFi coverage, extending the dataset with short-read (mate pair) data is likely to give you very marginal contiguity gains if any, unless that short read data is Hi-C chromatin linkage data.

In which case scaffolding algorithms such as 3D-DNA & SALSA2 exist.

ADD REPLY • link 3.5 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thanks. I will skip ILLUMINA MP than and proceed with my assembly. Thanks for the recommending those tools. I will have a loo k at them.

I read about SALSA and it requires Hi-C data. Is there a way to do it without it? I do not have Hi-C data.

Thank you

ADD REPLY • link 3.5 years ago by K ▴ 10

0

Entering edit mode

No, that's what I was saying, that 3D-DNA & SALSA will only work if you have Hi-C data. They are Hi-C scaffolding algorithms.

ADD REPLY • link 3.5 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thank you.

ADD REPLY • link 3.5 years ago by K ▴ 10

0

Entering edit mode

Thank you for your reply. I appreciate that.

I am going to use SSPACE-STANDARD (https://github.com/nsoranzo/sspace_basic) and Gapcloser package for (https://github.com/BGI-Qingdao/TGS-GapCloser).

Do think, I can use PacBio assembly for SVs and SNPs analysis as it is (without adding illumina MP)? Is there any additional steps to perform before doing so. I did polishing twice on the PacBio assembly.

ADD REPLY • link 3.5 years ago by K ▴ 10

1

Entering edit mode

Polishing PacBio HiFi data is not necessary, the data is already very accurate. Polishing is generally only necessary for high error reads such as PacBio CLR or ONT data. In some instances, polishing a HiFi assembly with short reads can actually decrease the quality.

In general, if you have more than 20X+ of HiFi data, you will be able to generate decent assemblies, as well as call SNPs & SVs. Of course more coverage is better (up to a point) but 20X is a solid starting point.

ADD REPLY • link 3.5 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thank you for your reply. I appreciate that. I am getting good assembly. Now will call SNPs and SVs.

How can I get the coverage stat in this case? DO you recommend any tools/software?

ADD REPLY • link 3.5 years ago by K ▴ 10

1

Entering edit mode

There are many ways to get the coverage stats:

If you know the genome size than the coverage is a simple calculation of total_input_bp / genome_size = coverage

If you don't know the genome size you can count kmers in your HiFi reads with a tool like yak (https://github.com/lh3/yak) to generate a histogram that should give you a general idea of depth / coverage.

You can look in the log file that was generated from hifiasm. The histogram inside will give you a general idea of the input coverage for the assembly, and will be similar to what was generated for yak

Or you can map the HiFi reads back to an assembly and measure depth with a tool like "mosdepth" or "samtools depth" to get a very accurate per bp depth count.

ADD REPLY • link 3.5 years ago by gconcepcion ▴ 410

0

Entering edit mode

Thank you for reply.

Here genome size is the reference genome size or my assembled genome size after IPA assembly? Reference is ~303 Mb and my assembly is ~313 Mb (contigs without scaffolding)

I have assembly from IPA. HIFIASM was not that appropriate.

What do you mean by this? "you can map the HiFi reads back to an assembly and measure depth with a tool like "mosdepth" or "samtools depth" to get a very accurate per bp depth count"

IPA contigs mapping on raw .fastq assembly file using samtools?

Is is good to do like this:

./minimap2 -ax map-hifi IPA.fasta pacbio-ccs.fq.gz > aln.sam

samtools view -S -b aln.sam > aln.sam.bam

samtools flagstat aln.sam.bam

samtools sort aln.sam.bam > aln.sam.sorted.bam

samtools index aln.sam.sorted.bam

samtools depth aln.sam.sorted.bam > aln.sam.sorted.bam.depth

samtools coverage aln.sam.sorted.bam > aln.sam.sorted.bam.coverage

Thank you

ADD REPLY • link 3.5 years ago by K ▴ 10