Question

Percentage coverage of reference genome by de novo genome

0

Entering edit mode

13 months ago

Lemonhope ▴ 10

Hey everyone

I need some help with SAMtools (v 1.3.1) and such. I have 2 files that I want to align, with the ultimate goal of understanding what percentage of the reference genome is covered by the query genome. Further description below:

File 1: WGS Illumina paired-end reads for a single plant sample that I have assembled using megahit, resulting in a FASTA file of contigs. This will act as my "reference genome".

File 2: FASTA file of contigs generated from de novo assembly of ddRADseq reads for several hundred samples. This will act as my "query"

All I want to do is align my de novo assembly to my "reference" in order to see what percentage of my reference is covered by the de novo assembly. eg: if there are a total of 1 000 000 bp in the reference sequence, how many of these bp have at least 1 bp of the de novo assembly aligned to it ?

I tried a naive approach by aligning them with BWA, then used SAMtools (V 1.3.1) to convert to a BAM file, sort the file and use SAMtools stats to assess the coverage but I am not sure if this approach is doing helping me achieve my aim.

Could anyone offer some advice on how I can achieve my aim ? I am new to the field of bioinformatics.

SAMtools BWA alignment ddRAD • 1.3k views

ADD COMMENT • link updated 13 months ago by Joe 22k • written 13 months ago by Lemonhope ▴ 10

0

Entering edit mode

If you have contigs for assemblies, using read-based approaches like samtools is probably not the way to go.

As you're interested in % coverage, not coverage depth, this is just an 'ordinary' alignment problem. Since it's effectively a multiple pairwise alignment, I'd suggest just using something like mummer as a starting point. Its relatively trivial to compute alignment statistics once you have the alignments.

Presumably you're treating each of the samples in your "query" as independent sequences, and not treating it as a single genome right?

ADD REPLY • link 13 months ago by Joe 22k

0

Entering edit mode

Thank you for the response

to answer you question, no, the ddRADseq data (short paired end illumina reads) for all the samples were used to generate the query file. the data was clustered (CD-HIT) and assembled to generate a de novo reference genome consisting of multiple contigs, resulting in a FASTA file with contigs resulting from the clustering of all the data

ADD REPLY • link 13 months ago by Lemonhope ▴ 10

0

Entering edit mode

OK, in that case it should be even easier as its just a single pairwise alignment, so the same advice holds.

ADD REPLY • link 13 months ago by Joe 22k

0

Entering edit mode

minimap2 can do this too, mummer (nucmer) is more accurate but slightly slower. If you want to do this as part of generating assembly statistics, use Quast which does this as part of its pipeline and also creates plots and other summary statistics.