Question

How to calculate percent coverage of each read for nanopore

0

Entering edit mode

3.6 years ago

sonsunjirachote • 0

Hello, I am assigned to investigate the read the cover the reference more than 80%. I know that samtools can be used to calculate coverage and depth for the interested regions but what about each read? For example, I have read 1-10 that cover region A, I want to know their percent coverage for all of each read. Expect output: read 1 70% cover, leftmost position 1 (like sam format column) Is there any tools or script that can be used for this? Thanks in advance.

alignment nanopore sequencing • 3.4k views

ADD COMMENT • link 3.6 years ago by sonsunjirachote • 0

1

Entering edit mode

It seems like several concepts may be confused. "Reads" don't have a coverage, they're just one sequenced molecule. Instead, mapped reads can be characterized by their "length" and "mismatch rate". So are you trying to get the coverage of the sequence, get reads where >80% match the reference, get reads that cover 80% of the reference (read length), or something else entirely?

ADD REPLY • link 3.6 years ago by Jautis ▴ 580

0

Entering edit mode

It seems like several concepts may be confused. "Reads" don't have a coverage, they're just one sequenced molecule. Instead, mapped reads can be characterized by their "length" and "mismatch rate". So are you trying to get the coverage of the sequence, get reads where >80% match the reference, get reads that cover 80% of the reference (read length), or something else entirely?

ADD REPLY • link 3.6 years ago by Jautis ▴ 580

0

Entering edit mode

Yes, I want the reads that cover 80% of my interested region.

ADD REPLY • link 3.6 years ago by sonsunjirachote • 0

score 1 · Answer 1 · 2021-04-27

1

Entering edit mode

3.6 years ago

colindaven 7.0k

Have a look at something like sambamba depth to obtain coverage stats (for a bam mapped to a reference genome)

From a bash script I use frequently:

    input=$i
    sec_input=${input%%.bam}

    window=100000
    overlap=50000
    covMax=999999999
    threads=1

sambamba depth window -t $threads --max-coverage=$covMax --window-size=$window --overlap $overlap -c 0.00001 ${sec_input}.bam > ${sec_input}_cov_window.txt &

ADD COMMENT • link 3.6 years ago by colindaven 7.0k

0

Entering edit mode

Thanks, but what is the input file?

ADD REPLY • link 3.6 years ago by sonsunjirachote • 0

1

Entering edit mode

Reads aligned to your reference of interest in BAM format.

ADD REPLY • link 3.6 years ago by GenoMax 147k

0

Entering edit mode

If all this is not clear I'd recommend some basics:

Galaxy - try the M. tuberculosis Variant Analysis tutorial.

https://training.galaxyproject.org/training-material/topics/variant-analysis/

And or there is lots of stuff on youtube on basic NGS bioinformatics.

ADD REPLY • link 3.6 years ago by colindaven 7.0k

0

Entering edit mode

I have tried with my sorted bam file but the message show that sambamba-depth: All files must be coordinate-sorted. Processing reference #3317 (tufA) Processing reference #3318 (fusA) Then it generated empty file, only the header column show in file.

ADD REPLY • link 3.6 years ago by sonsunjirachote • 0

0

Entering edit mode

It looks like you aligned to genes individually ? The standard approach is to align the reads to the complete genome.

ADD REPLY • link 3.6 years ago by colindaven 7.0k

0

Entering edit mode

Oh, I see. the region that I interested is the gene. Anyways thanks.

ADD REPLY • link 3.6 years ago by sonsunjirachote • 0

1

Entering edit mode

You could use samtools idxstats your.bam to get number of reads that are aligned to each gene in your reference.

ADD REPLY • link 3.6 years ago by GenoMax 147k

1

Entering edit mode

I think you need to adjust your question. What did you do ? Which organism ? What are your research questions. Normally, align to the genome, then view the coordinate sorted bam in IGV.

For gene positions, get a GFF3 or GTF file and view this together with your reads. Good luck

ADD REPLY • link 3.6 years ago by colindaven 7.0k

0

Entering edit mode

Thanks for advice! I have question about mapping to reference genome (FASTA) with gene annotation file (GTF or GFF) what is difference of output that they generate in SAM file when compare to mapping with mapping to reference genome (FASTA) only. Thanks in advance!

ADD REPLY • link 3.6 years ago by sonsunjirachote • 0