Question

Filter query coverage

0

Entering edit mode

2.8 years ago

a.mostafa5050 ▴ 10

Hi,

Could you please provide me with a tool that I could run after bowtie2 that could filter queries from bam files with for example 80% of short reads coverage (meaning short reads cover 80% of the query)?

alignment • 1.3k views

ADD COMMENT • link updated 2.8 years ago by seidel 11k • written 2.8 years ago by a.mostafa5050 ▴ 10

0

Entering edit mode

meaning short reads cover 80% of the query

what is "query" in that context ?

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I am aligning the short reads against the CARD database using Bowtie2. After filtering the mapped reads, sorting and indexing, I ran samtools idxstats. This gives me the queries with their abundances. But some of queries were not totally covered by shortreads. Therefore, I want to exclude the queries with low coverage. For example, keep only queries with at least 80% coverage.

ADD REPLY • link 2.8 years ago by a.mostafa5050 ▴ 10

score 0 · Answer 1 · 2022-02-04

You could use R and the GenomicRanges library, if you know R and the principles associated with this library. You can import a BAM file, convert it to a GRanges object, and call coverage(), which will create an RLE vector of coverage across the genome. Then import a BED file (or text file) of coordinates for which you'd like assess coverage, and use those coordinates to access your coverage object to query how many bases have coverage greater than zero, and divide this by the length of your query region to get a percentage. This answer assumes you know R, and are comfortable with the GenomicRanges library, and how to access RLE (Run Length Encoded) vectors, and even within this context, there are several different ways to approach it.

Given that each alignment target is a unit of interest, here's an example showing the percent of each target covered with just a few lines of code:

library(GenomicAlignments)

# import aligned reads to a GRanges object
gr <- as(readGAlignments("out.bam"), "GRanges")

# calculate coverage
cov <- coverage(gr)

# calculate percentage of bases with coverage greater than zero
coveragePercent <- lapply(cov, function(x) 100 * sum(x > 0)/length(x))

# view the results
unlist(coveragePercent)

Results:

CLX001 CLX002 CLX003 CLX004 CLX005 CLX006 CLX007 
41.2   0.0    37.9   69.0   74.7   81.1   0.0 
CLX008 CLX009 CLX010 CLX011 CLX012 CLX013 CLX014
0.0    51.9   60.7   0.0    36.3   55.9   95.9

You could also use a command-line approach using bedtools and the genomecov option to input a BAM file, report the depth at each genome position, and query how many zeros you get within a query region. Something like:

genomeCoverageBed -d -ibam out.bam

would give you per base coverage for every base, and you'd have to count how many zeros you get for each alignment target using a script or some fancy awk or other parsing tricks.