hi folks,
Apologies if this has been answered elsewhere. I'm using read mapping to quantify the abundance of viral metagenome-assembled genomes (MAGs) across samples, and I'd like to do some data cleaning that's proving trickier than anticipated. I'm mapping reads from several samples against a set of MAGs that I cross-assembled from those same samples. I use BBMap for this, then convert the SAM output into sorted BAM files that I feed to CoverM to produce a counts per million (CPM) matrix across samples. CoverM is great, but it only enables abundance filtering based on the fraction of a contig that is covered by reads in that sample (e.g. 10%). This gets tricky when contig lengths span an order of magnitude or more, because the same fraction corresponds to very different absolute lengths, so I'd like to filter on absolute read counts and total alignment length as well.
Is there a way, via samtools or otherwise, to filter SAM/BAM files so that a contig's alignments in a given sample are removed (or its counts set to zero) when they cover less than a combined length x or include fewer than y reads? The intended effect is to keep single-pair alignments, and alignments that include several reads but span only 300 bp or so, out of the abundance calculations.
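To make the goal concrete, here is a rough sketch of the kind of filter described above, not a tested pipeline. It assumes the per-contig read count and covered length come from the tab-delimited output of `samtools coverage` run on each sample's sorted BAM (columns `#rname`, `numreads`, `covbases`), that `covbases` is an acceptable stand-in for "total alignment length", and that a contig should be zeroed if it fails either threshold. The function names, contig names, thresholds, and CPM values are all hypothetical.

```python
import csv
import io


def contigs_passing(coverage_tsv, min_reads, min_covered_bp):
    """Return contig names that meet BOTH thresholds in one sample.

    coverage_tsv is assumed to be the text of `samtools coverage` output,
    including its header line starting with '#rname'.
    """
    reader = csv.DictReader(io.StringIO(coverage_tsv), delimiter="\t")
    keep = set()
    for row in reader:
        enough_reads = int(row["numreads"]) >= min_reads
        enough_span = int(row["covbases"]) >= min_covered_bp
        if enough_reads and enough_span:
            keep.add(row["#rname"])
    return keep


def zero_failing(cpm_by_contig, keep):
    """Set CPM to 0.0 for contigs not in keep (no renormalisation is done)."""
    return {c: (v if c in keep else 0.0) for c, v in cpm_by_contig.items()}


# Made-up example for a single sample (values are illustrative only).
header = ["#rname", "startpos", "endpos", "numreads", "covbases",
          "coverage", "meandepth", "meanbaseq", "meanmapq"]
rows = [
    ["contig_long", "1", "50000", "400", "12000", "24.0", "1.2", "36", "40"],
    ["contig_pair", "1", "8000", "2", "300", "3.75", "0.04", "36", "40"],
    ["contig_pile", "1", "60000", "40", "310", "0.52", "0.1", "36", "40"],
]
example_tsv = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

keep = contigs_passing(example_tsv, min_reads=10, min_covered_bp=1000)
example_cpm = {"contig_long": 900.0, "contig_pair": 5.0, "contig_pile": 95.0}
filtered = zero_failing(example_cpm, keep)
print(sorted(keep))
print(filtered)
```

Note that zeroing entries after the CPM calculation means the remaining values in that sample no longer sum to a million; whether to renormalise, or to drop the failing contigs' reads from the BAM before running CoverM, is a separate choice.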
thanks! bryan