Does anyone know how to downsample a high coverage bam file such that it has a ceiling on number of reads at each position? Most of the downsampling tools (Picard, GATK walkers) downsamples the original bam file to a certain fraction of coverage, but that's not suited to my particular purposes. I found the --number option in the GATKCommandLine but for some strange reason it always stops working after about a hundred bases. If anyone has used the --number option in GATK successfully please let me know so I can investigate that more, if not please does anyone have any suggestion? Thanks!
I'm not sure why the GATK downsample doesn't meet this requirement. If you are using it when variant calling for instance it is downsampling to whatever you specify, the coverage at a position is the number of reads.
Hi, thanks for answering! I understand what you're saying, but I'm not using GATK or other variant calling tools to call variants. These reads don't map very well to the reference sequence and I intend to use the reads in the bam files for denovo assembly to get fasta files that can be used in multiple sequence alignment, which would then reveal the "variants" or differences between samples. One problem with that is the coverage (number of reads) per site is different among different samples and also across different sites. As the assemblers may have better information at sites with greater read depths and act different at sites with lower read depths, and there is a systematic difference between groups of samples I have in terms of mean whole sequence read depth, I am worried any variants I get from the multiple sequence alignment may be a result of assembly artefacts. Therefore the hope to downsample everyone's bam files by capping everyone's read depth at every site at a particular number.. Hope I am making sense? Do you have any suggestions how to do this better?