Downsampling By Having A Ceiling On Number Of Reads At Each Site
10.8 years ago
caina89 ▴ 30

Does anyone know how to downsample a high-coverage BAM file so that it has a ceiling on the number of reads at each position? Most downsampling tools (Picard, the GATK walkers) downsample the original BAM file to a certain fraction of its coverage, but that isn't suited to my particular purpose. I found the --number option on the GATK command line, but for some strange reason it always stops working after about a hundred bases. If anyone has used the --number option in GATK successfully, please let me know so I can investigate it further; if not, does anyone have any suggestions? Thanks!


I'm not sure why the GATK downsampling doesn't meet this requirement. If you are using it during variant calling, for instance, it downsamples to whatever depth you specify, and the coverage at a position is the number of reads.


Hi, thanks for answering! I understand what you're saying, but I'm not using GATK or other tools to call variants. These reads don't map very well to the reference sequence, so I intend to use the reads in the BAM files for de novo assembly, producing FASTA files for multiple sequence alignment, which would then reveal the "variants", i.e. the differences between samples. One problem with that is that the coverage (number of reads) per site differs among samples and across sites. The assemblers have more information at sites with greater read depth and may behave differently at sites with lower read depth, and there is a systematic difference in mean whole-sequence read depth between the groups of samples I have, so I am worried that any variants I get from the multiple sequence alignment may be assembly artefacts. Hence my hope to downsample everyone's BAM files by capping the read depth at every site at a particular number. I hope I am making sense? Do you have any suggestions on how to do this better?

10.8 years ago

Seems like there ought to be a better way to do it, but the workflow I can come up with off the top of my head is:

1) calculate depth in windows across the genome (there is lots of info on doing this in other Biostars questions).
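One way to sketch step 1 with bedtools and samtools, assuming 10 kb windows, a 50x cap, and placeholder file names (genome.txt is a chromosome-name<TAB>length file; file.bam is your alignment):

```shell
# make 10 kb windows across the genome (file names are placeholders)
bedtools makewindows -g genome.txt -w 10000 > windows.bed
# samtools bedcov prints the sum of per-base depth as the last column;
# dividing by the window size gives mean depth, kept if at or below 50x
samtools bedcov windows.bed file.bam \
  | awk -v W=10000 -v MAX=50 '{ if ($NF / W <= MAX) print $1 "\t" $2 "\t" $3 }' \
  > goodregions.bed
```

The complementary set of windows (those above the cap) is what step 3 below downsamples.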

2) use samtools to grab all reads from windows below your threshold and put them in a SAM file:

samtools view -h -L goodregions.bed file.bam > newfile.sam

3) for each window that exceeds your threshold, calculate the fraction of the reads you want to keep, then downsample those regions accordingly (here keeping every 100th read):

samtools view file.bam 1:1-10000 | sed -n '0~100p' >> newfile.sam
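The keep fraction is just cap divided by observed depth, e.g. a 5000x window capped at 50x keeps 1 read in 100. As an alternative to taking every 100th read, `samtools view -s SEED.FRACTION` draws a random subsample of roughly that fraction; the depths and region below are hypothetical numbers for illustration:

```shell
# hypothetical window mean depth D=5000x, cap MAX=50x -> keep fraction 0.01
FRAC=$(awk -v D=5000 -v MAX=50 'BEGIN { printf "%.2f", MAX / D }')
# -s takes SEED.FRACTION; seed 42 plus fraction .01 keeps a random ~1% of reads
samtools view -s "42${FRAC#0}" file.bam 1:1-10000 >> newfile.sam
```

Random subsampling avoids any periodicity that taking every Nth read of a position-sorted file could introduce.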

4) sort your SAM, convert it to BAM, and index it.
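If window-level downsampling is too coarse, one rough alternative is to cap the number of reads *starting* at each position with awk on a coordinate-sorted BAM. This is a sketch, not equivalent to capping the total covering depth at a position; MAX and the file names are placeholders, and the `samtools sort -o` syntax assumes a recent samtools:

```shell
# keep at most MAX reads per (chromosome, start) pair;
# input must be coordinate-sorted so identical starts are adjacent
samtools view -h file.bam \
  | awk -v MAX=100 '/^@/ { print; next }
      { key = $3 ":" $4
        if (key == prev) n++; else { n = 1; prev = key }
        if (n <= MAX) print }' \
  | samtools sort -o capped.bam -
samtools index capped.bam
```

Because reads spanning a position can start well upstream of it, the effective covering depth can still exceed MAX; treat this as a soft cap.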
