7.9 years ago
Jautis
Hello,
I have samples with 25-30x coverage and I'd like to randomly downsample each of them to 2x. The closest I've found is samtools view -s FRACTION -b SAMP.bam,
which lets me downsample to a set proportion of the original data. Is there a general way I can supply the genome (fasta) and downsample the BAM file to a desired depth?
Thanks!
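One way to bridge the gap: measure the current mean depth, then turn the target/current ratio into the fraction for samtools view -s. A minimal sketch, assuming samtools is on PATH; the file names, seed, and depths are placeholders, not from the thread:

```shell
# Mean genome-wide depth (-a counts zero-depth positions too)
CURRENT=$(samtools depth -a SAMP.bam | awk '{sum+=$3; n++} END{printf "%.2f", sum/n}')
TARGET=2
# Fraction of reads to keep
FRAC=$(awk -v c="$CURRENT" -v t="$TARGET" 'BEGIN{printf "%.4f", t/c}')
# -s takes SEED.FRACTION; with seed 42 and 25x input this becomes e.g. 42.0800
samtools view -s "42${FRAC#0}" -b SAMP.bam > SAMP.2x.bam
```

Fixing the seed makes the subsample reproducible; changing it gives independent replicates at the same depth.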
see : Capping coverage in bam file
Thanks, but I'm not interested in capping the coverage in parts of the genome. I'd like a BAM that represents what the data would have looked like if we had only sequenced to 2x, so we can gauge how much sequencing data we actually need for a new method. That's why I want an average of 2x global depth, not something biased toward representing all regions equally.
I see: so using wgsim (distributed alongside samtools) with -N (number of read pairs) chosen so the simulated bases come to about 2x the genome size, i.e. N = depth * genomeSize / (2 * readLength), and then mapping the reads with BWA.
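For what it's worth, the -N arithmetic sketched out; the genome size, read length, and file names below are assumptions, not from the thread. Note this simulates fresh reads from the reference rather than subsampling the real data, so it loses the real error and bias profile:

```shell
G=3000000000    # assumed genome size (bp)
DEPTH=2
READLEN=100
# depth = N * 2 * readLength / G, so N = depth * G / (2 * readLength)
N=$(( DEPTH * G / (2 * READLEN) ))
wgsim -N "$N" -1 "$READLEN" -2 "$READLEN" ref.fa sim_1.fq sim_2.fq
bwa mem ref.fa sim_1.fq sim_2.fq | samtools sort -o sim.2x.bam -
```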
You can't take away information and expect to learn something new :P
Whatever you use to remove reads via downsampling will not match reality perfectly. For example you might take every other read, or you might make a function that removes a read with X% probability, or you might have it somehow related to GC%, etc. But at the end of the day you don't know what is going to happen the next time you do your sequencing. Otherwise, people would be able to do the reverse (less sequencing and apply this magic function to emulate doing more sequencing). If only.
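The "remove a read with X% probability" idea can be sketched as an awk filter over a SAM stream; the keep-probability and seed here are made up, and note that, unlike samtools view -s, this drops the two mates of a pair independently:

```shell
# Keep header lines; keep each alignment line with probability 0.08
samtools view -h SAMP.bam \
  | awk 'BEGIN{srand(42)} /^@/ {print; next} rand() < 0.08' \
  | samtools view -b -o SAMP.sub.bam -
```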
Also I'd say 2x depth, not 2x coverage (although usage is fairly 50:50). Coverage is inherently a binary thing: you're either covered or you're not. Depth is inherently a scale. But I'm not correcting you or anything - it's just a meme war that's being subconsciously fought, and I'd rather depth won out over coverage. I'm pro-depth.
Obviously you can't take away data and learn something new about the biology.
I'm interested in downsampling so I know what level of sequencing is needed for future sequencing runs. We needed a few samples at very deep coverage, but for the rest we think somewhere between 1x and 4x will be sufficient. So it's a matter of downsampling the high-depth data and seeing to what extent the results stay consistent, to determine how much depth the remaining samples need.
Sorry I didn't reply until now; I didn't get an alert or anything.
Well, if you can't take away data and learn something new, what do you expect to learn from downsampling? There are instances, particularly in statistics, where downsampled data can be very useful (for generating models to test on the rest of the data, for example). However, for trying to figure out how to go from an average genome-wide depth of 20x to 2x, surely you just need to sequence 10x fewer reads? I presume you will test a peak-caller or something on the downsampled data to see if it still generates good-enough results, but I'm not confident that will represent reality.
There's only one way to know for sure, and it sounds like you're in the perfect position to test it :) Create a downsampled data file with reformat.sh (or if you prefer a different downsampling method I'll write you one in Python), and then compare its depth/coverage to some new sequencing that's actually only 2x deep. Would be really interesting, particularly if you're doing SNP calling or something.
You should be able to use reformat.sh from the BBMap suite to downsample. Read the in-line help to familiarize yourself with the options.
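A possible invocation, assuming BBMap (and samtools, which reformat.sh uses for BAM I/O) is installed; the genome size and file names are placeholders. reformat.sh's samplebasestarget caps the total bases kept, which maps directly onto a target depth:

```shell
GENOME_SIZE=3100000000    # assumed genome size (bp)
TARGET_DEPTH=2
# target bases = desired depth * genome size
TARGET_BASES=$(( TARGET_DEPTH * GENOME_SIZE ))
reformat.sh in=SAMP.bam out=SAMP.2x.bam samplebasestarget="$TARGET_BASES" sampleseed=42
```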