I need to perform SNV analysis in a cohort. For that I need to find "good" regions that are reasonably well covered across a set of several hundred WES samples (sequenced with different enrichment kits).
Is there a tool that takes a set of BAM files and produces a BED file with the positions covered by at least X reads?
You could simply bin the genome into chunks of x bp and then make a count matrix with featureCounts. This gives a matrix with one column per sample and one row per region.
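A minimal sketch of that idea, in case it helps: build fixed-size bins, write them in the SAF layout that featureCounts accepts, and then filter the resulting count matrix. The function names, the `min_fraction` parameter, and the toy counts are all made up for illustration. Note also that featureCounts reports reads per bin, which is only a proxy for per-base depth.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

Bin = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open like BED


def make_bins(chrom_sizes: Dict[str, int], bin_size: int) -> Iterator[Bin]:
    """Yield fixed-size bins; the last bin of a chromosome may be shorter."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_size):
            yield chrom, start, min(start + bin_size, size)


def to_saf(bins: Sequence[Bin]) -> str:
    """Format bins as SAF text (1-based, inclusive coordinates)."""
    lines = ["GeneID\tChr\tStart\tEnd\tStrand"]
    for chrom, start, end in bins:
        lines.append(f"{chrom}:{start}-{end}\t{chrom}\t{start + 1}\t{end}\t+")
    return "\n".join(lines) + "\n"


def well_covered(bins: Sequence[Bin], counts: Sequence[Sequence[int]],
                 min_reads: int, min_fraction: float = 1.0) -> List[Bin]:
    """Keep bins where at least min_fraction of samples have >= min_reads.

    counts[i][j] is the read count of bin i in sample j (hypothetical layout,
    i.e. the count matrix with the annotation columns stripped).
    """
    kept = []
    for region, row in zip(bins, counts):
        passing = sum(1 for c in row if c >= min_reads)
        if passing >= min_fraction * len(row):
            kept.append(region)
    return kept


def to_bed(bins: Sequence[Bin]) -> str:
    """Format bins as three-column BED text."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in bins)


if __name__ == "__main__":
    bins = list(make_bins({"chr1": 25}, 10))   # toy chromosome of 25 bp
    counts = [[30, 5], [40, 25], [0, 0]]       # toy counts for two samples
    print(to_saf(bins), end="")
    print(to_bed(well_covered(bins, counts, min_reads=20)), end="")
```

The SAF text would be passed to featureCounts as the annotation, and the filter applied to the count columns of its output.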
Will try this, thanks! I tried binning into 5 bp windows with another tool and got a segfault, while e.g. 500 bp worked perfectly...
Well, 5 bp bins give roughly 600 million of them for human, so no surprise this is a bit much ;-) You can also bin only the exome, e.g. from GENCODE (so everything that is annotated), which is probably as comprehensive as it gets right now. RefSeq is generally more conservative than GENCODE. This reduces the amount of DNA you have to consider for the coverage calculation to less than 10% of the total.
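To make the arithmetic concrete, here is a rough sketch. The 3.1 Gb genome length is a round assumption, and the exon coordinates are toy values rather than real GENCODE records. The interval merge matters because exons of overlapping transcripts would otherwise be counted twice.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open

GENOME_BP = 3_100_000_000  # assumed round figure for the human genome


def n_bins(total_bp: int, bin_size: int) -> int:
    """Number of fixed-size bins needed to tile total_bp (ceiling division)."""
    return (total_bp + bin_size - 1) // bin_size


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def footprint(intervals: Iterable[Interval]) -> int:
    """Total number of bases covered after merging."""
    return sum(end - start for _, start, end in merge_intervals(intervals))


if __name__ == "__main__":
    print(n_bins(GENOME_BP, 5))    # bins when tiling the whole genome at 5 bp
    print(n_bins(GENOME_BP, 500))  # bins when tiling at 500 bp
    toy_exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
    print(footprint(toy_exons))    # bases left to bin in the toy annotation
```

Binning only the merged exon footprint instead of `GENOME_BP` is what shrinks the problem.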
Thanks, that sounds reasonable. What would you say about expanding the GENCODE exons by +/- 20 bp, or 30, to catch potential splicing-affecting variants?
Or a quarter, or simply half, of the feature size. bedtools slop can extend every feature by a given fraction, e.g. 0.25; that might be the easiest and should catch everything.
Thanks a lot! Yes, 0.25 sounds very reasonable =) will do it tomorrow!
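For reference, the two extension strategies discussed here (a fixed +/- N bp versus a fraction of the feature length) can be sketched in plain Python as below. This only mimics what bedtools slop does with `-b` (plus `-pct` for fractions) and a genome file given via `-g`. The function names are made up, and truncating the padding to an integer is an assumption about rounding, not a statement about bedtools internals.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def _clamp(chrom: str, start: int, end: int,
           chrom_sizes: Dict[str, int]) -> Interval:
    """Keep an interval inside [0, chromosome length]."""
    return chrom, max(0, start), min(chrom_sizes[chrom], end)


def slop_fixed(intervals: Iterable[Interval], chrom_sizes: Dict[str, int],
               pad: int) -> List[Interval]:
    """Extend every interval by pad bp on both sides."""
    return [_clamp(c, s - pad, e + pad, chrom_sizes) for c, s, e in intervals]


def slop_fraction(intervals: Iterable[Interval], chrom_sizes: Dict[str, int],
                  fraction: float) -> List[Interval]:
    """Extend every interval on both sides by fraction * its own length."""
    out = []
    for chrom, start, end in intervals:
        pad = int((end - start) * fraction)  # assumption: truncate to integer
        out.append(_clamp(chrom, start - pad, end + pad, chrom_sizes))
    return out


if __name__ == "__main__":
    sizes = {"chr1": 1000}                      # toy chromosome length
    exons = [("chr1", 100, 200), ("chr1", 950, 1000)]
    print(slop_fixed(exons, sizes, 20))
    print(slop_fraction(exons, sizes, 0.25))
```

Keep in mind that a fractional pad adds only a few bases to short exons and a lot to long ones, whereas a fixed pad treats every splice site the same. Extended features can also overlap, so merging them before counting avoids double-counting.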