Hi all,
my goal today is to build some custom statistics along the human reference genome. Then I will be able to make comparisons with the content of a given BAM file, for instance.
Let's take a simple & concrete example :
Compute every 50bp the GC content in a 100bp window (overlapping sliding window)
So here, I would need to get a 100bp sequence each 50bp along the reference, and then compute my GC content, and store it somewhere. But how do I efficiently scan the human fasta reference ?
What libraries/tools do exist to accomplish such a task ? (samtools, Picard APIs ?)
Thanks, Tony
Thanks Aaron. Yes, I was digging into BEDTools as well. Just 3 quick clarifications please :) 1 - No particular issues if I use GRC37 reference instead of UCSC hg19 ? 2 - Is there an BEDTools API or it's just command line tools ? 3 - What does BEDTools use inside to scan the fasta ref :) ?
There is an C++ API, but it is admittedly not yet well-documented, though the code themselves illustrate the usage. The pybedtools Python library provides the best API for programming around the functionality present in BEDTools.
"No particular issues if I use GRC37 reference instead of UCSC hg19". Not that I know of. The only issues that I can think of are the way the chromosomes are named must be consistent (e.g., 1 != chr1) and the reported chrom sizes much match what is in your FASTA file.
OK, Thank you very much.