QUESTION
I have a bed file & alignment file (bam).
I'd like to know if the reads contained inside the bam are biased towards GC rich regions.
RESEARCH
Picard CollectGcBiasMetrics does a great job with whole genome sequencing but doesn't take bed files as input.
Picard HsMetrics can tell me GC % & coverage per region but not all regions are the same size.
I thought about taking the original bedfile and creating a new bed file of sliding 100 base pair windows with bedtools makewindows and then analyzing with Picard HsMetrics. The problem with that is if I use a window of 100 bp and step of 10 bp then large capture regions will have more windows than small capture regions.
For example:
ORIGINAL BED FILE ----> NEW BED FILE
100 bp region ----> 1 window of 100 base pairs, 1x coverage in new bed file
1000 bp region ----> 90 windows of 100 base pairs. 10x the size but 90x as many windows.
If I pad things out then there will be windows that have 0 depth of coverage by the bam file and these will be clustered around areas with few base pairs (small capture regions).
Hello jnowacki ,
could you please explain what you hope to find out and why the output of HsMetrics isn't suitable for your needs?
Thanks.
fin swimmer
I'm trying to detect GC bias. The problem with the normal HsMetrics output is not all of the bed regions are the same size so any binning/graphing will heavily weight the results towards whatever GC content is contained in average of the smaller regions.