Question

Normalize bedtools gene coverage by total number of reads

1

Entering edit mode

7.6 years ago

David ▴ 240

Hi, I have run a Hiseq 2x150bp from metagenomics samples.

I have computed the gene coverage for a list of genes for a given sample. For each sample i take a subset of genes i´m interested in and mapped all reads to each of these genes generating a bam file that I use as input to bedtools as follows

bedtools genomecov -ibam subsetgenes_sample_A.bam
bedtools genomecov -ibam subsetgenes_sample_B.bam
bedtools genomecov -ibam subsetgenes_sample_C.bam
....

Here below the output for one sample. I have added column names for clarity.

           V1           V2      V3       V4        V5
        k141_25_1       1       64      165     0.387879
        k141_25_1       2       100     165     0.606061
        k141_25_1       3       1       165     0.00606061
        k141_34_1       2       11      726     0.0151515
        k141_34_1       3       55      726     0.0757576
        k141_34_1       4       36      726     0.0495868
        k141_34_1       5       101     726     0.139118
        k141_34_1       6       123     726     0.169421
        k141_34_1       7       195     726     0.268595

In order to compute gene coverage i run coverage = (V2*V3)/V4. The problem i have is that i have several samples so ideally i should normalize each sample by the total number of reads and multiply by 1M.

Would it make sense to normalize as follows normalized_genes = coverage(from above formula)/totalReadspersample * 1000000

Thanks for comments.

bedtools normalization metagenomics • 5.0k views

ADD COMMENT • link 7.5 years ago by David ▴ 240

0

Entering edit mode

Thanks for your comments Carlo, Can you generate coverage using FeatureCounts from bam files ??

ADD REPLY • link 7.5 years ago by David ▴ 240

0

Entering edit mode

Ok found it coverageCounts. Great I´ll normalize with Deseq2 , thanks

ADD REPLY • link 7.5 years ago by David ▴ 240

0

Entering edit mode

Hi, Couldn´t manage to get genes coverage using coverageCounts. It returns binary files. How would you get the gene coverage as numeric ?

ADD REPLY • link 7.5 years ago by David ▴ 240

0

Entering edit mode

You can get the raw counts for your set of specific genes when you have Bam or SAM files. Later you can load the raw counts inot R and normalize using DESeq2 or EdgeR

ADD REPLY • link 7.5 years ago by EVR ▴ 610

score 1 · Answer 1 · 2017-05-22

Hi,

It looks like you want to compare the expression of a few genes between conditions. While it is a good start, I think there are a few issues with your method :

bedtools is very useful but more specific tools, such as HTseqCounts or featureCounts, are better to count reads on genes. This involves less steps than with your method and it takes care properly of ambiguous reads.
RPKM is a poor 'between samples' normalization method. Most people here recommend either DESeq2 or edgeR instead.

Hope this helps,

Carlo