Hello,
in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:
Counting with cufflinks outputs a Fragments Per Kilobase Per Million mapped fragments value (FPKM) for each transcript. The FPKM value basically accounts for library size and also the length of the transcript comprising all the annotated exons + some additional likelihood estimator to assign reads (see here).
Counting mapped reads with bedtools and divide a transcript's mapped count by the sum of all the exon lengths. This gained a length normalized expression estimate to compare between genes.
However, the correlation of (1.) and (2.) is always around ~0.65 between same tissues (technically the same experiment). I would expect this correlation to be > 0.9.
Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression.
Can someone shed light on this? Is this due to the library normalization? If I calculate correlation for the outputs of the two different methods on the same experiement, I think this should outrule this normalization, or not?
smoothScatter( log10(fpkm + 1), log10(bedtools + 1), colramp=tim.colors, xlab="FPKM", ylab="Bedtools with length norm")
when you do the bedtools analysis, what is the input alignment file?
@Istvan Albert: I use the -abam flag to provide the accepted_hits.bam file from the tophat output