bam2rpkm by bedtools
From alignment files (BAM format), multiple tools (htseq-counts, featureCounts) can count the number of reads located in a specific genomic regions(genes/exons/introns/ur/promoters/enhancers) based on the annotation files.
there are always the case others need RPKM values, instead of raw read counts, and you probably don't want to search public tools again and again.
I will show you a easy way to use bedtools
to solve that problem.
I had to mention that PPKM value is not a good way to do downstream analysis: "Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples". Wagner GP, Kin K, Lynch VJ. _Theory Biosci._ 2012. PubMed
Firstly create the coordinate of all of the exons
Download the file: CCDS.20160908.txt
, you can choose the versions and species you need.
cat CCDS.20160908.txt |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed
head exon_probe.hg38.gene.bed
The results from the script above would be :
chr1 69090 70007 OR4F5
chr1 450739 451677 OR4F29
chr1 685715 686653 OR4F16
chr1 925941 926012 SAMD11
chr1 930154 930335 SAMD11
chr1 931038 931088 SAMD11
chr1 935771 935895 SAMD11
chr1 939039 939128 SAMD11
chr1 939274 939459 SAMD11
chr1 941143 941305 SAMD11
calculate RPKM value from bam and gtf in batch
The formula of RPKM like below:
C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = exon length in base-pairs for a gene
Equation = RPKM = (10^9 * C)/(N * L)
So the script is :
bed=exon_probe.hg38.gene.bed
for bam in /home/project/*.bam
do
file=$(basename $bam )
sample=${file%%.*}
echo $sample
export total_reads=$(samtools idxstats $bam|awk -F '\t' '{s+=$3}END{print s}')
echo The number of reads is $total_reads
bedtools multicov -bams $bam -bed $bed |perl -alne '{$len=$F[2]-$F[1];if($len <1 ){print "$.\t$F[4]\t0" }else{$rpkm=(1000000000*$F[4]/($len* $ENV{total_reads}));print "$.\t$F[4]\t$rpkm"}}' >$sample.rpkm.txt
done
You may notice that the output should be 3 columns, first column is just the number of rows, second column is the number of reads for each exon, the last column is the RPKM value.
head blood.rpkm.txt
1 3538 40.5143632023362
2 1756 19.6581297588076
3 1793 20.0723386432472
4 102 15.0855916359498
5 75 4.35114155895529
6 24 5.04036238189381
7 97 8.21430025275033
8 132 15.5741534272
9 81 4.59762784834908
10 66 4.27808535500246
The reason why I write this script is to solve a small bug in conifer(COpy Number Inference From Exome Reads), I will explain it later.
Hi, Thank you for this. How can I get RPKM values per gene (name/symbol)? Thank you. Pradeep