Hi, I'm analyzing several RNA-seq samples downloaded from ENCODE. In some BAM files, an extremely large number of reads map to just a few chromosomes or unplaced/unlocalized contigs. For example, in ENCFF754JEN there are about 68M, 35M, and 34M reads on chr21, chr22_KI270733v1_random, and chrUn_GL000220v1, respectively, while the largest chromosome, chr1, has only about 19M reads.
Why are the reads distributed so unevenly across chromosomes, and what should I do to avoid getting biased results from these BAM files? Thanks!
$ samtools idxstats ENCFF754JEN.bam | cut -f 1,3 | sort -k2,2nr
chr21 68288192
chr22_KI270733v1_random 35290528
chrUn_GL000220v1 33978632
chr1 18807868
chr6 11533310
chr11 11474490
chr19 11069604
chr12 9934778
chr2 9306572
chr17 9227112
chr7 8790550
chr16 8079172
...
chrUn_GL000220v1 contains a complete 45S rRNA gene, so I would guess you are right on point.
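If you want to put a number on it, here is a rough sketch that reports what percentage of mapped reads land on a given set of contigs. The function name contig_share is made up for illustration, and which contigs to pass is your call: only chrUn_GL000220v1 is confirmed above as carrying rRNA, so treating chr22_KI270733v1_random the same way is an assumption you would need to check against the annotation.

```shell
# Hypothetical helper (not part of samtools): reads `samtools idxstats`
# output on stdin (columns: name, length, mapped, unmapped) and prints the
# percentage of mapped reads that fall on the contigs given as arguments.
contig_share() {
    awk -v list="$*" '
        BEGIN {
            # Build a lookup table from the space-separated contig names.
            n = split(list, names, " ")
            for (i = 1; i <= n; i++) want[names[i]] = 1
        }
        {
            total += $3                 # column 3 = mapped reads
            if ($1 in want) hit += $3
        }
        END {
            if (total > 0) printf "%.1f\n", 100 * hit / total
            else print "0.0"
        }
    '
}
```

You would call it as: samtools idxstats ENCFF754JEN.bam | contig_share chrUn_GL000220v1 chr22_KI270733v1_random. If the share is large, it is probably worth leaving those reads out of the total when you normalize, but that depends on what you are doing downstream.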