Dear All,
I am working on getting the read counts for some genes. Firstly, I aligned my trimmed RNASeq reads with human and mouse genome to get alignment file (accepted_hits.bam) in bam format using tophat. I provided htseq-count tool with two different bam files (sorted and unsorted).
Case1: I used the accepted_hits.bam from tophat directly as the input to the htseq-count and counted the reads for each gene.
$ htseq-count -f bam -s yes -i gene_id accepted_hits.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt
Note: By default accepted_hits.bam
file is sorted based on coordinates by tophat.
Case2: I sorted the accepted_hits.bam
file based on "NAMES" using samtools. Then provided as input to the htseq-count and counted the reads for each gene.
$ samtools sort -n accepted_hits.bam accepted_hits_sorted.bam
$ htseq-count -f bam -r name -s yes -i gene_name accepted_hits_sorted.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt
Unsorted-Bam Sorted-Bam
Globin Genes Stranded: YES Stranded: YES
HBB 36472 19800
HBA1 20398 4249
HBA2 144174 47553
HBG1 2825 2384
HBG2 2638 1942
HBD 4 3
HBE1 0 0
HBZ 0 0
HBQ1 3 2
MB 0 0
CYGB 2 1
NGB 2 1
Total Globin 206,518 75,935
__no_feature 94,038,451 51,420,979
__ambiguous 46,175 56,695
__too_low_aQual 0 0
__not_aligned 0 0
__alignment_not_unique 32,018,089 16,952,589
Total reads 126,969,861 68,900,778
% of Globin_reads 0.1626 0.1102
Can anyone explain why the total reads value is low for sorted bam compared to unsorted bam?
Thanks Devon for your answers. I submitted the jobs for htseq-count with -r pos parameters. Currently, the jobs are running. I have another query,
Why the reads count 68,900,778 from htseq-count is not matching to either the trimmed reads count (66,113,117 trimmed reads pairs (R1 and R2))
or
the aligned reads count (Unique reads: 56,594,380 or 131,374,951) from bam file?
Tophat may output multiple alignments for a given read/pair. htseq-count won't include them in any real counts (they'll be in "__alignment_not_unique"). You don't want to count reads, but fragments, so there are ~57 million of them.
Yeah that make sense, there are ~57 million unique fragments. So in that case, a fragment can be single read or combination of multiple reads.
The below details are from this post.
2) After aligning the trimmed reads using tophat
TopHat Output align summary:
TopHat Output Prep_reads_info:
Both left reads input and Left_read_in have 66,113,117 sample value,
Questions:
left_reads_out/right_reads_out are what remain after tophat filters for length. You can google "concordant alignment" for the rest of your question.