Difference in htseq-count values for sorted bam and unsorted bam by name
1
1
Entering edit mode
9.1 years ago
nalandaatmi ▴ 110

Dear All,

I am working on getting the read counts for some genes. Firstly, I aligned my trimmed RNASeq reads with human and mouse genome to get alignment file (accepted_hits.bam) in bam format using tophat. I provided htseq-count tool with two different bam files (sorted and unsorted).

Case1: I used the accepted_hits.bam from tophat directly as the input to the htseq-count and counted the reads for each gene.

$ htseq-count -f bam -s yes -i gene_id accepted_hits.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt

Note: By default accepted_hits.bam file is sorted based on coordinates by tophat.

Case2: I sorted the accepted_hits.bam file based on "NAMES" using samtools. Then provided as input to the htseq-count and counted the reads for each gene.

$ samtools sort -n accepted_hits.bam accepted_hits_sorted.bam
$ htseq-count -f bam -r name -s yes -i gene_name accepted_hits_sorted.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt
                           Unsorted-Bam      Sorted-Bam
Globin Genes               Stranded: YES     Stranded: YES
HBB                        36472             19800
HBA1                       20398             4249
HBA2                       144174            47553
HBG1                       2825              2384
HBG2                       2638              1942
HBD                        4                 3
HBE1                       0                 0
HBZ                        0                 0
HBQ1                       3                 2
MB                         0                 0
CYGB                       2                 1
NGB                        2                 1
Total Globin               206,518           75,935
__no_feature               94,038,451        51,420,979
__ambiguous                46,175            56,695
__too_low_aQual            0                 0
__not_aligned              0                 0
__alignment_not_unique     32,018,089        16,952,589
Total reads                126,969,861       68,900,778
% of Globin_reads          0.1626            0.1102

Can anyone explain why the total reads value is low for sorted bam compared to unsorted bam?

RNA-Seq htseq-count sequencing SNP • 9.8k views
ADD COMMENT
4
Entering edit mode
9.1 years ago

The name sorted results are correct, the coordinate ones are not because you lied to htseq-count and told it you had a name-sorted file (it's the default, you don't need to say -r name) instead of the coordinate-sorted file that you do have (accepted_hits.bam is coordinate sorted by default). Try adding -r pos with accepted_hits.bam and you'll likely get more or less identical results.

ADD COMMENT
0
Entering edit mode

Thanks Devon for your answers. I submitted the jobs for htseq-count with -r pos parameters. Currently, the jobs are running. I have another query,

Why the reads count 68,900,778 from htseq-count is not matching to either the trimmed reads count (66,113,117 trimmed reads pairs (R1 and R2))

or

the aligned reads count (Unique reads: 56,594,380 or 131,374,951) from bam file?

$samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
Unique Reads: 56,594,380
$samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951
ADD REPLY
0
Entering edit mode

Tophat may output multiple alignments for a given read/pair. htseq-count won't include them in any real counts (they'll be in "__alignment_not_unique"). You don't want to count reads, but fragments, so there are ~57 million of them.

ADD REPLY
0
Entering edit mode

Yeah that make sense, there are ~57 million unique fragments. So in that case, a fragment can be single read or combination of multiple reads.

The below details are from this post.

2) After aligning the trimmed reads using tophat

TopHat Output align summary:

Left reads: (Left reads alone aligned)
          Input     :  66,113,117
           Mapped   :  54,521,571 (82.5% of input)
            of these:   4,509,697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads: (Right reads alone aligned)
          Input     :  66,113,117
           Mapped   :  53,610,826 (81.1% of input)
            of these:   4,412,079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate. (is average of 82.5% and 81.1%)

Aligned pairs:  51,538,017 (Both left and right reads as a pair aligned)
     of these:   4,275,585 ( 8.3%) have multiple alignments
                  780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate. 

Left reads Input - Mapped reads = 11,591,546 reads (i.e 17.5% is not mapped from left)

Right reads Input - Mapped reads = 12,502,291 reads (18.9% is not mapped from right)

TopHat Output Prep_reads_info:

left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886
right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354

left_reads_in - left_reads_out = 33,231 reads (what happened to these reads?)

right_reads_in - right_reads_out= 42,763 (what happened to these reads?)

Both left reads input and Left_read_in have 66,113,117 sample value,

Questions:

  • What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?
  • What is the difference between right_reads_out (66,070,354) and right reads (53,610,826)?
  • What is the difference between 81.8% overall read mapping rate and 76.8% concordant pair alignment rate?
ADD REPLY
0
Entering edit mode

left_reads_out/right_reads_out are what remain after tophat filters for length. You can google "concordant alignment" for the rest of your question.

ADD REPLY

Login before adding your answer.

Traffic: 2184 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6