I was working on bam files generated from Tophat in order to convert into the raw count file.
Firstly, I just converted directly bam file into the raw count file using the following command.
htseq-count -f bam KDR_pre_thout/accepted_hits.bam genes.gtf > KDR_pre_raw_count.txt
which directly converts the bam file into the raw count file.
But I found the following paper, which suggest that converting the bam file into the sam file and then, convert sam file into the raw count file.
I thought it would be same.. but NOT.
samtools sort -n KDR_pre_thout/accepted_hits.bam KDR_pre_sn
samtools view -o KDR_pre_sn.sam KDR_pre_sn.bam
htseq-count -s no -a 10 KDR_pre_sn.sam gene.gtf > KDR_pre_sn.count
I have used these two ways in order to convert the bam file into the count file.
And I found that these results are different. Which I have to follow?? Second as paper mentioned(maybe sort is the important difference) Then, why??? Anybody knows why sort is necessary??
Please help me with this.!
Thanks.
Hello Devon
Thank you for your comments!. so you are mentioning that it should be same whether I have followed the first trial (directly convert from bam to count file without sort) or the second trail ( convert bam to sam and then sam to count file with sorting option).
Since result difference is relatively big ( as you mentioned if it is a bug) , htseq-count should be fixed. ( I hope Simon Anders could see this posting!! )
BTW, I just realized that my one of option for htseq-count was different, which was -s no option...
THe default was "yes". I didn't set up this option for the first trial, which implicitly means that "yes" but my second trial, it was set as "no".
I suspect that it would also cause the difference. so changed it.. as third trial, but all my three trial return the different count file, which is weired.!! hm...
option 1)
option 2)
option 3)
All the results from three trial looks different.
Could you please let me know what is "strand-specific assay" means???
Given that you switched between the strand-specific and unstranded counting as well as changing the minimum quality score for option 1, it's completely expected that you'd get different results. You need to do an apples-to-apples comparison for the results to be meaningful.
Regarding strand-specific libraries, I and others have explained this many many times. Here is one example. Just google around for more explanations.
Hello Devon
Thank you so much for your explanation!! It really helped!!! I will try to search and google more carefully!!! :)
BTW, as I reported before, given the exact same option (strand specific option, quality option), but still it returns the large difference in result when I try to do the sort the bam file or not.. I hope it can be fixed soon!! :)
did this issue fixed? or that you finally figured out why it is like that....