Hello everyone!
I am working with Tophat2 and HTSeq-count until now but, I want to use HISAT2 and learn about it. For that, I have been trying and working with this tool this week. Everything was fine until I tried to calculate the count from the SAM files using HTSeq-count. When I used this tool for the count, all the transcript counts in the output files are 0. This represents a real problem for me because I need to use HTSeq-count to keep the same distribution than the data that I already have.
So, Is it possible to use these two tools together? or Is there another tool that mantain the same distribution in the output file that HTSeq-count but works with HISAT2?
Both HISAT2 and HTSeq-count are in their last version.
Thank you in advance for your time.
The first thing that you should check is if the chromosome identifiers in your fasta genome used for creating the index are the same as in your gtf used for counting. Most common mistake.
I am using the release 89 of the human genome for both the fasta and the gtf files. The FASTA file is the Homo_sapiens.GRCh38.cdna and the GFT is the Homo_sapiens.GRCh38.89
You aligned to the cdna? That's not right, you should align to the genome. The GTF doesn't correspond to the cdna fasta, only to the genome fasta.
Ok, I though that the CDNA is used with tools like HISAT2, SALMON, KALLISTO,etc... So I'll go to try with the whole genome such as in TOPHAT2. I hope this fix my problem. Thank you!
The CDS FASTA transcriptome is indeed used for Kallisto and Salmon, but not HISAT2
I just aligned the sample with the whole genome, but the problem persist. I have zeros in all the counts of each transcript. This is the message that I obtain when I create the SAM file before the count file:
And this is the output of htseq-count:
I am using the Homo_sapiens.GRCh38.90 for both the FASTA and GTF files. Is it possible that the version of the genome should be the problem?
Why do you need to use HISAT2? - to identify novel transcripts?
What I would do is the following:
I am working integrating microarray gene expression with RNA-seq gene expression. For that, I use tophat with htseq-count for the read counts. The problem is that tophat2 is very slowly, for that reason I want to accelerate the pipeline with Hisat2.
HTSeq-count is the only tool that maintain the gene expression values in the same range and distribution than microarray, at least as far as I know. For that reason I need solve this problem.
Thank you again for your supporting!
...but, TopHat2 / HISAT2 are mainly designed for de novo transcript identification and de novo splice isoform identification. They can also be used for standard RNA-seq differential expression analysis, of course.
HTseq does not produce counts on the same distribution as microarray. HTseq produces expression values that follow a negative binomial distributon, whilst microarray expression values are binomially-distributed log (base 2) values.
In my opinion, it is not a good idea to attempt to merge RNA-seq data with microarray data - sorry / lo siento. It would be better to use one dataset as the training and the other as a validation.
Kevin
A large fraction of your reads do not appear to align. You should investigate that first (grab a sample of those reads and blast at NCBI to see if you have a problem of contamination of some sort) before worrying about 0 counts.