I am new to RNA-seq data analysis and trying to understand how things work, please excuse me if I sound naive.
I have 1X150 bp Hiseq 3000 reads and the sequence quality looks good in FastQC. There was almost zero adapter contamination (few libraries showed 0.1% adapter sequence), so we proceeded without any adapter trimming. We used the most recent reference genome available in the database and average mapping percentage was 38% with TopHat. We got pretty good number of differentially expressing transcripts using cufflinks and most of them looked quite relevant.
Later, we realized that the mapping percentage looks too low, so shifted to Hisat2. After running the same untrimmed data with the same reference genome, there is considerable improvement in the mapping percentage (average is 55%).
My first question is why has the mapping percentage changed so much when I am still using the same input data (without any trimming) and same reference genome and secondly is it okay to proceed with this Hisat2 data or should I trim the adapters and run the Hisat2 again before using cufflinks?
Thanks in advance!!
Newer aligners will tend to perform better so it is not surprising that you are getting better results with HISAT2. Still 55% alignment is low so it would be worth checking a few of non-aligning reads by
blast+
at NCBI to see what they are aligning to. It is possible that you could have some contamination in your data or simply rRNA, that you can't use anyway.Thanks for your input genomax. This experiment was about host-pathogen interaction. I tried aligning the unmapped reads to the pathogen reference genome and the average percent was ~1%, still too much left over. Also, I performed Ribo-minus before the library preparation and the rRNA contamination was almost negligible. The FastQC shows this for the overrepresented sequences (top 2-an example):
The first sequence is Papaver sominferum 5.8s rRNA (NCBI BLAST) and the second one as well is a 5.8s rRNA, so I also suspected rRNA contamination but the percentage is not too much. I agree with your suggestion to perform Blast for all the unmapped reads.
I performed Blast with a set of unmapped reads and majority of them hit against rRNA. So, that means even after performing Ribo-minus, there is still rRNA contamination in my data. Thanks genomax for pointing that out.
Can you paste the HISAT2 alignment stats here? Meanwhile , you can understand the summary statistics here
What is the percentage of uniquely mapped reads?
EDIT#1
See this related post
Too low mapping percentage using HISAT2 on human reference genome.
Thanks Vijay for your input.
We were initially worried about the higher percentage of multiple aligned reads but our plant is an auto-tetraploid, may be that's one of the reason. I would like to know your interpretation of these stats. Thanks
Is the reference you are aligning to, a haploid copy?
I just looked at the reference article and it is mentioned that they used a "homozygous doubled-monoploid" to sequence and assemble the genome.
Still 50 % data mapped multiple times. Did you check for contamination?
Haven't checked the mapped reads yet.
It is not given the new software performs better than old/previous one, in general. However, the authors of the software (Tophat and HISAT2) recommended HISAT2 in place of Tophat. Given that, the assumption/expectation is that HISAT2 performs better and faster compared to Tophat, You can contact the HISAT2 development group to know better and in detail.
In general, reviewers expect peers to be aware of newer development in the respective fields and use them. Bioinformatics is no different. My free suggestion is to go with Hisat2 instead of Tophat. If there is no adapter contamination (as mentioned in OP), there is no point in doing it.
It would be interesting (to the community here) if you could post some comparisons between Hisat2 vs tophat which can be referred in future posts if some one else comes here with the same question. However that is not necessary and is at your discretion.
Instead of Cufflinks work flow, you can also look at stringtie, ballgown workflow. In my understanding, tophat-cufflinks-cummerbund are supposed to tuxedo suite protocol 1 and hisat2-stringtie-ballgown is tuxedo suite protocol 2.
We are also preferring to use Hisat2. I will surely post a general comparison of mapping percentage between TopHat and Hisat2. I am using Tuxedo protocol1 for the time being but will surely try protocol2 as well to see if there is any difference.
Thanks
What options did you use? Did you take care of applying the same rules for reporting an alignment?
Hi Martombo, both TopHat AND Hisat2 were performed at default parameter settings.