Question

Low mapping percentage with TopHat

0

Entering edit mode

7.0 years ago

Sapi • 0

I am new to RNA-seq data analysis and trying to understand how things work, please excuse me if I sound naive. I have 1X150 bp Hiseq 3000 reads and the sequence quality looks good in FastQC. There was almost zero adapter contamination (few libraries showed 0.1% adapter sequence), so we proceeded without any adapter trimming. We used the most recent reference genome available in the database and average mapping percentage was 38% with TopHat. We got pretty good number of differentially expressing transcripts using cufflinks and most of them looked quite relevant. Later, we realized that the mapping percentage looks too low, so shifted to Hisat2. After running the same untrimmed data with the same reference genome, there is considerable improvement in the mapping percentage (average is 55%).
My first question is why has the mapping percentage changed so much when I am still using the same input data (without any trimming) and same reference genome and secondly is it okay to proceed with this Hisat2 data or should I trim the adapters and run the Hisat2 again before using cufflinks? Thanks in advance!!

rna-seq • 2.8k views

ADD COMMENT • link 7.0 years ago by Sapi • 0

1

Entering edit mode

Newer aligners will tend to perform better so it is not surprising that you are getting better results with HISAT2. Still 55% alignment is low so it would be worth checking a few of non-aligning reads by blast+ at NCBI to see what they are aligning to. It is possible that you could have some contamination in your data or simply rRNA, that you can't use anyway.

ADD REPLY • link 7.0 years ago by GenoMax 153k

0

Entering edit mode

Thanks for your input genomax. This experiment was about host-pathogen interaction. I tried aligning the unmapped reads to the pathogen reference genome and the average percent was ~1%, still too much left over. Also, I performed Ribo-minus before the library preparation and the rRNA contamination was almost negligible. The FastQC shows this for the overrepresented sequences (top 2-an example):

   Overrepresented sequences
    Sequence                                                                                                                  Count Percentage  Possible Source
    CTCGGCTCTCGCATCGATGAAGAACGTAGCGAAATGCGATACTTGGTGTG   362979  1.4301295212831344 No Hit
    CCCAGGCAGACGTGCCCTCGGCCTAATGGCTTCGGGCGCAACTTGCGTTC  265085  1.0444292483844513  No Hit

The first sequence is Papaver sominferum 5.8s rRNA (NCBI BLAST) and the second one as well is a 5.8s rRNA, so I also suspected rRNA contamination but the percentage is not too much. I agree with your suggestion to perform Blast for all the unmapped reads.

ADD REPLY • link updated 7.0 years ago by GenoMax 153k • written 7.0 years ago by Sapi • 0

0

Entering edit mode

I performed Blast with a set of unmapped reads and majority of them hit against rRNA. So, that means even after performing Ribo-minus, there is still rRNA contamination in my data. Thanks genomax for pointing that out.

ADD REPLY • link 7.0 years ago by Sapi • 0

0

Entering edit mode

Can you paste the HISAT2 alignment stats here? Meanwhile , you can understand the summary statistics here

What is the percentage of uniquely mapped reads?

EDIT#1

See this related post

Too low mapping percentage using HISAT2 on human reference genome.

ADD REPLY • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Thanks Vijay for your input.

Here is an example of my Hisat2 alignment stats:
36006390 reads; of these:
  36006390 (100.00%) were unpaired; of these:
    13250438 (36.80%) aligned 0 times
    4644716 (12.90%) aligned exactly 1 time
    18111236 (50.30%) aligned >1 times
63.20% overall alignment rate

We were initially worried about the higher percentage of multiple aligned reads but our plant is an auto-tetraploid, may be that's one of the reason. I would like to know your interpretation of these stats. Thanks

ADD REPLY • link updated 7.0 years ago by GenoMax 153k • written 7.0 years ago by Sapi • 0

0

Entering edit mode

our plant is an auto-tetraploid

Is the reference you are aligning to, a haploid copy?

ADD REPLY • link 7.0 years ago by GenoMax 153k

0

Entering edit mode

I just looked at the reference article and it is mentioned that they used a "homozygous doubled-monoploid" to sequence and assemble the genome.

ADD REPLY • link 7.0 years ago by Sapi • 0

0

Entering edit mode

Still 50 % data mapped multiple times. Did you check for contamination?

ADD REPLY • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Haven't checked the mapped reads yet.

ADD REPLY • link 7.0 years ago by Sapi • 0

0

Entering edit mode

My first question is why has the mapping percentage changed so much when I am still using the same input data (without any trimming) and same reference genome

It is not given the new software performs better than old/previous one, in general. However, the authors of the software (Tophat and HISAT2) recommended HISAT2 in place of Tophat. Given that, the assumption/expectation is that HISAT2 performs better and faster compared to Tophat, You can contact the HISAT2 development group to know better and in detail.

secondly is it okay to proceed with this Hisat2 data or should I trim the adapters and run the Hisat2 again before using cufflinks?

In general, reviewers expect peers to be aware of newer development in the respective fields and use them. Bioinformatics is no different. My free suggestion is to go with Hisat2 instead of Tophat. If there is no adapter contamination (as mentioned in OP), there is no point in doing it.

It would be interesting (to the community here) if you could post some comparisons between Hisat2 vs tophat which can be referred in future posts if some one else comes here with the same question. However that is not necessary and is at your discretion.

Instead of Cufflinks work flow, you can also look at stringtie, ballgown workflow. In my understanding, tophat-cufflinks-cummerbund are supposed to tuxedo suite protocol 1 and hisat2-stringtie-ballgown is tuxedo suite protocol 2.

ADD REPLY • link 7.0 years ago by cpad0112 21k

0

Entering edit mode

We are also preferring to use Hisat2. I will surely post a general comparison of mapping percentage between TopHat and Hisat2. I am using Tuxedo protocol1 for the time being but will surely try protocol2 as well to see if there is any difference.

Thanks

ADD REPLY • link 7.0 years ago by Sapi • 0

0

Entering edit mode

What options did you use? Did you take care of applying the same rules for reporting an alignment?

ADD REPLY • link 7.0 years ago by Martombo ★ 3.2k

0

Entering edit mode

Hi Martombo, both TopHat AND Hisat2 were performed at default parameter settings.

ADD REPLY • link 7.0 years ago by Sapi • 0